Stratifyd’s NLU models, which include Theme Detection, Theme Summarization, and Emerging Theme Analysis, are intended to run on datasets of at least 500 records. For best performance, datasets should contain between 1,000 and 200,000 records. If your dataset is below the minimum size, adding records is the best way to keep your model from failing. Bringing the dataset into the 1,000 to 200,000 record range also helps ensure that your output is comprehensive and meaningful.
Some users find that their NLUs continue to fail even when the dataset is at or just above the minimum size. There are two common causes.
1. The dataset contains too many null values.
Although a dataset may exceed the 500 record minimum, if many of the records contain null or empty values in the field the NLU runs on, the model will likely fail. Make sure that field has at least 500 non-null values so the model can process successfully.
2. The length of the text records in your dataset is too short.
You may find that the NLU still fails even though your dataset is at or just above the minimum size and has few or no empty records. This may be because the text in the field you run your model on is too short. If your dataset consists primarily of one-word records, and particularly if only a handful of one-word values dominate the dataset, it does not have enough variability for the model to converge. Even if your model does converge, you may notice that your output is sparse: only a few ngrams (or many meaningless ngrams) appear in word clouds, and your topics appear to overlap significantly. Ensuring that the field you wish to analyze contains varied values will aid both successful processing and meaningful output, particularly if the dataset is small.
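Both checks above can be run on your data before you deploy a model. The following Python sketch is not part of Stratifyd. It assumes you have exported the text field you plan to analyze into a list of values, and the function name `summarize_text_field` and the sample data are hypothetical. It counts non-empty records against the 500 record minimum and reports two rough indicators of variability: the share of one-word records and the share of records covered by the five most common values.

```python
from collections import Counter

MIN_RECORDS = 500  # minimum stated in this article


def summarize_text_field(values):
    """Return simple size and variability statistics for one text field."""
    # Treat None, non-string values, and whitespace-only strings as empty.
    texts = [v.strip() for v in values if isinstance(v, str) and v.strip()]
    non_empty = len(texts)
    # Count records that consist of a single word.
    one_word = sum(1 for t in texts if len(t.split()) == 1)
    # Count how often each distinct value occurs (case-insensitive).
    counts = Counter(t.lower() for t in texts)
    top5 = sum(c for _, c in counts.most_common(5))
    return {
        "total_records": len(values),
        "non_empty_records": non_empty,
        "meets_minimum": non_empty >= MIN_RECORDS,
        "one_word_share": one_word / non_empty if non_empty else 0.0,
        "distinct_values": len(counts),
        "top5_share": top5 / non_empty if non_empty else 0.0,
    }


# Hypothetical sample: 600 records, many of them empty or repetitive.
sample = (
    ["good"] * 300
    + ["bad"] * 150
    + [None] * 100
    + [""] * 30
    + ["The checkout page froze when I tried to pay"] * 20
)
report = summarize_text_field(sample)
for name, value in report.items():
    print(f"{name}: {value}")
```

In this sample the dataset has 600 records but only 470 non-empty ones, so it falls below the minimum even though the raw record count exceeds it. A one-word share close to 1 and a top-five share close to 1 both suggest the field may lack the variability the model needs. The article does not define numeric cutoffs for these indicators, so treat them as rough guides only.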
If you have addressed the above concerns and still have issues, a final workaround is to change one of the model settings, the Topic Threshold. The Topic Threshold determines the number of documents categorized into topics in the output. Its default value is 0.3. Lowering the Topic Threshold below 0.3 may help the model process if your dataset is small (but still above the minimum) or lacks sufficient variability. To change the Topic Threshold, open the model settings pane of the NLU you wish to deploy and, under the “Parameters” tab, click “Switch to advanced setup.”