What is an N-Gram Generator?
The N-Gram Generator provides the raw n-gram data for the Unsupervised NLU model. It turns every text field into a list of n-grams by grouping terms that appear together.
It creates structure around your unstructured textual data by identifying word co-occurrences that are the most relevant within the entire corpus. Once it generates n-grams for every record, the Unsupervised NLU model scores the n-grams according to how strongly the grouped terms correlate.
The most commonly used n-gram is the bigram, a two-word co-occurrence, but when you create the model, you can change the N-Gram Length property to 3 or 4 in the advanced settings. Another advanced setting specific to this model is Minimum Word Count. Set to 2 by default, it specifies the minimum number of times a word must occur within a record to be considered for n-gram analysis.
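As a rough illustration of how N-Gram Length and Minimum Word Count interact, the following Python sketch builds n-grams from one record. It is not the product's implementation: the function name `generate_ngrams`, the lowercase whitespace tokenization, and the rule that every word in an n-gram must meet the minimum count are all assumptions made for the example.

```python
from collections import Counter


def generate_ngrams(text, ngram_length=2, minimum_word_count=2):
    """Return adjacent-word n-grams for one record (illustrative only).

    Assumption: an n-gram is kept only when every word in it occurs at
    least `minimum_word_count` times in the record.
    """
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    counts = Counter(tokens)
    ngrams = []
    # Slide a window of `ngram_length` words across the token list.
    for i in range(len(tokens) - ngram_length + 1):
        window = tokens[i:i + ngram_length]
        if all(counts[word] >= minimum_word_count for word in window):
            ngrams.append(" ".join(window))
    return ngrams


bigrams = generate_ngrams("slow app slow app crashes")
# "crashes" occurs only once, so "app crashes" is dropped:
# bigrams == ["slow app", "app slow", "slow app"]
```

Passing `ngram_length=3` to the same hypothetical function yields three-word groups instead of pairs.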
The n-grams returned by the Unsupervised NLU model are generated by the same N-Gram Generator, but word clouds from the Unsupervised NLU model look different from raw n-gram data because that model applies categorization and sentiment analysis.
The default visualization for the n-gram generator is a list rather than a word cloud because we add the n-grams for each piece of text back to the associated data record. If you change the visualization for the n-gram generator to a cloud, you may notice that the raw data looks different from the Unsupervised NLU model's n-gram in the following ways, depending on your input corpus.
- The NLU model n-gram automatically uses a range of colors based on sentiment analysis (red for negative, grey for neutral, and blue for positive). The raw n-gram data uses the default color palette, but you can drag a dimension to the Color Dimension box.
- The NLU model n-gram varies the sizes of the n-grams based on the importance of the topics discovered in categorization while the raw n-gram data does not.
- The NLU model n-gram puts important n-grams in the forefront, while the raw n-gram data is ordered alphabetically.
Why use the N-Gram Generator?
While the Unsupervised NLU model provides a richer n-gram feature, if your volume of data is too low, the Unsupervised NLU model may fail. The machine learning (ML) portion of that model requires a certain amount of data in order to converge. The N-Gram Generator allows you to use a word cloud visualization even with limited data.
The amount of data that the Unsupervised NLU model requires depends on the data that you have. For example, if your documents are long, like news articles, you need fewer documents. However, if the documents are very short, the model might still fail even with a large number of documents.
If the data is very limited, the N-Gram Generator returns only unigrams. You can use unigrams in a word cloud visualization, but the usefulness is limited.
One Text field is required. Any field that you choose is mapped as a text field. If you choose multiple fields, the model concatenates all of the text and treats it as a single mass of text for each record.
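The effect of choosing multiple fields can be pictured with a short Python sketch. The helper `combine_text_fields`, the record layout, and the single-space separator are hypothetical and only show the idea of merging several fields into one body of text per record.

```python
def combine_text_fields(record, text_fields, separator=" "):
    """Join the chosen text fields of one record into a single string.

    Empty or missing fields are skipped. The separator is an assumption;
    the source does not say how the product joins the fields.
    """
    parts = [str(record[field]) for field in text_fields if record.get(field)]
    return separator.join(parts)


# Hypothetical record with two text fields and one non-text field.
review = {"title": "Late delivery", "body": "Package arrived late", "rating": 2}
combined = combine_text_fields(review, ["title", "body"])
# combined == "Late delivery Package arrived late"
```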
The model returns the following fields for use in widget visualizations.
- language.code: The two-letter language code of the detected language, e.g. en, fr, jp. See Language Detection for more information.
- language.name: The full name of the detected language, e.g. English, French, Japanese.
- ngrams: Sets of words (two by default) that appear together frequently in the corpus.
- tokenized: A list of every word detected in the corpus. Tokenizing is trivial for languages that use spaces between words, but languages that have no spaces between words and allow multi-character words each require a custom tokenizer.
- translated.language.code*: The two-letter language code of the translated (Translate To) language, e.g. en, fr, jp.
- translated.language.name*: The full name of the translated (Translate To) language, e.g. English, French, Japanese.
- translated.ngrams*: Sets of translated words (two by default) that appear together frequently in the corpus. Note that this field is untranslated if you use Translate Model Output.
- translated.text*: The full verbatim text as translated.
- translated.unigrams*: A textual array of single translated words within the data stream. Stratifyd calculates the total number of words and the number of unique values. Useful in a word cloud viewed with filters on average sentiment.
- unigrams: A textual array of single words within the data stream. Stratifyd calculates the total number of words and the number of unique values. Useful in a word cloud viewed with filters on average sentiment.
- Data: A table containing all of the original data from the data stream, plus all of the analyzed data from the model.
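The total and unique word counts mentioned for the unigrams fields can be sketched in Python as follows. The record shape (a dictionary with a `unigrams` list) and the function name `unigram_stats` are assumptions for illustration; the actual structure of the returned data may differ.

```python
def unigram_stats(records):
    """Count total and unique words across the `unigrams` arrays of records."""
    all_words = [word for record in records for word in record.get("unigrams", [])]
    return {"total": len(all_words), "unique": len(set(all_words))}


# Hypothetical records carrying only the field used in this sketch.
sample_records = [
    {"unigrams": ["battery", "life", "battery"]},
    {"unigrams": ["screen", "battery"]},
]
stats = unigram_stats(sample_records)
# stats == {"total": 5, "unique": 3}
```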
To create an n-gram data model
You can create the data model on the Models tab of the Home page or from within a dashboard. Here are the steps to create one from within a dashboard.
1. Open the dashboard to which you want to add the model.
2. Click the Data icon and in the Data Ingestion panel, expand your data stream and click the default Structured Data Analysis.
3. In the Edit Data Streams dialog that appears, above the Deployed Models list, click Deploy Model.
4. In the Deploy Model dialog that appears, scroll down to Basic Analyses and click N-Gram Generator.
5. In the Deploy a New Model wizard that appears, in the Unassigned Fields column, click the plus sign of the text field to use and click text to add it to the Assigned fields column, then click Next.
6. On the Complete and Submit page of the dialog, optionally change the Name, add Tags, and specify a Description for the model. These fields appear on the tile for the model.
7. Optionally scroll down and expand the Advanced section to set any of the options described below.
You can set the following properties in the Advanced section of the Create a new model or Deploy a new model dialog.
- N-Gram Length: Set the number of tokens to use when building buzzwords.
Default value: 2; Minimum value: 2; Maximum value: 4
- Minimum Word Count: Set the minimum number of times that a word must occur to be considered for analysis.
Default value: 2; Minimum value: 2
- Default Language: Set the default language to assume if the language is not detectable when applying a language-specific stopword list to clean the text.
Default value: en (English); Valid values: the two-letter language code for any supported language
- Token Threshold: Set the number of tokens to use during analysis. Lower the threshold to allow more tokens or raise it to limit the tokens.
Default value: 0; Minimum value: -3; Maximum value: 3
- Apply training filter to analysis: Set to true to apply your custom data training filter to your analysis results.
Default value: true
- Run Language Detection: Set to true to run language detection on the source text.
Default value: true
- Stopwords: Apply custom stopword lists to remove non-informative words from categorization.
See Customize stopwords lists for more information.
- Chinese Dictionary: Customize Chinese tokens that our engine uses when creating key n-grams in your analysis.
- Schedule Model Retrain: Specify the number of days, weeks, months, or years after which to retrain your model.
- Custom Filters: Define a custom data training filter to refine the data returned.
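The defaults and limits listed above can be summarized in a small Python sketch that checks a set of values before you enter them in the dialog. The class `NGramSettings`, its attribute names, and the `validate` method are hypothetical and are not part of any Stratifyd API; only the default, minimum, and maximum values come from the list above. Treating Token Threshold as any number in the range is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class NGramSettings:
    """Hypothetical container for the documented advanced properties."""
    ngram_length: int = 2            # documented range: 2 to 4
    minimum_word_count: int = 2      # documented minimum: 2
    default_language: str = "en"     # two-letter language code
    token_threshold: float = 0       # documented range: -3 to 3
    apply_training_filter: bool = True
    run_language_detection: bool = True

    def validate(self):
        """Return a list of problems; an empty list means all values are in range."""
        errors = []
        if not 2 <= self.ngram_length <= 4:
            errors.append("N-Gram Length must be between 2 and 4")
        if self.minimum_word_count < 2:
            errors.append("Minimum Word Count must be at least 2")
        if len(self.default_language) != 2 or not self.default_language.isalpha():
            errors.append("Default Language must be a two-letter code")
        if not -3 <= self.token_threshold <= 3:
            errors.append("Token Threshold must be between -3 and 3")
        return errors


# The documented defaults pass; an out-of-range N-Gram Length does not.
default_problems = NGramSettings().validate()
trigram_problems = NGramSettings(ngram_length=5).validate()
```

Here `default_problems` is empty, and `trigram_problems` contains one message about N-Gram Length.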