What is Buzzword Analysis?
Buzzword Analysis provides the raw n-gram data for the Unsupervised NLU model. It turns every text field into a list of n-grams by grouping terms that appear together.
It creates structure around your unstructured textual data by identifying word co-occurrences that are the most relevant within the entire corpus. Once it generates n-grams for every record, the Unsupervised NLU model scores the n-grams according to how strongly the grouped terms correlate.
The most commonly used n-gram is the bigram, a two-word co-occurrence, but when you create the model, you can change the N-Gram Length property to 3 or 4 in the advanced settings. Another advanced setting specific to this model is Minimum Word Count. Set to 2 by default, it specifies the minimum number of times a word must occur within a record to be considered for n-gram analysis.
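To make the two settings concrete, here is a minimal Python sketch of n-gram generation for a single record. It is not Stratifyd's implementation: the function name build_ngrams, the whitespace tokenizer, and the rule that every word in an n-gram must meet the Minimum Word Count are assumptions made for illustration.

```python
from collections import Counter


def build_ngrams(text, n=2, min_word_count=2):
    """Return the n-grams in one record whose words all meet the frequency floor.

    Hypothetical helper: `n` stands in for N-Gram Length and `min_word_count`
    for Minimum Word Count. The filtering rule is an assumption.
    """
    tokens = text.lower().split()  # naive whitespace tokenizer, illustration only
    counts = Counter(tokens)
    ngrams = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        # Keep the n-gram only if every word occurs often enough in this record.
        if all(counts[word] >= min_word_count for word in window):
            ngrams.append(" ".join(window))
    return ngrams


record = "slow shipping again and slow shipping costs more"
print(build_ngrams(record))       # ['slow shipping', 'slow shipping']
print(build_ngrams(record, n=3))  # []
```

In this sketch, raising the n-gram length to 3 returns nothing for such a short record, which shows why longer n-grams need more text to be useful.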
The n-grams returned by the Unsupervised NLU model are generated by the same Buzzword Analysis, but word clouds from the Unsupervised NLU model look different from raw n-gram data because that model applies categorization and sentiment analysis.
The default visualization for the n-gram generator is a list rather than a word cloud because we add the n-grams for each piece of text back to the associated data record. If you change the visualization for the n-gram generator to a cloud, you may notice that the raw data looks different from the Unsupervised NLU model's n-gram in the following ways, depending on your input corpus.
The NLU model n-gram automatically uses a range of colors based on sentiment analysis (red for negative, grey for neutral, and blue for positive). The raw n-gram data uses the default color palette, but you can change this by dragging a dimension to the Color Dimension box.
The NLU model n-gram varies the sizes of the n-grams based on the importance of the topics discovered in categorization while the raw n-gram data does not.
The NLU model n-gram puts important n-grams in the forefront, while the raw n-gram data is sorted alphabetically.
Why use Buzzword Analysis?
While the Unsupervised NLU model provides a richer n-gram feature, if your volume of data is too low, the Unsupervised NLU model may fail. The machine learning (ML) portion of that model requires a certain amount of data in order to converge. The Buzzword Analysis allows you to use a word cloud visualization even with limited data.
The amount of data that the Unsupervised NLU model requires depends on the nature of your data. For example, if your documents are long, like news articles, you need fewer of them. However, if the documents are very short, the model might still fail even with a large number of documents.
If the data is very limited, the Buzzword Analysis returns only unigrams. You can use unigrams in a word cloud visualization, but the usefulness is limited.
One Text field is required. Any field that you choose is mapped as a text field. If you choose multiple fields, the model concatenates all of their text and treats it as a single block of text for each record.
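The concatenation of multiple text fields can be pictured with a short Python sketch. The helper name combine_text_fields, the sample field names, and the single-space separator are all assumptions; the product's actual separator and handling of empty fields are not documented here.

```python
def combine_text_fields(record, text_fields):
    """Join the chosen fields of one record into a single text value.

    Hypothetical helper: skips missing or empty fields and joins the rest
    with a single space (an assumed separator).
    """
    parts = [str(record[name]).strip() for name in text_fields if record.get(name)]
    return " ".join(parts)


# Sample record with made-up field names.
review = {"title": "Late delivery", "body": "Package arrived late.", "notes": None}
print(combine_text_fields(review, ["title", "body", "notes"]))
# Late delivery Package arrived late.
```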
The model returns the following fields for use in widget visualizations.
language.code: The two-letter language code of the detected language, e.g. en, fr, jp. See Language Detection for more information.
language.name: The full name of the detected language, e.g. English, French, Japanese.
ngrams: Sets of words (two by default) that appear together frequently in the corpus.
tokenized: A list of every word detected in the corpus. Tokenizing is trivial for languages that separate words with spaces, but languages that do not use spaces and that allow multi-character words each require a custom tokenizer.
translated.language.code*: The two-letter language code of the translated (Translate To) language, e.g. en, fr, jp.
translated.language.name*: The full name of the translated (Translate To) language, e.g. English, French, Japanese.
translated.ngrams*: Sets of translated words (two by default) that appear together frequently in the corpus. Note that this is untranslated if you use Translate Model Output.
translated.text*: The full verbatim text as translated.
translated.unigrams*: A textual array of single translated words within the data stream. Stratifyd calculates the total number of words and the number of unique values. Useful in a word cloud viewed with filters on average sentiment.
unigrams: A textual array of single words within the data stream. Stratifyd calculates the total number of words and the number of unique values. Useful in a word cloud viewed with filters on average sentiment.
Data: A table containing all of the original data from the data stream, plus all of the analyzed data from the model.
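Two of the fields above, tokenized and unigrams, can be illustrated with a small Python sketch. It uses plain whitespace splitting, which is an assumption for illustration and not the product's tokenizer, to show why languages written without spaces need a custom tokenizer. The helper unigram_stats is hypothetical and mirrors the total and unique word counts described for the unigrams field.

```python
from collections import Counter


def whitespace_tokenize(text):
    """Split on whitespace; adequate only for languages that separate words with spaces."""
    return text.split()


def unigram_stats(tokens):
    """Return (total number of words, number of unique words) for a token list."""
    return len(tokens), len(Counter(tokens))


english = "the delivery was late"
chinese = "快递送晚了"  # written without spaces between words

print(whitespace_tokenize(english))  # four separate tokens
print(whitespace_tokenize(chinese))  # one unsplit token: a custom tokenizer is needed
print(unigram_stats(whitespace_tokenize("late delivery late refund")))  # (4, 3)
```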
How to use Buzzword Analysis
1. You can create a model from within a workspace, or you can add one from the Models tab. Here, we create it from within a workspace by selecting the workspace tab in the navigation pane.
2. To access the Data Settings panel, click the Data settings button, which is marked with the gear icon.
In the data settings panel that appears, you'll see a list of all data connections for that workspace. Make sure you have selected the data stream you want to work with in the Connected column.
3. In the Analyze tab, expand the section labelled Other to see available models. Choose Buzzword Analysis by clicking the + icon.
4. For the model to run successfully, you'll need to choose a text dimension. Make your selection and click Start Analysis.
Tip: See the section below for details on the advanced setup options available to you.
5. Depending on the size and complexity of your data, it may take some time for the analysis to finish running. When you return to the Data Settings, you'll see this analysis within the Deployed section at the top of the Analyze tab.
You can set the following properties in the Advanced Setup section of the Create a new model or Deploy a new model dialog.
N-Gram Length: Set the number of tokens to use when building buzzwords.
Default value: 2; Minimum value: 2; Maximum value: 4
Minimum Word Count: Set the minimum number of times that a word must occur within a record to be considered for analysis.
Default value: 2; Minimum value: 2
Default Language: Set the default language to assume if the language is not detectable when applying a language-specific stopword list to clean the text.
Default value: en (English); Valid values: the two-letter language code for any supported language
Token Threshold: Set the number of tokens to use during analysis. Lower the threshold to allow more tokens or raise it to limit the tokens.
Default value: 0; Minimum value: -3; Maximum value: 3
Apply training filter to analysis: Set to true to apply your custom data training filter to your analysis results.
Default value: true
Run Language Detection: Set to true to run language detection on the source text.
Default value: true
Stopwords: Apply custom stopword lists to remove non-informative words from categorization.
See Customize stopwords lists for more information.
Chinese Dictionary: Customize Chinese tokens that our engine uses when creating key n-grams in your analysis.
Schedule Model Retrain: Specify the number of days, weeks, months, or years after which to retrain your model.
Custom Filters: Define a custom data training filter to refine the data returned.
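The defaults and valid ranges listed above can be summarized in a short Python sketch. The class BuzzwordSettings and its attribute names are hypothetical: these properties are set in the Advanced Setup dialog, not through this code, and treating Token Threshold as a number that may be fractional is an assumption.

```python
from dataclasses import dataclass


@dataclass
class BuzzwordSettings:
    """Hypothetical container mirroring the documented defaults and limits."""

    ngram_length: int = 2            # N-Gram Length: 2 to 4
    min_word_count: int = 2          # Minimum Word Count: 2 or more
    default_language: str = "en"     # two-letter language code
    token_threshold: float = 0       # Token Threshold: -3 to 3
    apply_training_filter: bool = True
    run_language_detection: bool = True

    def __post_init__(self):
        # Reject values outside the ranges listed in the Advanced Setup section.
        if not 2 <= self.ngram_length <= 4:
            raise ValueError("N-Gram Length must be between 2 and 4")
        if self.min_word_count < 2:
            raise ValueError("Minimum Word Count must be at least 2")
        if not -3 <= self.token_threshold <= 3:
            raise ValueError("Token Threshold must be between -3 and 3")
        if len(self.default_language) != 2:
            raise ValueError("Default Language must be a two-letter code")


settings = BuzzwordSettings(ngram_length=3, token_threshold=-1)
print(settings)
```

Checking the ranges in one place like this makes it easy to see which combinations the dialog would accept before you start an analysis.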