Why use it
This attention neural network model is designed for text classification and is based on the Label-Embedding Attentive Model (LEAM). You could arguably obtain more accurate results with a large pre-trained model such as Google's Bidirectional Encoder Representations from Transformers (BERT) or a deeper neural-network classifier, but those models incur a high computational cost. LEAM offers a high degree of accuracy at relatively low computational cost. The attention mechanism in this model also surfaces the keywords that most influenced each classification result.
If you are unsure of whether this is the best model for your data, use the AutoLearn model to find the algorithm (or ensemble of algorithms) best suited to your data.
How it works
The Embedding attentive model embeds both the words and the labels of each record as vectors in a shared space, then calculates an attention score between each label and every word in the unstructured text record. Words that are closer to a label carry more weight in classification.
This weighting is why the model works well: the attention-weighted keywords yield accurate results at relatively low computational cost. For more details, see LEAM.
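The label-word attention idea can be sketched in a few lines of NumPy. This is a simplified illustration of the LEAM mechanism, not Stratifyd's implementation; the embeddings, sizes, and pooling choices here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 6 words in a document and 3 candidate labels,
# both embedded in the same 8-dimensional space (sizes are illustrative).
emb_dim, n_words, n_labels = 8, 6, 3
words = rng.normal(size=(n_words, emb_dim))    # word embeddings
labels = rng.normal(size=(n_labels, emb_dim))  # label embeddings

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Compatibility matrix: cosine similarity between every label and every word.
G = l2_normalize(labels) @ l2_normalize(words).T   # shape (n_labels, n_words)

# Per-word attention: take each word's strongest label match,
# then softmax over words so the weights sum to 1.
scores = G.max(axis=0)                             # (n_words,)
attn = np.exp(scores) / np.exp(scores).sum()

# Attention-weighted document vector fed to the classifier.
doc_vec = attn @ words                             # (emb_dim,)

print(attn.round(3))   # words nearest a label receive the most weight
```

The high-attention words are exactly the ones the model reports back as keywords, which is how the attention method doubles as an explanation of the classification.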
The model requires one Ground Truth field and at least one Unstructured Text or Training Feature field.
- ground truth: a field that objectively measures how the user feels, such as a star rating
- unstructured text: a field that collects free-form user feedback, that is, text the user types rather than selects from a list
- training feature: a field that collects structured user data, that is, data that is numerical or that is selected from options
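To make the three field types concrete, here is a hypothetical survey record; the field names and values are invented for illustration only.

```python
# Hypothetical record showing the three field types (names are invented).
record = {
    "star_rating": 2,                                       # Ground Truth: an objective score
    "comments": "App crashes whenever I upload a photo.",   # Unstructured Text: free-form feedback
    "plan_type": "Premium",                                 # Training Feature: selected from options
    "months_subscribed": 14,                                # Training Feature: numerical
}

print(record["comments"])
```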
The model returns the following fields for use in widget visualizations.
- keywords: A textual array of the most important words within the data stream. In supervised models, keywords are the terms that the model finds and uses to predict each label. Keywords are only generated if you select an Unstructured text field. If you select only a Training Feature, that is, selectable values, dates, or numbers, no keywords are generated.
- label: The predicted output based on what the model learned from your input data.
- language.code: The two-letter language code of the detected language, e.g. en, fr, ja. See Language detection for more information.
- language.name: The full name of the detected language, e.g. English, French, Japanese.
- tokenized: A list of every word detected in the corpus. Tokenizing is trivial for languages that use spaces between words, but languages without spaces between words, where multi-character words are possible, each require a custom tokenizer.
- unigrams: A textual array of single words within the data stream. Stratifyd calculates the total number of words and the number of unique values. This field is useful in a word cloud filtered on average sentiment.
- Data: A table containing all of the original data from the data stream, plus all of the analyzed data from the model.
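Put together, the per-record output might look like the following. This is a hypothetical sketch of the field shapes described above; the actual values and structure returned by the product may differ.

```python
# Hypothetical shape of the fields returned for one record (values invented).
result = {
    "keywords": ["crashes", "upload", "photo"],          # only if Unstructured Text was selected
    "label": "Bug report",                               # predicted output
    "language": {"code": "en", "name": "English"},       # detected language
    "tokenized": ["app", "crashes", "whenever", "i", "upload", "a", "photo"],
    "unigrams": ["app", "crashes", "whenever", "upload", "photo"],
}

print(result["label"])
```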
To create an Embedding attentive model
You can create a model from within a dashboard, or you can add one to the Models page.
- Create from dashboard
- Create from Models page
You can set the following properties in the Advanced section of the Create a new model or Deploy a new model dialog.
- Minimum Word Count: Set the minimum frequency of a given word to be considered for analysis.
Default value: 2; Minimum value: 2
- Resampling: If your data is skewed towards one class or another, you can resample your data and adjust the class distribution for better results.
Default value: none; Valid values: Over sample, Under sample
- K-fold Cross Validation: Set the number of equal-sized random subsamples to use for cross validation. The Ratio of training to validation set is ignored if you set this.
Default value: 1; Minimum value: 1; Maximum value: 10
See K-Fold Cross Validation in Wikipedia for more information.
- Ratio of training to validation set: The ratio that determines how much of the dataset is used as a training set and how much is used to validate the results.
Default value: 0.2 (80/20); Valid values: 90/10, 80/20, 70/30, 60/40
- Minimum number of records for a field: Set the minimum number of records for a field to be considered for analysis. This prunes any classes with fewer records than the value you set here.
Default value: 0
- Embedding dimension: Set the embedding dimension of each token to capture the relationship between words. Larger datasets can take higher values for a more meaningful analysis, but smaller datasets do not get any benefit from a high embedding dimension value.
Default value: 300; Minimum value: 10; Maximum value: 1000
- Number of epochs: Set the number of times the entire dataset is passed both forward and backward through the neural network.
Default value: 25; Minimum value: 1; Maximum value: 500
- Learning Rate: Set the amount of change to the model in each step of training to determine how quickly or slowly the model learns.
Default value: 1e-2; Minimum value: 1e-14; Maximum value: 100
- Dropout Rate: Set the ratio of randomly selected neurons to ignore during training to prevent overfitting.
Default value: 0.50; Minimum value: 0.001; Maximum value: 0.999
- Run Language Detection: Run language detection on the source text.
Default value: true
- Default Language: Assume this language if language detection fails. This is used to select a language-specific stopword list to apply when cleaning the text.
Default value: English
- Stopwords: Apply custom stopword lists to remove non-informative words from categorization.
See Customize stopwords dictionaries for more information.
- Chinese Dictionary: Customize Chinese tokens that our engine uses when creating key n-grams in your analysis.
See Customize Chinese token dictionaries for more information.
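Several of these settings correspond to standard preprocessing steps. The NumPy sketch below illustrates, on an invented toy dataset, how Minimum Word Count pruning, the 80/20 training/validation split, and Over sample resampling might operate; it is an illustration of the concepts, not Stratifyd's implementation.

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(42)

docs = [
    "slow app slow login",
    "great app love it",
    "login fails again",
    "love the new design",
]
labels = np.array([0, 1, 0, 1])  # invented class labels

# Minimum Word Count: keep only words that appear at least `min_count` times.
min_count = 2
counts = Counter(w for d in docs for w in d.split())
vocab = {w for w, c in counts.items() if c >= min_count}
pruned = [[w for w in d.split() if w in vocab] for d in docs]

# Ratio of training to validation set: 0.2 means 80% train / 20% validation.
val_ratio = 0.2
idx = rng.permutation(len(docs))
n_val = max(1, int(len(docs) * val_ratio))
val_idx, train_idx = idx[:n_val], idx[n_val:]

# Over sample: duplicate minority-class records until the classes balance.
train_labels = labels[train_idx]
by_class = {c: train_idx[train_labels == c] for c in np.unique(train_labels)}
target = max(len(v) for v in by_class.values())
balanced = np.concatenate(
    [rng.choice(v, size=target, replace=True) for v in by_class.values()]
)

print(sorted(vocab))               # words that survive pruning
print(len(train_idx), len(val_idx))
```

Under sample would instead shrink the majority class to the minority-class size; both adjust the class distribution the model trains on, as described above.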