Why to use it
The Random Forest Model is helpful when you have an unbalanced dataset. It tends to handle such data better than many other methods because it provides importance weights. It also works well with data that contains categorical fields, and its decision boundary is non-linear.
If you are unsure whether you have an unbalanced dataset, use the AutoLearn model to find the algorithm best suited to your data.
How it works
The Random Forest Model builds a multitude of uncorrelated basic decision trees. Uncorrelated decision trees are those that have no mutual relationship, that is, a change in one does not affect the other.
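As a rough illustration of how trees can be kept uncorrelated, the sketch below gives each tree its own random sample of records and its own random subset of fields. This is the commonly described bagging approach and is an assumption here; the source does not state how the product decorrelates its trees. The field names and helper functions are hypothetical.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (a bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(feature_names, k, rng):
    """Pick k distinct features for one tree to consider."""
    return rng.sample(feature_names, k)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(10))  # stand-in for 10 training records
features = ["rating", "channel", "region", "product"]  # hypothetical field names

# Each tree sees a different sample of rows and a different subset of features,
# so a change in one tree's input does not affect the others.
tree_inputs = [
    (bootstrap_sample(rows, rng), random_feature_subset(features, 2, rng))
    for _ in range(3)
]
for sample, subset in tree_inputs:
    print(sorted(sample), subset)
```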
A basic decision tree works through the data to classify it into groups that are as different from each other as possible, and that have members that are as similar to each other as possible. Each decision tree produces a prediction for the forest, and the forest in turn produces a prediction based on the combined wisdom of all of the trees.
This works well because, even though individual decision trees may make errors, the majority of them tend to be right in most cases. For more information on these concepts, see Understanding Random Forest on Towards Data Science.
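The voting idea can be sketched in a few lines. The example below is a simplified illustration, not the product's implementation: it assumes a two-class problem and trees that err independently of each other, and the function names are hypothetical.

```python
from collections import Counter
from math import comb

def majority_vote(tree_predictions):
    """Return the label predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def majority_accuracy(n_trees, p_correct):
    """Probability that more than half of n independent trees are right.

    Assumes a two-class problem and that every tree is correct with the
    same probability p_correct, independently of the others.
    """
    need = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(need, n_trees + 1)
    )

# Two of three trees say "positive", so the forest predicts "positive".
print(majority_vote(["positive", "negative", "positive"]))

# A forest of 11 trees that are each right 60% of the time.
print(round(majority_accuracy(11, 0.6), 3))
```

Under these assumptions the forest of 11 trees is right more often than any single tree. Real trees are never fully independent, so the gain in practice is smaller.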
The model requires one Ground Truth field and at least one Unstructured Text or Training Feature field.
- ground truth: a field that objectively measures how the user feels, such as a star rating
- unstructured text: a field that collects free-form user feedback, that is, text that users type rather than select from a list
- training feature: a field that collects structured user data, that is, data that is numerical or that users select from a set of options
The model generates the following output fields.
- keywords: A textual array of the most important words within the data stream. In supervised models, keywords are the terms that the model finds and uses to predict each label. Keywords are only generated if you select an Unstructured Text field. If you select only a Training Feature field, that is, selectable values, dates, or numbers, no keywords are generated.
- label: The predicted output based on what the model learned from your input data.
- language.code: The two-letter language code of the detected language, e.g. en, fr, ja. See Language detection for more information.
- language.name: The full name of the detected language, e.g. English, French, Japanese.
- tokenized: A list of every word detected in the corpus. This is trivial for languages that use spaces between words. Languages that do not separate words with spaces, and in which multi-character words are possible, each require a custom tokenizer.
- unigrams: A textual array of single words within the data stream. Stratifyd calculates the total number of words and the number of unique values. This is useful in a word cloud viewed with filters on average sentiment.
- Data: A table containing all of the original data from the data stream, plus all of the analyzed data from the model.
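To make the tokenized and unigrams fields more concrete, here is a minimal sketch of whitespace tokenization and unigram counting. It assumes a space-delimited language and ignores punctuation and stopwords. The function names and sample feedback are hypothetical, and the product's own tokenizer may behave differently.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; only suitable for space-delimited languages."""
    return text.lower().split()

def unigram_stats(texts):
    """Return (total word count, unique word count, per-word counts)."""
    counts = Counter(word for text in texts for word in tokenize(text))
    return sum(counts.values()), len(counts), counts

feedback = ["Great service", "Slow service and slow app"]  # made-up sample records
total, unique, counts = unigram_stats(feedback)
print(total, unique)      # 7 5
print(counts["service"])  # 2
```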
To create a Random Forest Model
You can create a model from within a dashboard, or you can add one to the Models page.
- Create from dashboard
- Create from Models page
You can set the following properties in the Advanced section of the Create a new model or Deploy a new model dialog.
- Resampling: If your data is skewed towards one class or another, you can resample your data and adjust the class distribution for better results.
Default value: none; Valid values: Over sample, Under sample
- Minimum Word Count: Set the minimum frequency of a given word to be considered for analysis.
Default value: 2; Minimum value: 2
- K-fold Cross Validation: Set the number of equal-sized random subsamples to use for cross validation. If you set this, the Ratio of training to validation set property is ignored.
Default value: 1; Minimum value: 1; Maximum value: 10
- Ratio of training to validation set: The ratio that determines how much of the dataset is used as a training set and how much is used to validate the results.
Default value: 0.2 (80/20); Valid values: 90/10, 80/20, 70/30, 60/40
- Minimum number of records for a field: Set the minimum number of records for a field to be considered for analysis. This prunes any classes with fewer records than the value you set here.
Default value: 0
- Model Generalization: Set the number of classifiers (decision trees) used to assemble the final model.
Default value: 10; Minimum value: 1; Maximum value: 100
- Depth of Classifier: Set the maximum depth of classifiers (decision trees).
Default value: 15; Minimum value: 1
- Run Language Detection: Run language detection on the source text.
Default value: true
- Default Language: Assume this language if language detection fails. This is used to select the language-specific stopword list that is applied when cleaning text.
Default value: English
- Stopwords: Apply custom stopword lists to remove non-informative words from categorization.
See Customize stopwords dictionaries for more information.
- Chinese Dictionary: Customize Chinese tokens that our engine uses when creating key n-grams in your analysis.
See Customize Chinese token dictionaries for more information.
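As an illustration of what the Resampling and K-fold Cross Validation properties above do conceptually, the sketch below balances classes by random over- or under-sampling and splits records into folds. It is a simplified sketch under stated assumptions, not the product's implementation: the function names, the "over"/"under" mode strings, and the sample labels are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def resample(records, labels, mode, seed=0):
    """Balance classes by random resampling. mode is "over" or "under"."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for record, label in zip(records, labels):
        by_class[label].append(record)
    sizes = [len(group) for group in by_class.values()]
    # Over-sampling grows every class to the largest size;
    # under-sampling shrinks every class to the smallest size.
    target = max(sizes) if mode == "over" else min(sizes)
    out_records, out_labels = [], []
    for label, group in by_class.items():
        if len(group) >= target:
            chosen = rng.sample(group, target)  # draw without replacement
        else:
            # keep the originals and add random duplicates
            chosen = group + [rng.choice(group) for _ in range(target - len(group))]
        out_records.extend(chosen)
        out_labels.extend([label] * target)
    return out_records, out_labels

def k_fold_indices(n_records, k, seed=0):
    """Shuffle record indices and split them into k folds of near-equal size."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

records = list(range(10))
labels = ["5-star"] * 8 + ["1-star"] * 2  # skewed towards one class
_, over_labels = resample(records, labels, "over")
_, under_labels = resample(records, labels, "under")
print(Counter(over_labels), Counter(under_labels))

folds = k_fold_indices(len(records), 5)
for validation in folds:
    # Each fold is used once for validation; the rest is the training set.
    training = [i for fold in folds if fold is not validation for i in fold]
    print(len(training), len(validation))
```

With five folds over ten records, each round trains on eight records and validates on two, so every record is used for validation exactly once.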