Why to use it
The Support Vector Machine, also known as SVM or support-vector network, is a classification algorithm that fits a decision boundary, linear or non-linear depending on the kernel function you select, to assign each record to one of two or more categories. It returns keywords used in its predictions if your training field contains text.
SVM works well with data whose features have a meaningful notion of distance. Scaling your features is recommended if you use this method.
Support vector machines work well in high dimensional spaces (for datasets with many attributes), and are even effective when you have more dimensions than samples. Also, since the decision function uses a subset of training points (support vectors), it is memory efficient.
Do not use SVM for streams containing more than 10,000 records, because this model has a high computational cost with quadratic time complexity. If the number of features in your data is much greater than the number of records, choose your kernel function carefully to avoid over-fitting. If you are unsure which model to use, the AutoLearn model can find the algorithm best suited to your data.
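Stratifyd's engine is proprietary, but the feature-scaling advice above can be sketched with scikit-learn (an illustrative assumption, not the product's actual code). Standardizing features before fitting keeps any one feature from dominating the distance calculations the SVM relies on:

```python
# Illustrative sketch only: scaling features before fitting an SVM
# with scikit-learn, using the built-in iris dataset as stand-in data.
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# StandardScaler gives every feature zero mean and unit variance,
# so no single feature dominates the margin computation.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X, y)
print(model.score(X, y))  # training accuracy
```

The pipeline applies the same scaling parameters learned on the training data to any future data you predict on, which is exactly why scaling belongs inside the pipeline rather than as a one-off preprocessing step.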
How it works
The support vector machine algorithm finds hyperplanes in N-dimensional space (where N is the number of features) that clearly separate data points into classes. Among all such hyperplanes, the algorithm chooses the one with the maximum margin, that is, the greatest distance to the nearest data points of each class. This margin ensures that we can classify future data points with a higher level of certainty.
For more information on this concept, see the following articles.
- Support-Vector Machine — Introduction to Machine Learning Algorithms in Towards Data Science
- Support Vector Machines in scikit-learn
- Support-vector machine in Wikipedia
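The memory efficiency mentioned earlier comes from the fact that only the training points nearest the hyperplane, the support vectors, define the decision function. A small sketch using scikit-learn (assumed here as a stand-in for the concept, not the product's internals) makes this visible:

```python
# Sketch: after fitting, only the points closest to the separating
# hyperplane are retained as support vectors.
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable classes in 2-D.
X = np.array([[0, 0], [1, 1], [1, 0], [3, 3], [4, 4], [4, 3]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)  # the boundary-defining points only
print(clf.n_support_)        # count of support vectors per class
```

Even on larger datasets, the fitted model stores only these support vectors, which is why the decision function stays compact.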
Kernel functions define a notion of similarity with little computational cost even in very high-dimensional spaces.
- RBF (default): The RBF (radial basis function) is an all-purpose kernel for unknown data.
- Linear: The Linear kernel is useful when your features are sparse.
- Poly: The Poly (polynomial) kernel is commonly used in image processing.
- Sigmoid: The Sigmoid (hyperbolic tangent) kernel is a kind of proxy for neural networks.
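The four kernels above correspond to scikit-learn's `SVC` kernel names, so you can compare them on a toy non-linear dataset (an illustrative sketch; the dataset and scores are not from the product):

```python
# Sketch: comparing the four kernel functions on a non-linear
# two-class dataset (interleaving half-moons).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
    print(kernel, round(scores[kernel], 3))
```

On curved class boundaries like this, the all-purpose RBF kernel typically outperforms the linear kernel, which matches the guidance above to start with RBF when the data is unknown.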
The model requires one Ground Truth field and at least one Unstructured Text or Training Feature field.
- ground truth: a field that directly records how the user feels, such as a star rating
- unstructured text: a field that collects free-form user feedback, that is, text the user types rather than selects from a list
- training feature: a field that collects structured user data, that is, data that is numerical or that is selected from options
The model returns the following fields for use in widget visualizations.
- keywords: A textual array of the most important words within the data stream. In supervised models, keywords are the terms that the model finds and uses to predict each label. Keywords are only generated if you select an Unstructured text field. If you select only a Training Feature, that is, selectable values, dates, or numbers, no keywords are generated.
- label: The predicted output based on what the model learned from your input data.
- language.code: The two-letter language code of the detected language, e.g. en, fr, ja. See Language detection for more information.
- language.name: The full name of the detected language, e.g. English, French, Japanese.
- tokenized: A list of every word detected in the corpus. Tokenization is trivial for languages that separate words with spaces, but languages that do not use spaces between words, and in which multi-character words are possible, each require a custom tokenizer.
- unigrams: A textual array of single words within the data stream. Stratifyd calculates the total number of words and the number of unique values. Useful in a word cloud viewed with filters on average sentiment.
- Data: A table containing all of the original data from the data stream, plus all of the analyzed data from the model.
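As a rough sketch of how the unigrams field's counts are derived (a plain-Python illustration under the assumption of whitespace tokenization, not the product's tokenizer):

```python
# Sketch: counting total words and unique values from free-form text.
from collections import Counter

reviews = [
    "great battery life and great screen",
    "battery died after a week",
]

# Simple whitespace tokenization; languages without spaces between
# words would need a custom tokenizer instead (see the tokenized field).
tokens = [word for text in reviews for word in text.lower().split()]
unigrams = Counter(tokens)

print(len(tokens))             # total number of words
print(len(unigrams))           # number of unique values
print(unigrams.most_common(3))
```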
To create a Support Vector Machine model
You can create a model from within a dashboard, or you can add one to the Models page.
- Create from dashboard
- Create from Models page
You can set the following properties in the Advanced section of the Create a new model or Deploy a new model dialog.
- Minimum Word Count: Set the minimum frequency of a given word to be considered for analysis.
Default value: 2; Minimum value: 2
- Resampling: If your data is skewed towards one class or another, you can resample your data and adjust the class distribution for better results.
Default value: none; Valid values: Over sample, Under sample
- Ratio of training to validation set: The ratio that determines how much of the dataset is used as a training set and how much is used to validate the results.
Default value: 0.2 (80/20); Valid values: 90/10, 80/20, 70/30, 60/40
- Minimum number of records for a field: Set the minimum number of records for a field to be considered for analysis. This prunes any classes with fewer records than the value you set here.
Default value: 0; Minimum value: 0; Maximum value: 50
- SVM Kernel: Select the kernel function to use in your model. See SVM Kernel for more information.
Default value: RBF; Valid values: Linear, Poly, RBF, Sigmoid
- Run Language Detection: Run language detection on the source text.
Default value: true
- Default Language: Assume this language if language detection fails. This is used to select a language-specific stopword list to apply to clean text.
Default value: English
- Stopwords: Apply custom stopword lists to remove non-informative words from categorization.
See Customize stopwords dictionaries for more information.
- Chinese Dictionary: Customize Chinese tokens that our engine uses when creating key n-grams in your analysis.
See Customize Chinese token dictionaries for more information.
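Several of the Advanced properties above have rough open-source analogues, which can help build intuition for what each setting does. The sketch below maps three of them onto scikit-learn (an assumed analogy, not the product's implementation): Minimum Word Count resembles the vectorizer's `min_df`, the 80/20 ratio resembles `test_size=0.2`, and SVM Kernel resembles the `kernel` parameter.

```python
# Sketch: rough scikit-learn analogues of three Advanced properties.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

texts = ["good product", "bad service", "good value",
         "bad product", "good service", "bad value"]
labels = [1, 0, 1, 0, 1, 0]

# Minimum Word Count ~ min_df: drop words seen in fewer than 2 records.
vec = TfidfVectorizer(min_df=2)
X = vec.fit_transform(texts)

# Ratio of training to validation set ~ test_size: 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0)

# SVM Kernel ~ the kernel parameter (RBF is the default in both).
clf = SVC(kernel="rbf").fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

Resampling has no direct equivalent in scikit-learn itself; over- and under-sampling of skewed classes is the same idea implemented by libraries such as imbalanced-learn.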