Why to use it
The Logistic Regression Model, also known as a logistic model or a logit model, is a baseline prediction algorithm that uses a linear decision boundary to estimate the probability of a record falling into one of two or more categories. The model provides feature importance weights and, if your training field contains text, returns the keywords it used to make its predictions.
Compared with other algorithms, the computational cost of this model is low.
If you are unsure of which model to use, the AutoLearn model can find the algorithm best suited to your data.
How it works
In its basic form, the Logistic Regression Model uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model; it is a form of binomial regression.
Historically, an early artificial neuron was called a perceptron. It took any number of inputs, weighted each input to express its importance, and produced a single binary output of 0 or 1, depending on whether the weighted sum of the inputs was greater or less than a threshold value. A logistic regression model works much like a single such neuron, except that instead of applying a hard threshold it passes the weighted sum through the logistic (sigmoid) function, producing a probability between 0 and 1 that the record belongs to a given class. (By contrast, the Feedforward Neural Network model stacks multiple layers of these neurons: an input layer, a hidden layer, and an output layer. This is why it is also called a multilayer perceptron.)
During training, the predicted outputs are compared with the correct answers to create an error function, which is then used to adjust the weight assigned to each input.
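To make this concrete, here is a minimal sketch in plain Python (illustrative only, not Stratifyd's implementation) of how a trained logistic model turns a weighted sum of inputs into a probability; the weights, bias, and feature values are made up for the example.

```python
import math

def predict_probability(features, weights, bias):
    """Apply the logistic (sigmoid) function to a weighted sum of inputs."""
    weighted_sum = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))

# Hypothetical trained model: two input features with learned weights and a bias term.
weights = [0.8, -1.2]
bias = 0.1
record = [2.0, 0.5]  # feature values for one record

p = predict_probability(record, weights, bias)
print(f"P(positive class) = {p:.3f}")  # probabilities above 0.5 map to the positive label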
For more information on this concept, see the following articles.
- Logistic Regression - Detailed Overview in Towards Data Science
- Logistic Regression in scikit-learn: Machine Learning in Python
- Logistic regression in Wikipedia
Several options for model regularization and optimization methods are available.
Model Regularization
Regularization prevents overfitting. See L1 and L2 Regularization Methods in Towards Data Science for more detailed information.
- L1 (default): Lasso (Least Absolute Shrinkage and Selection Operator) regression. Reduces the number of features.
- L2: Ridge regression. Keeps more features with higher accuracy. Use when your features are sparse.
Inverse Regularization Strength
This value is set to 1.000 by default; smaller values result in stronger regularization.
Regularization does not improve performance on the training data set, but it can improve performance on new, unseen data.
Increasing the regularization strength penalizes large weight coefficients to keep the model from focusing on peculiarities or imagining a pattern where there is none.
Regularization increases the bias if the model suffers from high variance (it overfits the training data). Too much bias results in underfitting (the model displays poor performance for both the training and test dataset). The goal is to minimize the cost function by finding feature weights that correspond to the global cost minimum.
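If you want to see the effect of this setting outside the platform, the sketch below uses scikit-learn as an analogy (not a description of Stratifyd's internals) to fit the same synthetic data with two values of the inverse regularization strength C and compare how strongly the coefficients are shrunk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for C in (1.0, 0.01):  # smaller C means stronger regularization
    model = LogisticRegression(penalty="l2", C=C, max_iter=1000).fit(X, y)
    print(f"C={C}: mean |coefficient| = {np.abs(model.coef_).mean():.3f}")
```

The run with C=0.01 produces noticeably smaller coefficients, which is the "penalize large weights" behavior described above.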
Optimization Methods
- Linear (default): Linear regression. Suggested for use with L1 regularization.
- SAGA: SGD (stochastic gradient descent) classifier. Suggested for use with L2 regularization.
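As a rough outside-the-platform analogy (again scikit-learn, not Stratifyd's internals), the snippet below shows how a penalty is paired with a compatible optimizer; note that scikit-learn's solver names differ from the options listed above.

```python
from sklearn.linear_model import LogisticRegression

# In scikit-learn, the optimizer must support the chosen penalty:
# 'saga' (a stochastic gradient method) handles both L1 and L2,
# 'liblinear' handles L1, and 'lbfgs' (the default) handles L2 only.
lasso_style = LogisticRegression(penalty="l1", solver="saga", C=1.0, max_iter=5000)
ridge_style = LogisticRegression(penalty="l2", solver="lbfgs", C=1.0, max_iter=1000)
```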
Input fields
The model requires one Ground Truth field and at least one Unstructured Text or Training Feature field.
- Ground truth: a field that objectively measures how the user feels, such as a star rating.
- Unstructured text: a field that collects free-form user feedback; that is, users type it rather than selecting from a list.
- Training feature: a field that collects structured user data; that is, data that is numerical or selected from a list of options.
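As a purely hypothetical illustration of the three field roles (the column names below are invented, not required by the product), a survey export might look like this:

```python
import pandas as pd

# Hypothetical survey export illustrating the three field roles.
records = pd.DataFrame({
    "star_rating":  [1, 5, 4],                      # Ground Truth: objective outcome
    "comment_text": ["App keeps crashing on login",
                     "Love the new dashboard",
                     "Fast checkout, minor UI quirks"],  # Unstructured Text: free-form feedback
    "plan_type":    ["free", "premium", "premium"],  # Training Feature: selectable value
})
```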
Output fields
The model returns the following fields for use in widget visualizations.
- keywords: A textual array of the most important words within the data stream. In supervised models, keywords are the terms that the model finds and uses to predict each label. Keywords are only generated if you select an Unstructured Text field. If you select only a Training Feature, that is, selectable values, dates, or numbers, no keywords are generated.
- label: The predicted output based on what the model learned from your input data.
- language.code: The two-letter language code of the detected language, e.g. en, fr, jp. See Language detection for more information.
- language.name: The full name of the detected language, e.g. English, French, Japanese.
- tokenized: A list of every word detected in the corpus. Tokenizing is trivial for languages that use spaces between words, but languages in which there are no spaces between words and multi-character words are possible each require a custom tokenizer.
- unigrams: A textual array of single words within the data stream. Stratifyd calculates the total number of words and the number of unique values. Useful in a word cloud viewed with filters on average sentiment.
- Data: A table containing all of the original data from the data stream, plus all of the analyzed data from the model.
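These fields are generated by the platform itself. Purely as an illustration of where keyword-style output can come from, the sketch below trains a small logistic regression on made-up text records and reads the highest-weighted terms per label from the coefficients; this mirrors the general idea, not Stratifyd's exact method.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny, made-up training set: text plus a ground-truth label.
texts = [
    "app crashes every time I log in",
    "crashes and freezes constantly",
    "love the clean interface and fast search",
    "great interface, search works perfectly",
]
labels = ["negative", "negative", "positive", "positive"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = LogisticRegression().fit(X, labels)

# Coefficients correspond to the alphabetically second class ("positive"):
# the largest positive weights push toward "positive", the most negative toward "negative".
terms = np.array(vectorizer.get_feature_names_out())
order = np.argsort(model.coef_[0])
print("negative keywords:", terms[order[:3]])
print("positive keywords:", terms[order[-3:]])
print("predicted label:", model.predict(vectorizer.transform(["app freezes constantly"]))[0])
```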
To create a Logistic Regression Model
You can create a model from within a dashboard, or you can add one to the Models page.
- Create from dashboard
- Create from Models page
Advanced settings
You can set the following properties in the Advanced section of the Create a new model or Deploy a new model dialog.
- Minimum Word Count: Set the minimum frequency of a given word to be considered for analysis.
  Default value: 2; Minimum value: 2
- Resampling: If your data is skewed towards one class or another, you can resample your data and adjust the class distribution for better results.
  Default value: none; Valid values: Over sample, Under sample
- K-fold Cross Validation: Set the number of equal-sized random sub-samples to use for cross validation. The Ratio of training to validation set is ignored if you set this.
  Default value: 1; Minimum value: 1; Maximum value: 10
- Ratio of training to validation set: The ratio that determines how much of the dataset is used as a training set and how much is used to validate the results. This is ignored if K-fold Cross Validation is set.
  Default value: 0.2 (80/20); Valid values: 90/10, 80/20, 70/30, 60/40
- Minimum number of records for a field: Set the minimum number of records for a field to be considered for analysis. This prunes any classes with fewer records than the value you set here.
  Default value: 0; Minimum value: 0; Maximum value: 50
- Model Regularization: Select the regularization metric to use in your model. See Model Regularization for more information.
  Default value: L2; Valid values: L1, L2
- Inverse Regularization Strength: Set this number to a smaller value for stronger regularization. See Inverse Regularization Strength for more information.
  Default value: 1.000; Minimum value: 0.01
- Optimization Method: Select the learning optimization method to use in your model. See Optimization Methods for more information.
  Default value: Linear; Valid values: SAGA, Linear
- Run Language Detection: Run language detection on the source text.
  Default value: true
- Default Language: Assume this language if language detection fails. This is used to select the language-specific stopword list that is applied when cleaning the text.
  Default value: English
- Stopwords: Apply custom stopword lists to remove non-informative words from categorization. See Customize stopwords dictionaries for more information.
- Chinese Dictionary: Customize the Chinese tokens that our engine uses when creating key n-grams in your analysis. See Customize Chinese token dictionaries for more information.
- Schedule Model Retrain: Set the period in days, weeks, months, or years at which to retrain the model.
  Default value: 0
- Add Filter: Select a field on which to filter training data.
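Several of these settings have rough counterparts in open-source tooling. As a hypothetical analogy only (not what the platform runs internally), the sketch below mimics Minimum Word Count, Model Regularization, and K-fold Cross Validation with a scikit-learn pipeline; the example texts and labels are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = ["shipping was slow", "slow support response", "fast friendly support",
         "fast shipping, great support", "response time was slow", "great and fast"]
labels = [0, 0, 1, 1, 0, 1]

pipeline = make_pipeline(
    CountVectorizer(min_df=2),                 # ignore terms in fewer than 2 documents (rough analogue of Minimum Word Count)
    LogisticRegression(penalty="l2", C=1.0),   # analogue of Model Regularization / Inverse Regularization Strength
)

# Analogue of K-fold Cross Validation: 3 equal-sized folds instead of a single train/validation split.
scores = cross_val_score(pipeline, texts, labels, cv=3)
print("fold accuracies:", scores)
```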
Further questions?
We're here to help! Don't hesitate to contact us via chat or submit a ticket for further assistance!