The Semi-Automatic Taxonomy (SAT) model combines the strengths of a topic model (LDA) with those of a taxonomy tree.

The Semi-Automatic Taxonomy model takes a data stream with a free-form text field as a training set and applies an existing taxonomy tree to the stream. Each record in the stream is given one or more labels from the taxonomy tree, where applicable. The SAT model is then trained on the stream in the same way as LDA, using the information from the labels to initialize its parameters. Once training is finished, the trained model can evaluate more data. In the deployment phase, the trained model is applied to a selected data stream, which can be either the training set or a new stream. If it is a new stream, it should come from the same source as the training set so that the sample distribution stays consistent between the training and test data.
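
As a rough illustration of how labels can seed training, the sketch below shows one plausible way to initialize an LDA-style topic-word count matrix from labeled records before sampling begins. All names (seed_topic_word_counts, label_to_topic, and so on) are hypothetical and simplified; this is not the actual SAT implementation.

    import numpy as np

    # Illustrative sketch only: bias each topic toward the words of the
    # documents that carry its taxonomy label, then train plain LDA from
    # these counts instead of from a uniform initialization.
    def seed_topic_word_counts(docs, doc_labels, label_to_topic,
                               vocab_size, n_topics, boost=1.0):
        # docs           -- list of documents, each a list of word ids
        # doc_labels     -- list of label sets, one per document (may be empty)
        # label_to_topic -- maps each taxonomy label to a topic index
        topic_word = np.ones((n_topics, vocab_size))    # symmetric prior
        for words, labels in zip(docs, doc_labels):
            for label in labels:                        # unlabeled docs are skipped
                topic = label_to_topic[label]
                for word in words:
                    topic_word[topic, word] += boost
        return topic_word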

For information on the feedback loop and retraining a model, see Model Feedback Loop.

Input (Training)

The input of the training phase is:

  1. a taxonomy tree, and
  2. a data stream.

Output (Training)

The output of the training phase is a model added to the Models tab.

Input (Deployment)

The input of the deployment phase is a data stream selected by the user.

Output (Deployment)

The output of the deployment phase is one or more labels, and the keywords associated with each label, for each record in the stream. Each label and keyword is internally associated with a confidence score, which is also produced by the SAT model. The number of labels for a record depends on how many labels receive a confidence score greater than a threshold of 1/(number of topics + 1): if the confidence score of a label exceeds the threshold, the label is given to the record. The top 10 keywords related to each assigned label are then returned, sorted by confidence score in descending order. The confidence scores of the keywords are not displayed but are contained in the output. The confidence score for a label can be roughly interpreted as the proportion of sampled tokens from the input document that fall under that label. The confidence score for a keyword can be roughly interpreted as the probability that the keyword occurs under the predicted label, relative to the other labels.
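
The label threshold and keyword selection described above can be sketched as follows. The structures (label_scores, keyword_scores) and the function name are hypothetical; in the product, the scores come from the trained SAT model.

    def select_labels_and_keywords(label_scores, keyword_scores, n_topics, top_k=10):
        # label_scores   -- {label: confidence} for one record
        # keyword_scores -- {label: {keyword: confidence}}
        threshold = 1.0 / (n_topics + 1)
        result = {}
        for label, score in label_scores.items():
            if score > threshold:                       # label is assigned to the record
                ranked = sorted(keyword_scores[label].items(),
                                key=lambda kv: kv[1], reverse=True)
                result[label] = [kw for kw, _ in ranked[:top_k]]   # top 10 keywords
        return result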

Input fields

The model requires one Unstructured Text field, that is, a field that collects free-form feedback that users type rather than select from a list.

Output fields 

The model returns the following fields for use in widget visualizations.

  • confidence: The confidence level about the accuracy of the prediction on a scale of 0 to 100.
  • language.code: The two-letter language code of the detected language, e.g. en, fr, jp. See Language detection for more information.
  • language.name: The full name of the detected language, e.g. English, French, Japanese.
  • semiauto.taxonomy.keywords: A textual array of the most important words within the data stream. In supervised models, keywords are the terms that the model finds and uses to predict each label. Keywords are only generated if you select an Unstructured text field. If you select only a Training Feature, that is, selectable values, dates, or numbers, no keywords are generated.
  • semiauto.taxonomy.labels: The predicted output based on what the model learned from your input data. You can use this in a Taxonomy Collapsible Tree or a Taxonomy Folder View visualization.
  • tokenized: A list of every word detected in the corpus. This is trivial for languages that use spaces between words, but languages that do not use spaces between words and allow multi-character words each require a custom tokenizer.
  • translated.language.code*: The two-letter language code of the translated (Translate To) language, e.g. en, fr, jp.
  • translated.language.name*: The full name of the translated (Translate To) language, e.g. English, French, Japanese.
  • translated.text*: The full verbatim text as translated.
  • translated.unigrams*: A textual array of single translated words within the data stream. Stratifyd calculates the total number of words and the number of unique values. Useful in a word cloud viewed with filters on average sentiment.
  • unigrams: A textual array of single words within the data stream. Stratifyd calculates the total number of words and the number of unique values. Useful in a word cloud viewed with filters on average sentiment.
  • Data: A table containing all of the original data from the data stream, plus all of the analyzed data from the model.

*In order to return the translated fields, you must subscribe to the Translate Text feature. Text translation involves an up-charge, as it uses a third-party translation service. If you want to use translation, please speak with your Stratifyd representative.

Once enabled, translate options appear in the Advanced section of the Deploy a New Model wizard. See Translate Text and Languages for more information.
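
Taken together, one analyzed record might look roughly like the following. All values are invented for illustration, and the translated.* fields appear only when Translate Text is enabled.

    # Hypothetical example of a single analyzed record; values are made up.
    example_record = {
        "confidence": 87,
        "language.code": "en",
        "language.name": "English",
        "semiauto.taxonomy.labels": ["Billing > Late Fees"],
        "semiauto.taxonomy.keywords": ["fee", "charge", "statement", "waive", "refund"],
        "tokenized": ["i", "was", "charged", "a", "late", "fee", "again"],
        "unigrams": ["charged", "late", "fee"],
    }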

Evaluation Metrics

Completion rate
This number reflects how many additional documents are labeled by applying the Semi-Automatic Taxonomy model. It is calculated using only the documents not labeled by the original taxonomy. It is the ratio of the number of documents labeled by the Semi-Automatic Taxonomy model to the total number of unlabeled documents.

Match rate
This number reflects how closely the Semi-Automatic Taxonomy model matches your original model. It is calculated using only labeled documents. It is the ratio of the number of documents where the Semi-Automatic Taxonomy model's label matches the original taxonomy label to the total number of labeled documents.
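
Both metrics reduce to simple ratios over the deployed records. A minimal sketch, assuming each record carries hypothetical original_labels and sat_labels fields and counting a match as any overlap between the two label sets:

    def completion_rate(records):
        # Fraction of originally unlabeled documents that the SAT model labeled.
        unlabeled = [r for r in records if not r["original_labels"]]
        newly_labeled = [r for r in unlabeled if r["sat_labels"]]
        return len(newly_labeled) / len(unlabeled) if unlabeled else 0.0

    def match_rate(records):
        # Fraction of originally labeled documents whose SAT label agrees with
        # the original taxonomy label.
        labeled = [r for r in records if r["original_labels"]]
        matches = [r for r in labeled
                   if set(r["sat_labels"]) & set(r["original_labels"])]
        return len(matches) / len(labeled) if labeled else 0.0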

To create a Semi-Automatic Taxonomy model

You can create a model from within a dashboard, or you can add one to the Models page.

You must have a taxonomy to use as a data source for the Semi-Automatic Taxonomy model. For more information, see Taxonomy Analysis.

From within a dashboard

1. In your dashboard, click the Data icon to open the Data Ingestion pane.


2. In the Data Ingestion pane that appears, next to the data stream to which you want to add a model, click the vertical ellipsis and select Edit.

3. In the Edit Data Streams dialog that appears, click Deploy Model.

4. In the Deploy Model dialog that appears, under Supervised models, click AutoLearn.

5. In the Deploy a New Model wizard that appears, in the Unassigned Fields column to the left, find a free-form text field to use, click the + next to it, and select Unstructured Text. 

6. The field is added to the list of Assigned Fields. You can optionally select additional fields. Click Next to continue.

7. On the Parameters page of the wizard, in the Taxonomies box, click the plus sign to add a taxonomy into which to categorize your data. 

8. In the Search dialog that appears, select the taxonomy that best fits the data stream to which you are applying the model. 

9. Back on the Parameters page of the wizard, you can optionally select additional taxonomies, then click Next.

10. On the Complete & Submit page of the wizard, you can optionally change the name, add tags, and add a description. These fields appear in the tile for the model once it is submitted.


11. Optionally click Advanced to reveal more settings. Some of these settings are common to all models, and some are specific to the Semi-Automatic Taxonomy model. See Advanced settings below for details.

12. Click Submit to save the settings. Back in the Edit Data Streams dialog, the new model is added to the list of Deployed models. Click Submit to begin training the model.

The model appears on the Data Ingestion pane in waiting status, and then in processing status. It may process for a while, depending on your data.

When it is finished, it appears in the up-to-date models list.

From the Models page

1. On the Models page, click the New Model button to add a new model.

2. In the Create a new model wizard that appears, click Semi-Automatic Taxonomy and then click Next. 

3. On the second page of the wizard, select the data stream to train with the model and click Next.

You can optionally add a filter before clicking Next. For more information on filters, see Filter and segment data.

4. On the third page of the wizard, under Unassigned Fields, find a free-form text field to use, click the + next to it, and select Unstructured Text. 

5. The field is added to the list of Assigned Fields. You can optionally select additional fields. Click Next to continue.

6. On the Parameters page of the wizard, in the Taxonomies box, click the plus sign to add a taxonomy into which to categorize your data. 

7. In the Search dialog that appears, select the taxonomy that best fits the data stream to which you are applying the model. 

8. Back on the Parameters page of the wizard, you can optionally select additional taxonomies, then click Next.

9. On the Complete & Submit page of the wizard, add a name and optionally add tags and a description. These fields appear in the tile for the model once it is submitted.


10. Optionally click Advanced to reveal more settings.

Some of these settings are common to all models, and some are specific to the Semi-Automatic Taxonomy model. See Advanced settings below for details.

11. Click Submit to begin training the model. A message appears indicating that the model is training.

When the model is ready, the message disappears and the model is added to the Models page.

On the Models page, you can interact with models in the following ways.

  • Model Info - Click the tile to open the Model Info dialog, where you can see details about the model and opt to schedule when and how often you want to retrain the model on newer data. See Retrain a model for more information.
  • Share - Click the vertical ellipsis at the top right of the tile and click Share to share the model with other users or groups.
  • Properties - Click the vertical ellipsis and click Properties to open the Model Properties dialog, where you can edit the following.
    Name - The name of the model to display on the tile.
    Tags - Any tags for the model to help users quickly find the models they need.
    Image - An image to represent the model. Select an included image or upload your own.
    Description - A description of the model.
  • Delete - Click the vertical ellipsis and click Delete to delete the model. Deleting is not reversible, but a dialog allows you to cancel.

Advanced settings

You can set the following properties in the Advanced section of the Create a new model or Deploy a new model dialog. 

  • Sample Balance Factor: If the category distribution is highly imbalanced, adjust this value to balance the sampling of the smallest categories and improve the performance of the model.
    Default value: 5; Minimum value: 1; Maximum value: 100
  • Training Period: The number of times the model goes through the dataset. Too few may prevent the model from converging; too many may cause overfitting.
    Default value: 40; Minimum value: 2
  • Minimum Word Count: Set the minimum frequency a word must have to be considered for analysis (see the sketch after this list).
    Default value: 2; Minimum value: 2
  • Process All Documents: Clear to process only documents that are missing a label.
    You must clear this option in order to use a Taxonomy before applying an Unsupervised NLU model.
  • Apply training filter to analysis: Apply your custom data training filter to your analysis results. (See Add Filter below.)
    Default value: true
  • Run Language Detection: Run language detection on the source text.
    Default value: true
  • Default Language: Assume this language if language detection fails. This is used to select a language-specific stopword list to apply to clean text.
    Default value: English
  • Stopwords: Apply custom stopword lists to remove non-informative words from categorization.
    See Customize stopwords dictionaries for more information.
  • Chinese Dictionary: Customize Chinese tokens that our engine uses when creating key n-grams in your analysis.
    See Customize Chinese token dictionaries for more information.
  • Schedule Model Retrain: Specify the number of days, weeks, months, or years after which to retrain your model.
  • Add Filter: Select a field on which to filter training data. If Apply training filter to analysis is selected above, the filter also applies to your analysis results.
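
To see how two of these settings shape the vocabulary before training, here is a simplified sketch (not the engine's actual logic): words on a stopword list are dropped, and words that occur fewer than Minimum Word Count times in the corpus are ignored.

    from collections import Counter

    def build_vocabulary(tokenized_docs, stopwords, min_word_count=2):
        # tokenized_docs -- list of token lists (compare the 'tokenized' output field)
        # stopwords      -- set of non-informative words to drop
        counts = Counter(word for doc in tokenized_docs for word in doc)
        return {word for word, count in counts.items()
                if count >= min_word_count and word not in stopwords}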
