Why use it
This model provides the most comprehensive data analysis of any of our models. You can add your own customized lexicon (sentiment dictionary) to the already-powerful built-in one, and specify your own ground truth and other fields.
If you have a very small data set, this model may not converge. If this happens, you can use Basic analyses to extract the raw data even with a limited data set.
The amount of data that the Auto-Topic Predictive Model requires depends on the data that you have. For example, if your documents are long, like news articles, you need fewer documents. If the documents are very short, however, the model might still fail even with a large number of documents.
How it works
Built on top of our proprietary Bayesian neural network and generative model, this model dynamically identifies context-based semantic topic groups in your data.
This is accomplished in a three-step process:
- The engine starts by performing natural language processing (NLP) in 26 languages. In this step, Stratifyd runs the following processes on your input documents:
  - tokenizes data into corresponding n-grams (where n >= 2)
  - lemmatizes data (groups together words with the same root, e.g. run and ran)
  - stems data (removes endings such as -ing, -ed, -s, etc.)
  - filters out spam, junk, and stop words (you can customize stop words)
  - performs part-of-speech tagging and named entity extraction
  - creates a large n-gram-based content network
- The engine runs a multi-model approach on top of the n-gram-based content network. Models include proprietary text analytics algorithms extended from our Bayesian neural network, generative model, long short-term memory (LSTM), and sequence-to-sequence (Seq2Seq) natural language understanding (NLU). In this step, Stratifyd runs the following processes on the n-gram-based content network:
  - clusters data input into semantically meaningful groups
  - generates and visualizes the groups by statistical significance (i.e. the percentage attributed to each topic category in the Semantic Topic Visualization)
  - tags each topic with top representative terms in Buzzwords
- Stratifyd automatically processes all geographical (where), temporal (when), and contributor (who) data, as well as any structured data. It joins the data with the n-gram-based content network so that you can pivot and construct analytics questions against your dataset.
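As a rough illustration of the first step, the following Python sketch tokenizes a few documents, filters stop words, and counts bigrams. It is a simplified stand-in, not Stratifyd's engine: the stop word list, the function names, and the sample documents are all made up for this example, and lemmatization, stemming, and entity extraction are left out. Because stop words are removed before the n-grams are formed, words that were separated by a stop word can end up in the same n-gram.

```python
import re
from collections import Counter

# Hypothetical stop word list for illustration; not Stratifyd's built-in list.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "to", "of"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(documents, n=2, min_count=2):
    """Count n-grams across documents after stop word removal.

    Only n-grams seen at least min_count times are returned.
    """
    counts = Counter()
    for doc in documents:
        tokens = [t for t in tokenize(doc) if t not in STOP_WORDS]
        counts.update(ngrams(tokens, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Made-up sample documents.
docs = [
    "The battery life is great",
    "Battery life was short and the screen is dim",
    "Great screen, great battery life",
]
print(count_ngrams(docs))  # {('battery', 'life'): 3}
```

Raising `min_count` in this sketch drops rare n-grams, which is loosely analogous to the Minimum Word Count property described later.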
You can select any fields as input, but only one Text field is required. The following field types have special uses. See Mapping fields for more information.
- To enable sparklines in the list visualization of ngrams and in other time-based visualizations, map a Temporal field to the Date mapping. (This adds a time_stamp output field.)
- To use the map visualization, map a Geo field to any of the Geo mappings. (This adds locations, coordinates, and postcode output fields.)
The model returns the following fields for use in widget visualizations.
- language.code: The two-letter language code of the detected language, e.g. en, fr, jp. See Language Detection for more information.
- language.name: The full name of the detected language, e.g. English, French, Japanese.
- ngrams: Sets of words (two by default) that appear together frequently in the corpus.
- sentiment.overall: The overall sentiment detected within the corpus, with values ranging from -5 to 5.
- tokenized: A list of every word detected in the corpus. Tokenizing is trivial for languages that put spaces between words. Languages that do not separate words with spaces, and in which a word can span multiple characters, each require a custom tokenizer.
- topics: Clusters of similar themes or related terms within the corpus of text.
- translated.language.code*: The two-letter language code of the translated (Translate To) language, e.g. en, fr, jp.
- translated.language.name*: The full name of the translated (Translate To) language, e.g. English, French, Japanese.
- translated.ngrams*: Sets of translated words (two by default) that appear together frequently in the corpus. Note that this is untranslated if you use Translate Model Output.
- translated.text*: The full verbatim text as translated.
- translated.unigrams*: A textual array of single translated words within the data stream. Stratifyd calculates the total number of words and the number of unique values. Useful in a word cloud viewed with filters on average sentiment.
- unigrams: A textual array of single words within the data stream. Stratifyd calculates the total number of words and the number of unique values. Useful in a word cloud viewed with filters on average sentiment.
- Data: A table containing all of the original data from the data stream, plus all of the analyzed data from the model.
*In order to return the translated fields, you must subscribe to the Translate Text feature. Text translation involves an up-charge, as it uses a third-party translation service. If you want to use translation, please speak with your Stratifyd representative.
Once enabled, translate options appear in the Advanced section of the Deploy a New Model wizard. See Translate Text and Languages for more information.
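If you export the output fields for your own processing, a small helper can turn the sentiment.overall score into a coarse label. The sketch below is only an illustration: it assumes each record is a flat mapping keyed by the field names listed above, the rows are made up, and the `sentiment_label` function and its `neutral_band` cutoff are hypothetical rather than part of Stratifyd. The only detail taken from this article is the -5 to 5 range.

```python
def sentiment_label(score, neutral_band=0.5):
    """Map a sentiment.overall score in [-5, 5] to a coarse label.

    neutral_band is an illustrative cutoff, not a Stratifyd setting.
    """
    if not -5 <= score <= 5:
        raise ValueError(f"score {score} is outside the documented range -5 to 5")
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


# Made-up rows shaped like the output fields listed above.
rows = [
    {"language.code": "en", "sentiment.overall": 3.2},
    {"language.code": "fr", "sentiment.overall": -1.4},
    {"language.code": "en", "sentiment.overall": 0.1},
]

for row in rows:
    print(row["language.code"], sentiment_label(row["sentiment.overall"]))
```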
To create an Auto-Topic Predictive Model
Within the dashboard you want to run the model on, click the Data Settings Panel button.
In the Select a Data settings panel that appears, you can Connect To Your Data by selecting an existing data stream or creating a new data stream through a connector.
In the middle column, Analyze, click to deploy new under Stratifyd Models.
In the next box that appears, select the data stream you want to run the model on, then select the model you want to deploy (you can read more about each model in the Analyze Your Data collection). Select the appropriate fields (such as text), then click start Analysis.
You return to the Data Settings panel and can see this analysis under deployed models in the analyze column.
You can set the following properties in the Advanced section of the Create a new model or Deploy a new model dialog.
- Topic Threshold: Set the similarity threshold for categorizing the documents into topics.
Default value: 0.3; Minimum value: 0.01; Maximum value: 0.99
- Apply training filter to analysis: Apply your custom data training filter to your analysis results.
Default value: true
- Run Language Detection: Run language detection on the source text.
Default value: true
- Default Language: Assume this language if language detection fails. This is used to select a language-specific stopword list to apply to clean text.
Default value: English
- N-gram Length: Use this number of tokens when building buzzwords.
Default value: 2; Minimum value: 2; Maximum value: 4
- Minimum Word Count: Set the minimum frequency of a given word to be considered for analysis. Increase this number if you have a large number of documents. Since the analysis is heuristic-based, increasing the minimum word count makes the statistic more accurate when you have more documents.
Default value: 2; Minimum value: 2
- Token Threshold: Controls how many unhelpful tokens to filter out during analysis. Lower the threshold to allow more tokens and guarantee more data points when considering a feature, for instance if your model fails due to a lack of text.
Default value: 0; Minimum: -3; Maximum: 3
- Chinese Dictionary: Customize Chinese tokens that our engine uses when creating key n-grams in your analysis.
- Sentiment: Apply custom sentiment word lists based on your own domain knowledge or data properties. Stratifyd applies your sentiment dictionary to your analysis in addition to the built-in sentiment dictionary that contains around 80,000 words across all languages.
- Stopwords: Apply custom stopword lists to remove non-informative words from categorization.
See Customize stopwords lists for more information.
- Schedule Model Retrain: Specify the number of days, weeks, months, or years after which to retrain your model.
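The numeric properties above can be summarized as a table of defaults and bounds. The following Python sketch checks a set of overrides against those documented ranges before you enter them in the dialog. The key names and the `resolve_settings` function are hypothetical and are not Stratifyd API identifiers; only the default, minimum, and maximum values come from the list above.

```python
# Defaults and bounds copied from the property list above.
# The key names are hypothetical; they are not Stratifyd identifiers.
ADVANCED_PROPERTIES = {
    "topic_threshold": {"default": 0.3, "min": 0.01, "max": 0.99},
    "ngram_length": {"default": 2, "min": 2, "max": 4},
    "minimum_word_count": {"default": 2, "min": 2, "max": None},
    "token_threshold": {"default": 0, "min": -3, "max": 3},
}


def resolve_settings(overrides=None):
    """Merge overrides into the defaults, rejecting out-of-range values."""
    settings = {name: spec["default"] for name, spec in ADVANCED_PROPERTIES.items()}
    for name, value in (overrides or {}).items():
        if name not in ADVANCED_PROPERTIES:
            raise KeyError(f"Unknown property: {name}")
        spec = ADVANCED_PROPERTIES[name]
        if spec["min"] is not None and value < spec["min"]:
            raise ValueError(f"{name}={value} is below the minimum {spec['min']}")
        if spec["max"] is not None and value > spec["max"]:
            raise ValueError(f"{name}={value} is above the maximum {spec['max']}")
        settings[name] = value
    return settings


# Example: longer buzzwords and a more permissive token filter.
print(resolve_settings({"ngram_length": 3, "token_threshold": -1}))
```

An N-gram Length of 5, for example, would raise a `ValueError` in this sketch, because the documented maximum is 4.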