What is Theme Detection?
This model provides the most comprehensive data analysis of any of our models. You can add your own customized lexicon (sentiment dictionary) to the already-powerful built-in one, and specify your own ground truth and other fields.
If you have a very small data set, this model may not converge. If this happens, you can use Basic analyses to extract the raw data even with a limited data set.
The amount of data that the Theme Detection model requires depends on the nature of your data. For example, if your documents are long, like news articles, you need fewer documents. However, if the documents are very short, the model might still fail to converge even with a large number of documents.
How does this model work?
Built on top of our proprietary Bayesian neural network and generative model, this model dynamically identifies context-based semantic topic groups in your data.
This is accomplished in a three-step process:
-
The engine starts by performing natural language processing (NLP) in 26 languages. In this step, Stratifyd runs the following processes on your input documents:
-
tokenizes data into corresponding n-grams (where n >= 2)
-
lemmatizes data (groups together words with the same root, e.g. run and ran)
-
stems data (removes endings such as -ing, -ed, -s, etc.)
-
filters out spam, junk and stop words (you can customize stop words)
-
performs part-of-speech tagging and named entity extraction
-
creates a large n-gram-based content network
-
The engine runs a multi-model approach on top of the n-gram-based content network. Models include proprietary text analytics algorithms extended from our Bayesian neural network, generative model, long short-term memory (LSTM), and sequence-to-sequence (Seq2Seq) natural language understanding (NLU). In this step, Stratifyd runs the following processes on the n-gram-based content network:
-
clusters data input into semantically meaningful groups
-
generates and visualizes the groups by statistical significance (i.e. the percentage attributed to each topic category in the Semantic Topic Visualization)
-
tags each topic with top representative terms in Buzzwords
-
Stratifyd automatically processes all geographical (where), temporal (when), and contributor (who) data, as well as any structured data. It joins the data with the n-gram-based content network so that you can pivot and construct analytics questions against your dataset.
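To make the first step more concrete, here is a minimal Python sketch of tokenizing text, filtering stop words, and counting n-grams. It is an illustration only, not Stratifyd's implementation: the function names, the regular expression, and the tiny stop-word list are all assumptions made for this example, and it skips lemmatization, stemming, part-of-speech tagging, and entity extraction.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; the built-in lists are not shown here.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "to", "of", "in", "it"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


def build_ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(documents, n=2):
    """Count n-grams across all documents after stop-word filtering."""
    counts = Counter()
    for doc in documents:
        counts.update(build_ngrams(remove_stop_words(tokenize(doc)), n))
    return counts


docs = [
    "The delivery was late and the package was damaged.",
    "My package was damaged in the mail.",
]
counts = ngram_counts(docs)
print(counts["package damaged"])  # 2
```

Because stop words are removed before the n-grams are built in this sketch, "package was damaged" yields the bigram "package damaged" in both documents. Whether the real engine filters before or after building n-grams is not stated in this article.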
Output fields
The model returns the following fields for use in widget visualizations.
-
language.code: The two-letter language code of the detected language, e.g. en, fr, jp. See Language Detection for more information.
-
language.name: The full name of the detected language, e.g. English, French, Japanese.
-
ngrams: Sets of words (two by default) that appear together frequently in the corpus.
-
sentiment.overall: The overall sentiment detected within the corpus, with values ranging from -5 to 5.
-
tokenized: A list of every word detected in the corpus. Tokenization is trivial for languages that separate words with spaces. Languages that do not use spaces and allow multi-character words each require a custom tokenizer.
-
topics: Clusters of similar themes or related terms within the corpus of text.
-
translated.language.code*: The two-letter language code of the translated (Translate To) language, e.g. en, fr, jp.
-
translated.language.name*: The full name of the translated (Translate To) language, e.g. English, French, Japanese.
-
translated.ngrams*: Sets of translated words (two by default) that appear together frequently in the corpus. Note that this is untranslated if you use Translate Model Output.
-
translated.text*: The full verbatim text as translated.
-
translated.unigrams*: A textual array of single translated words within the data stream. Stratifyd calculates the total number of words and the number of unique values. Useful in a word cloud viewed with filters on average sentiment.
-
unigrams: A textual array of single words within the data stream. Stratifyd calculates the total number of words and the number of unique values. Useful in a word cloud viewed with filters on average sentiment.
-
Data: A table containing all of the original data from the data stream, plus all of the analyzed data from the model.
*In order to return the translated fields, you must subscribe to the Translate Text feature. Text translation involves an up-charge, as it uses a third-party translation service. If you want to use translation, please speak with your Stratifyd representative.
Once enabled, translate options appear in the Advanced section of the Deploy a New Model wizard.
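If you work with the output fields outside of a widget, a sketch like the following shows how unigrams and sentiment.overall could be combined, similar in spirit to a word cloud filtered on average sentiment. The row layout (one dictionary per document, keyed by output field name) and all values are assumptions invented for this example. The article does not describe an export format.

```python
from collections import defaultdict

# Hypothetical rows; the structure and values are invented for illustration.
rows = [
    {"language.code": "en", "sentiment.overall": -3.0,
     "unigrams": ["package", "damaged"], "ngrams": ["package damaged"]},
    {"language.code": "en", "sentiment.overall": 4.0,
     "unigrams": ["fast", "delivery"], "ngrams": ["fast delivery"]},
    {"language.code": "en", "sentiment.overall": -1.0,
     "unigrams": ["late", "delivery"], "ngrams": ["late delivery"]},
]


def average_sentiment_by_unigram(rows):
    """Average the document-level sentiment over every document containing each word."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        score = row["sentiment.overall"]
        # The documented range for sentiment.overall is -5 to 5.
        if not -5 <= score <= 5:
            raise ValueError(f"sentiment out of range: {score}")
        for word in set(row["unigrams"]):
            totals[word] += score
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


averages = average_sentiment_by_unigram(rows)
print(averages["delivery"])  # 1.5
```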
To use the Theme Detection Model
1. You can create a model from within a workspace, or you can add one to the Models page. Here, we create it from within a workspace.
2. To access the Data Settings Panel menu, click the Data settings button, accompanied by the gear icon.
In the data settings panel that appears, you'll see a list of all data connections for that workspace. Make sure you have selected the data stream you want to work with in the Connected column.
3. In the Analyze tab, expand the section labelled What are our customers saying? to see available models.

6. Depending on the size and complexity of your data, it may take some time for the analysis to finish running. When you return to the Data Settings, you'll see this analysis within the Deployed section at the top of the Analyze tab.
Using the Advanced Setup
You can set the following properties in the Advanced section of the Create a new model or Deploy a new model dialog.
-
Topic Threshold: Set the similarity threshold for categorizing the documents into topics.
Default value: 0.3; Minimum value: 0.01; Maximum value: 0.99
-
Apply training filter to analysis: Apply your custom data training filter to your analysis results.
Default value: true
-
Run Language Detection: Run language detection on the source text.
Default value: true
-
Default Language: Assume this language if language detection fails. This is used to select a language-specific stopword list to apply to clean text.
Default value: English
-
N-gram Length: Use this number of tokens when building buzzwords.
Default value: 2; Minimum value: 2; Maximum value: 4
-
Minimum Word Count: Set the minimum number of times a word must appear to be considered for analysis. Increase this number if you have a large number of documents. Because the analysis is heuristic-based, a higher minimum word count makes the statistics more accurate for larger document sets.
Default value: 2; Minimum value: 2
-
Token Threshold: Controls how many unhelpful tokens to filter out during analysis. Lower the threshold to allow more tokens and guarantee more data points when considering a feature, for instance if your model fails due to a lack of text.
Default value: 0; Minimum value: -3; Maximum value: 3
-
Chinese Dictionary: Customize Chinese tokens that our engine uses when creating key n-grams in your analysis.
-
Sentiment: Apply custom sentiment word lists based on your own domain knowledge or data properties. Stratifyd applies your sentiment dictionary to your analysis in addition to the built-in sentiment dictionary that contains around 80,000 words across all languages.
-
Stopwords: Apply custom stopword lists to remove non-informative words from categorization.
See Customize stopwords lists for more information.
-
Schedule Model Retrain: Specify the number of days, weeks, months, or years after which to retrain your model.
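The documented defaults and ranges above can be collected into a small validation sketch, which may help if you keep model settings in a script or checklist before entering them in the dialog. The class and field names are hypothetical and are not part of any Stratifyd API. Only the default, minimum, and maximum values come from the list above.

```python
from dataclasses import dataclass


@dataclass
class AdvancedSettings:
    """Hypothetical container mirroring the documented Advanced properties."""
    topic_threshold: float = 0.3         # documented range: 0.01 to 0.99
    apply_training_filter: bool = True
    run_language_detection: bool = True
    default_language: str = "English"
    ngram_length: int = 2                # documented range: 2 to 4
    minimum_word_count: int = 2          # documented minimum: 2
    token_threshold: float = 0           # documented range: -3 to 3

    def validate(self):
        """Return a list of messages for values outside the documented ranges."""
        errors = []
        if not 0.01 <= self.topic_threshold <= 0.99:
            errors.append("Topic Threshold must be between 0.01 and 0.99")
        if not 2 <= self.ngram_length <= 4:
            errors.append("N-gram Length must be between 2 and 4")
        if self.minimum_word_count < 2:
            errors.append("Minimum Word Count must be at least 2")
        if not -3 <= self.token_threshold <= 3:
            errors.append("Token Threshold must be between -3 and 3")
        return errors


print(AdvancedSettings().validate())                # []
print(AdvancedSettings(ngram_length=5).validate())  # ['N-gram Length must be between 2 and 4']
```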
Special Input Fields
-
In order to use the map visualization, map a Geo field to any of the Geo mappings. (This adds locations, coordinates, and postcode output fields.)
-
To enable sparklines in the list visualization of ngrams and other time-based visualizations, map a Temporal field to the Date mapping. (This adds a time_stamp output field.)
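As a rough picture of what a sparkline summarizes, the sketch below counts how often one n-gram appears per month using a time_stamp value. It is an assumption-laden illustration: the row layout, the ISO 8601 timestamp format, and the function name are invented here, and the product's own aggregation may differ.

```python
from collections import Counter
from datetime import datetime

# Hypothetical rows; the time_stamp format is assumed to be ISO 8601.
rows = [
    {"time_stamp": "2023-01-05T10:00:00", "ngrams": ["late delivery"]},
    {"time_stamp": "2023-01-20T09:30:00", "ngrams": ["late delivery", "package damaged"]},
    {"time_stamp": "2023-02-02T14:15:00", "ngrams": ["package damaged"]},
    {"time_stamp": "2023-03-11T08:00:00", "ngrams": ["late delivery"]},
]


def monthly_counts(rows, ngram):
    """Count the documents containing the n-gram, grouped by year and month."""
    counts = Counter()
    for row in rows:
        if ngram in row["ngrams"]:
            month = datetime.fromisoformat(row["time_stamp"]).strftime("%Y-%m")
            counts[month] += 1
    # Sort chronologically; months with no matches are simply absent.
    return sorted(counts.items())


print(monthly_counts(rows, "late delivery"))  # [('2023-01', 2), ('2023-03', 1)]
```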
Further questions?
We're here to help! Don't hesitate to contact us for further assistance via chat or submit a ticket!