Stratifyd includes a built-in stopwords list of the most commonly used uninformative words, such as “the,” “an,” and “I,” to reduce noise in textual data.
When a data model tokenizes the data, it checks each word against the stopwords list and filters out the stopwords.
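As a rough illustration of this kind of filtering (not Stratifyd's actual implementation), the sketch below tokenizes a sentence and drops any token found in a stopwords set. The `DEFAULT_STOPWORDS` set and both function names are hypothetical and exist only for this example.

```python
import re

# Hypothetical, tiny stand-in for a built-in stopwords list.
DEFAULT_STOPWORDS = {"the", "an", "a", "i", "is", "to", "and"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopwords set."""
    return [t for t in tokens if t not in stopwords]

tokens = tokenize("I think the app is easy to use")
filtered = remove_stopwords(tokens, DEFAULT_STOPWORDS)
print(filtered)  # ['think', 'app', 'easy', 'use']
```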
You can also create custom stopwords lists for each data source and apply them when you create data models. The Stopwords setting is in the Advanced section of the new model wizard for all applicable models. See articles on individual data models for more information.
Why use a custom stopwords list?
The built-in stopwords list is comprehensive for general purposes, but some analyses may achieve more meaningful results with the addition of a data-source-specific stopwords list.
You can use any list of words as stopwords. The effect may not be dramatic because Stratifyd uses pointwise mutual information (PMI). PMI filters out many of the words you might list as stopwords by giving them very low PMI scores. However, you may still want to upload a list of stopwords to train the system for specific analyses.
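To make the PMI idea concrete, here is a minimal sketch that scores adjacent word pairs with the textbook formula PMI(x, y) = log2(p(x, y) / (p(x) p(y))). How Stratifyd computes and applies PMI internally is not described in this article, so the function name `bigram_pmi` and the sample text are assumptions for illustration only.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for each adjacent word pair in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi          # probability of the pair
        p_x = unigrams[w1] / n_uni   # probability of the first word
        p_y = unigrams[w2] / n_uni   # probability of the second word
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Made-up sample text for illustration.
text = ("the battery life is short the screen is sharp "
        "the battery life is fine the price is fine")
scores = bigram_pmi(text.split())

# Print pairs from highest to lowest PMI.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

In this sample, "battery life" scores higher than "the battery" because "the" co-occurs with many different words, which is the behavior that makes common filler words rank low.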
You can apply multiple stopwords lists to the same data model and Stratifyd treats them as a single, combined list.
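Conceptually, treating several lists as one combined list is a set union. The sketch below assumes case-insensitive matching and whitespace trimming, which this article does not confirm; `combine_stopword_lists` is a hypothetical helper.

```python
def combine_stopword_lists(*lists):
    """Merge any number of stopwords lists into one de-duplicated set."""
    combined = set()
    for words in lists:
        # Assumption: terms are compared in lowercase with outer spaces removed.
        combined.update(w.strip().lower() for w in words if w.strip())
    return combined

app_store_list = ["app", "update", "Typical Day"]
survey_list = ["typical day", "hardest part"]
combined = combine_stopword_lists(app_store_list, survey_list)
print(sorted(combined))  # ['app', 'hardest part', 'typical day', 'update']
```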
Supported languages
The following languages have a default stopwords list. During data ingestion, Stratifyd determines the language (if the Language Detection advanced option is not turned off) and checks whether a default stopwords list exists for that language. It also checks for any custom stopwords lists applied to the model, and applies these lists regardless of language.
-
ar Arabic
-
bg Bulgarian
-
br Breton
-
cz Czech
-
da Danish
-
de German
-
el Greek
-
en English
-
es Spanish
-
fa Farsi
-
fi Finnish
-
fr French
-
hi Hindi
-
hu Hungarian
-
hy Armenian
-
id Indonesian
-
it Italian
-
ja Japanese
-
lv Latvian
-
nl Dutch
-
no Norwegian
-
pl Polish
-
pt Portuguese
-
ro Romanian
-
ru Russian
-
sv Swedish
-
tr Turkish
-
zh Chinese
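The selection logic described above can be sketched as follows: look up a default list by the detected language code, then add every custom list regardless of language. The `DEFAULT_LISTS` contents and the `stopwords_for` function are hypothetical samples, not Stratifyd's real data or API.

```python
# Hypothetical excerpts of per-language default lists, keyed by language code.
DEFAULT_LISTS = {
    "en": {"the", "an", "i"},
    "de": {"der", "die", "das"},
}

def stopwords_for(language_code, custom_lists):
    """Return the default list for the language (if any) plus all custom lists."""
    active = set(DEFAULT_LISTS.get(language_code, set()))
    for custom in custom_lists:
        active |= set(custom)  # custom lists apply regardless of language
    return active

custom = [{"typical day", "hardest part"}]
print(sorted(stopwords_for("de", custom)))
# ['das', 'der', 'die', 'hardest part', 'typical day']
print(sorted(stopwords_for("sw", custom)))  # no default list in this sample
# ['hardest part', 'typical day']
```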
Custom stopwords lists
A stopwords list is a list of terms that the analytics engine ignores when looking for patterns and connections in your data.
Most data sources have noisy terms that are specific to that source. For example, if you analyze data from the Indeed connector, you can see a high volume of "typical day" or "hardest part" bi-grams in the buzzword cloud. Reviewers look at the form in the image below as they fill out their textual responses, so the survey biases them to use those terms, making them good candidates to filter out to get to the organic content.
A new data source is run with the default stopwords list, and our in-dashboard editor allows you to select N-Grams, topics, contributors, or other features to suppress.
For example, RSS news feeds typically contain the same few sentences at the end of every article:
Reporting By Laila Bassam in Aleppo and Tom Perry, John Davison and Lisa Barrington in Beirut;
Writing by Angus McDowall in Beirut, editing by Peter Millership
Because they appear at the end of every news article in some publications, “Reporting By” and “Writing By” will likely be identified as buzzwords. These terms could potentially link unrelated documents since they are not related to the article topics.
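One way to spot this kind of boilerplate outside the dashboard is to count how many documents each bi-gram appears in; pairs that occur in every document are candidates for a stopwords list. This is a minimal sketch with made-up articles and placeholder names, and `bigram_document_frequency` is a hypothetical helper rather than a Stratifyd feature.

```python
import re
from collections import Counter

def bigram_document_frequency(documents):
    """Count how many documents each adjacent word pair appears in."""
    df = Counter()
    for doc in documents:
        words = re.findall(r"[a-z]+", doc.lower())
        df.update(set(zip(words, words[1:])))  # count each pair once per document
    return df

# Made-up articles with placeholder bylines.
articles = [
    "Markets rallied on Tuesday. Reporting by Ann Lee; Writing by Bob Ray",
    "Floods hit the coast. Reporting by Cy Dow; Writing by Di Fox",
    "The election is close. Reporting by Ed Gil; Writing by Flo Hart",
]
df = bigram_document_frequency(articles)
candidates = sorted(pair for pair, n in df.items() if n == len(articles))
print(candidates)  # [('reporting', 'by'), ('writing', 'by')]
```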
To create stopwords via tuning
We recommend using the buzzword list in the topic wheel widget to tune your analyses. This allows you to go through the top terms in your list and strike out any bi-grams that are unhelpful in your analysis.
1. In a dashboard with ngrams in view, click Tune Analysis in the drop-down menu.

The Tune Analysis panel opens, showing any modifications you may have already made.
2. You have now entered Tuning Mode. In your buzzword widget, select any ngrams adding noise to your data. Anything you select in the cloud will be added to your Removed Buzzword (stopword) list.

3. Continue clicking buzzwords to remove, and then click the Reprocess command to add the bi-grams to a stopwords list.

4. In the Modify Advanced Settings dialog that appears, you can see your modifications. Enter a name for the list (or select a list if one is already saved) and click Reprocess.

The Data Settings tab shows your model processing. The modifications are removed from the Tune Analysis panel and added to the stopwords list.

When the model finishes processing, you can see that the stopwords no longer appear in the Buzzword list, and the new stopwords list appears in the Stopwords page where you can modify it and use it for other data sources and models.
Version control
Stratifyd preserves all versions of your custom stopwords lists within the system so that you can revert to a previous version for analysis. This feature is helpful when you are testing the effectiveness of adding and removing terms from the list.
-
The first import or creation of a list creates Version 1, or you can manually specify a version number.
-
The first edit made in the Custom Stopwords List dialog creates Version 2.
-
Each time you edit and save the list, it increments the version number.
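The versioning rules above can be modeled with a small class: the first version is 1 unless you specify another number, each save adds the next version, and earlier versions stay available. This is only a sketch of the described behavior; `VersionedStopwordsList` and its methods are hypothetical and do not correspond to a Stratifyd API.

```python
class VersionedStopwordsList:
    """Keeps every saved version of a stopwords list."""

    def __init__(self, terms, version=1):
        # The first import or creation is Version 1 unless a number is given.
        self.versions = {version: list(terms)}
        self.current = version

    def save(self, terms):
        """Store an edited list as the next version and return its number."""
        self.current = max(self.versions) + 1
        self.versions[self.current] = list(terms)
        return self.current

    def get(self, version):
        """Return a copy of the terms stored under an earlier version."""
        return list(self.versions[version])

stopwords = VersionedStopwordsList(["typical day"])
v2 = stopwords.save(["typical day", "hardest part"])
print(v2)                # 2
print(stopwords.get(1))  # ['typical day']
```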
To import a stopwords list
Here is a list of stopwords that excludes the most commonly used noisy terms in app store reviews. You can click this link to download it:
It is a simple text file with "stopword" as the first line, and a pair of stopwords on each line after it.
Because a stopwords list may contain many terms, you may prefer to import it rather than creating it within Stratifyd. Once imported, you can maintain it within Stratifyd.
The list can be a file of any type, but the best results come from a TXT or CSV file containing only the stopwords to use. Stopword pairs come about when you add to a stopwords list by clicking on bi-grams.
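If you want to check a file before uploading it, the format described above (a "stopword" header line followed by one term or word pair per line) can be read with a few lines of Python. This is a minimal sketch; `read_stopwords` is a hypothetical helper, and skipping blank lines is an assumption.

```python
import io

def read_stopwords(lines):
    """Return the terms from an iterable of lines, skipping the header."""
    terms = [line.strip() for line in lines]
    terms = [t for t in terms if t]  # assumption: blank lines are ignored
    if terms and terms[0].lower() == "stopword":
        terms = terms[1:]            # drop the "stopword" header line
    return terms

# In-memory stand-in for a TXT file; use open(path, encoding="utf-8") for a real one.
sample = io.StringIO("stopword\ntypical day\nhardest part\n")
terms = read_stopwords(sample)
print(terms)  # ['typical day', 'hardest part']
```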
1. From the Home page, on the Advanced tab, click Stopwords to open the Stopwords folder where you can access any existing stopwords lists.

2. In the bottom right corner, click the icon to create a stopwords list.

3. In the New Custom Stopwords List dialog that appears, in the Title field, enter a title for your list.
4. Next to the Add a new term button, click the ellipsis icon and select Upload a file.

5. In the Open dialog that appears, navigate to the text file (TXT or CSV or other) you want to use, select it, and click Open.
6. The stopwords are added to the list. Click the minus sign next to any terms that you do not want to include.
7. When you are finished, click Save. The list appears in the Stopwords folder.
To create a stopwords list
1. From the Home page, on the Advanced tab, click Stopwords to open the Stopwords folder where you can access existing stopwords lists.

2. In the bottom right corner, click the icon to create a stopwords list.

3. In the New Custom Stopwords List dialog that appears, in the Title field, enter a title for your list.

4. Click Add a new term and on the new line that appears, enter a stopword or bi-gram.

5. Repeat for each word that you want to add.
6. When you are finished, click Save. The list appears in the Stopwords folder.
To share a stopwords list
You can share stopwords lists with team members in the same way that you share dashboards and models.
1. In the Stopwords folder, on the tile for the stopwords list that you want to share, click the vertical ellipsis icon and select Share.

2. In the Share dialog that appears, in the search box, type the name of a member with whom to share and then select it from the auto-fill list.

Any member groups that you have created appear under Groups. Click the plus sign to add a group. Click the minus sign next to any member or group to remove it.
3. By default, the member is added with Can View permissions. Click the Can View button to change permissions to any of the following.

-
Can Share: The user can modify and share the list.
-
Owner: The user can modify, share, and delete the list, and manage all shared members.
-
Can Edit: The user can modify and share the list.
4. Click Submit. The initials of members with whom it is shared appear on the tile.
To apply a stopwords list to an analysis
When you create or reprocess data in a data model that tokenizes text (Unsupervised NLU Model, Supervised models, Neural Sentiment model, N-Gram Generator), you can apply stopwords lists. When multiple stopwords lists are applied to one model, they are treated as one large list. You can apply a stopwords list on the Models page or in a dashboard. Here we start from a dashboard.
Changing stopwords lists requires the model to reprocess the data, so we recommend making copies of the model in order to compare different list results side-by-side in the same dashboard.
1. On your dashboard, click the Settings tab.

2. In the Deployed section, select the model to which you want to apply a stopwords list.

3. In the model settings dialog that appears, click Switch to advanced setup.

4. Once in the advanced setup, switch to the Tuning tab.

5. Select Add under the Stopwords section.

6. Select the stopword list you would like to apply.

7. Once the stopwords list has been applied, rerun the model to incorporate the stopwords.

8. The list has now been applied, and you can see your model rerunning. When the model shows that it has finished, the stopwords no longer appear in the model output.

Stopwords Tips and Tricks
-
Be sure to give your analysis plenty of time to reprocess once a stopwords list is applied. If you report on a predetermined schedule (weekly, monthly, etc.) and notice noise you would like to tune out, it may be best to rerun the analysis after reporting is finished.
-
Once you have started the analysis rerun, you cannot add more stopwords to the list, so it is best to add as many buzzwords to the list as possible before rerunning. Try filtering the data in multiple ways (for example, by date, rating, or product) to surface different outputs before rerunning.
-
'Removed Buzzword' lists may be saved without being reprocessed. If you see noise in the Theme Detection that you would like removed but don't have time for the model to reprocess, you can start tuning the model, save the feedback list, and come back to reprocess the model later.
-
If the buzzword cloud you're hoping to use is a unigram list, the stopwords need to be applied to a completely standalone unigram analysis. You can do this by running a Buzzword analysis or Theme Detection model and setting the ngram length to 1 in the analysis advanced settings.