Stratifyd provides a built-in stopwords list of the most commonly used uninformative words, such as “the,” “an,” and “I,” to reduce noise in textual data.

When data models tokenize the data, they check each word against the stopwords list and filter out the stopwords.
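
As a rough illustration of this check, the Python sketch below tokenizes a sentence and drops any token found in a stopwords set. The BUILT_IN_STOPWORDS set and both function names are hypothetical stand-ins, not Stratifyd's actual list or code.

```python
import re

# Hypothetical, tiny stand-in for a built-in stopwords list.
BUILT_IN_STOPWORDS = {"the", "an", "a", "i", "is", "to"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopwords set."""
    return [t for t in tokens if t not in stopwords]

tokens = tokenize("I think the app is slow to load")
filtered = remove_stopwords(tokens, BUILT_IN_STOPWORDS)
print(filtered)  # ['think', 'app', 'slow', 'load']
```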

You can also create custom stopwords lists for each data source and apply them when you create data models. The Stopwords setting is in the Advanced section of the new model wizard for all applicable models. See articles on individual data models for more information.

Why use a custom stopwords list?

The built-in stopwords list is comprehensive for general purposes, but some analyses may achieve more meaningful results with the addition of a data-source-specific stopwords list. 

You can use any list of words as stopwords. The effect may not be dramatic because Stratifyd uses pointwise mutual information (PMI), which already filters out many of the words you might list as stopwords by giving them very low scores. However, you may still want to upload a list of stopwords to train the system for specific analyses.
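
To see why PMI tends to score pairs built from very common words lower, here is a minimal Python sketch using the standard definition, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), with probabilities estimated from raw counts. How Stratifyd computes and thresholds PMI internally is not described in this article, so the function name and sample documents are illustrative assumptions only.

```python
import math
from collections import Counter

def pmi_scores(documents):
    """Return the PMI of each adjacent word pair across tokenized documents."""
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in documents:
        word_counts.update(tokens)
        pair_counts.update(zip(tokens, tokens[1:]))
    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for (x, y), n in pair_counts.items():
        p_xy = n / total_pairs            # how often the pair occurs
        p_x = word_counts[x] / total_words  # how often each word occurs
        p_y = word_counts[y] / total_words
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Made-up sample data for illustration.
docs = [
    ["the", "battery", "life", "is", "great"],
    ["the", "battery", "life", "is", "short"],
    ["the", "screen", "is", "the", "best"],
    ["the", "price", "is", "the", "problem"],
]
scores = pmi_scores(docs)
# The informative pair scores higher than the pair of common words.
print(scores[("battery", "life")] > scores[("is", "the")])  # True
```

Both pairs occur twice in this sample, but "is" and "the" are frequent on their own, so their co-occurrence is less surprising and their PMI is lower.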

You can apply multiple stopwords lists to the same data model and Stratifyd treats them as a single, combined list.

Supported languages

The following languages have a default stopwords list. During data ingestion, Stratifyd determines the language (if the Language Detection advanced option is not turned off) and checks whether a default stopwords list exists for that language. It also checks for any custom stopwords lists applied to the model, and applies these lists regardless of language.

  • ar Arabic
  • bg Bulgarian
  • br Breton
  • cz Czech
  • da Danish
  • de German
  • el Greek
  • en English
  • es Spanish
  • fa Farsi
  • fi Finnish
  • fr French
  • hi Hindi
  • hu Hungarian
  • hy Armenian
  • id Indonesian
  • it Italian
  • ja Japanese
  • lv Latvian
  • nl Dutch
  • no Norwegian
  • pl Polish
  • pt Portuguese
  • ro Romanian
  • ru Russian
  • sv Swedish
  • tr Turkish
  • zh Chinese
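
The selection logic described above can be sketched in Python as follows: look up a default list for the detected language, then add every custom list regardless of language. The DEFAULT_STOPWORDS contents and the effective_stopwords function are hypothetical and only illustrate the behavior; they are not part of Stratifyd.

```python
# Hypothetical default lists keyed by language code (contents abbreviated).
DEFAULT_STOPWORDS = {
    "en": {"the", "an", "i"},
    "de": {"der", "die", "und"},
}

def effective_stopwords(detected_language, custom_lists):
    """Combine the default list for the detected language, if one exists,
    with every custom list applied to the model."""
    combined = set(DEFAULT_STOPWORDS.get(detected_language, set()))
    for custom in custom_lists:
        combined |= set(custom)  # multiple lists act as one combined list
    return combined

custom_lists = [["typical day", "hardest part"], ["reporting by"]]
german = effective_stopwords("de", custom_lists)
unknown = effective_stopwords("xx", custom_lists)  # no default list exists

print(sorted(german))
print(sorted(unknown))
```

For a language without a default list, only the custom terms remain in the combined set.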

Custom stopwords lists

A stopwords list is a list of terms that the analytics engine ignores when looking for patterns and connections in your data. 

Most data sources have noisy terms that are specific to that source. For example, if you analyze data from the Indeed connector, you can see a high volume of "typical day" or "hardest part" bi-grams in the buzzword cloud. Reviewers look at the form in the image below as they fill out their textual responses, so the survey biases them to use those terms, making them good candidates to filter out to get to the organic content.

A new data source is run with the default stopwords list, and our in-dashboard editor allows you to select N-Grams, topics, contributors, or other features to suppress.

For example, RSS news feeds typically contain the same few sentences at the end of every article:

Reporting By Laila Bassam in Aleppo and Tom Perry, John Davison and Lisa Barrington in Beirut;
Writing by Angus McDowall in Beirut, editing by Peter Millership

Because they appear at the end of every news article in some publications, “Reporting By” and “Writing By” will likely be identified as buzzwords. Since these terms are unrelated to the article topics, they could link documents that otherwise have nothing in common.
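
If you want to look for such boilerplate outside of Stratifyd before building a list, one possible approach is to count how many documents each bi-gram appears in and flag those present in nearly all of them. The Python sketch below assumes this approach; the function names, the 0.9 threshold, and the sample articles are all made up for illustration.

```python
import re
from collections import Counter

def bigrams(text):
    """Return the set of adjacent lowercase word pairs in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return set(zip(words, words[1:]))

def boilerplate_candidates(articles, min_share=0.9):
    """Return bi-grams that occur in at least min_share of the articles."""
    doc_freq = Counter()
    for article in articles:
        doc_freq.update(bigrams(article))  # count each bi-gram once per article
    threshold = min_share * len(articles)
    return sorted(bg for bg, n in doc_freq.items() if n >= threshold)

# Made-up sample articles with a shared credit line.
articles = [
    "Oil prices rose sharply. Reporting by A. Smith; Writing by B. Jones",
    "Voters head to the polls. Reporting by C. Lee; Writing by D. Kim",
    "Storm closes two airports. Reporting by E. Wong; Writing by F. Diaz",
]
candidates = boilerplate_candidates(articles)
print(candidates)  # [('reporting', 'by'), ('writing', 'by')]
```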

To create stopwords via tuning

We recommend using the buzzword list in the topic wheel widget to tune your analyses. This allows you to go through the top terms in your list and strike out any bi-grams that are unhelpful in your analysis.

1. In a dashboard with the topic wheel widget in view, click the Tune Analysis icon.

The Tune Analysis panel opens, showing any modifications you may have already made. 

2. Click Enter Tuning Mode. Then, in the Buzzword list of your topic wheel widget, click a bi-gram to strike it out and add it to the stopwords list.

3. Continue clicking buzzwords to remove, and then click the Reprocess command to add the bi-grams to a stopwords list.

4. In the Modify Advanced Settings dialog that appears, in the Stopwords section, you can see your modifications. Enter a name for the list (or select a list if one is already saved) and click Reprocess.

The Data Ingestion panel opens and shows your model processing. The modifications are removed from the Tune Analysis panel and added to the stopwords list.

When the model finishes processing, you can see that the stopwords no longer appear in the Buzzword list, and the new stopwords list appears in the Stopwords page where you can modify it and use it for other data sources and models.

Version control

Stratifyd preserves all versions of your custom stopwords lists within the system so that you can revert to a previous version for analysis. This feature is helpful when you are testing the effectiveness of adding and removing terms from the list.

  • The first import or creation of a list creates Version 1, or you can manually specify a version number.
  • The first edit made in the Custom Stopwords List dialog creates Version 2.
  • Each time you edit and save the list, it increments the version number.
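
The versioning rules above can be modeled with a small Python class, shown here only to make the numbering concrete. The class and method names are hypothetical and do not correspond to any Stratifyd API.

```python
class VersionedStopwordsList:
    """Keeps every saved version of a stopwords list so any can be recalled."""

    def __init__(self, title, terms, version=1):
        # Creation or import starts at Version 1 unless a number is specified.
        self.title = title
        self.version = version
        self._history = {version: list(terms)}

    def save(self, terms):
        """Store an edited list under the next version number."""
        self.version += 1
        self._history[self.version] = list(terms)
        return self.version

    def terms(self, version=None):
        """Return the terms of the given version (latest by default)."""
        key = self.version if version is None else version
        return list(self._history[key])

stopwords = VersionedStopwordsList("App reviews", ["great app"])
stopwords.save(["great app", "love it"])

print(stopwords.version)   # 2
print(stopwords.terms(1))  # ['great app']
```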

To import a stopwords list

Here is a list of stopwords that excludes the most commonly used noisy terms in app store reviews. You can click this link to download it:

Mobile App Stopwords List

It is a simple text file with "stopword" as the first line, and a pair of stopwords on each line after it. 

Because a stopwords list may contain many terms, you may prefer to import it rather than creating it within Stratifyd. Once imported, you can maintain it within Stratifyd.

The list can be a file of any type, but the best results come from a TXT or CSV file containing only the stopwords to use. Stopword pairs come about when you add to a stopwords list by clicking on bi-grams.
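
If you prepare or check such a file before uploading it, the layout described above (a "stopword" header line followed by one term or pair per line) can be read with a few lines of Python. This is a sketch under that assumption; read_stopwords is a hypothetical helper, and Stratifyd's own import logic may differ.

```python
import io

def read_stopwords(lines, header="stopword"):
    """Read one term per line, skipping blank lines, a leading header
    line, and duplicate terms."""
    terms = []
    first = True
    for raw in lines:
        term = raw.strip().lower()
        if not term:
            continue
        if first:
            first = False
            if term == header:
                continue  # skip the header only when it is the first entry
        if term not in terms:
            terms.append(term)
    return terms

# In-memory stand-in for a TXT file; open("your_list.txt") works the same way.
sample = io.StringIO("stopword\ngreat app\nlove it\n\ngreat app\n")
terms = read_stopwords(sample)
print(terms)  # ['great app', 'love it']
```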

1. From the Home page, on the Advanced tab, click Stopwords to open the Stopwords folder where you can access any existing stopwords lists. 

2. In the bottom right corner, click the icon to create a stopwords list. 

3. In the New Custom Stopwords List dialog that appears, in the Title field, enter a title for your list.

4. Next to the Add a new term button, click the ellipsis icon and select Upload a file.

5. In the Open dialog that appears, navigate to the file (TXT, CSV, or other) that you want to use, select it, and click Open.

6. The stopwords are added to the list. Click the minus sign next to any terms that you do not want to include.

7. When you are finished, click Save. The list appears in the Stopwords folder.

To create a stopwords list

1. From the Home page, on the Advanced tab, click Stopwords to open the Stopwords folder where you can access any existing stopwords lists. 

2. In the bottom right corner, click the icon to create a stopwords list.

3. In the New Custom Stopwords List dialog that appears, in the Title field, enter a title for your list. 

4. Click Add a new term and on the new line that appears, enter a stopword or bi-gram. 

5. Repeat for each word that you want to add.

6. When you are finished, click Save. The list appears in the Stopwords folder.

To share a stopwords list

You can share stopwords lists with team members in the same way that you share dashboards and models.

1. In the Stopwords folder, on the tile for the stopwords list that you want to share, click the vertical ellipsis icon and select Share. 

2. In the Manage Members dialog that appears, in the search box, type the name of a member with whom to share and then select it from the auto-fill list.

Any member groups that you have created appear under Groups. Click the plus sign to add a group. Click the minus sign next to any member or group to remove it.

3. By default, the member is added with Can View permissions. Click the Can View button to change permissions to any of the following.

  • Can Share: The user can modify and share the list.
  • Owner: The user can modify, share, and delete the list, and manage all shared members.
  • Can Edit: The user can modify and share the list.

4. Click Submit. The initials of members with whom it is shared appear on the tile.

To apply a stopwords list to an analysis

When you create or reprocess data in a data model that tokenizes text (Unsupervised NLU Model, Supervised models, Neural Sentiment model, N-Gram Generator), you can apply stopwords lists. Multiple stopwords lists are treated as one large list. You can apply a stopwords list on the Models page or in a dashboard. Here we start from a dashboard.

Changing stopwords lists requires the model to reprocess the data, so we recommend making copies of the model in order to compare different list results side-by-side in the same dashboard.

1. On your dashboard, click the Data icon.

2. In the Data Ingestion panel that appears, next to the data stream to which you want to apply a stopwords list, click the vertical ellipsis icon and select Edit.

3. In the Edit Data Streams dialog that appears, click the tile of a model to which you want to apply it.

4. In the Edit model name wizard that appears, click Next.

5. On the Complete & Submit page of the wizard, click Advanced.

6. In the Stopwords box, click the plus sign. 

7. In the Search dialog that appears, click the tile of the stopwords list that you want to apply.

8. The list is added to the Stopwords box. To add more, click the plus sign again. When you are ready, click Submit. 

9. Back in the Edit Data Streams dialog, the model is marked with a RERUN tag. Click Submit. The model reprocesses using your stopwords list in addition to the built-in Stratifyd stopwords list. 

You can tell which stopwords lists are applied to a model by clicking the model.
