How does language detection work?

Language Detection provides raw language data for the following models and analyses. Each of them runs Language Detection by default, although you can turn it off in the Advanced section when you create the model.

  • Auto-Topic Predictive Model (Unsupervised NLU)
  • All of the Supervised models
  • N-Gram Generator
  • Taxonomy Analysis
  • Sentiment Analysis
  • Neural Sentiment model

Language Detection tokenizes your input documents into unigrams and predicts the language for each record. If it cannot determine the language for a record, it assumes the specified default language (English by default).

If you subscribe to the Translate Text feature, Language Detection works with that feature to determine the Translate From language when none is specified.

Here are some things to note about the way Language Detection works:

  • Language Detection only works on records with more than eight words.
  • If Language Detection cannot detect the language for a record, it marks the record as the language specified in the Default Language setting. If that matches your Translate To setting, it does not translate that record.
  • If there are mixed languages within a single record, Language Detection marks the record as the predominant language in that record (for example, if there are eight English words, three Chinese words, and two Spanish words, it marks the record as English).
  • If Language Detection finds that the language for a record matches the Translate To language setting, it does not translate that record.
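
The rules above can be sketched in Python. This is only an illustration under stated assumptions, not the product's implementation: word_language is a hypothetical stand-in for the real per-word predictor, MIN_WORDS encodes the "more than eight words" rule, and needs_translation is an invented helper name.

```python
from collections import Counter
from typing import Callable, Optional

MIN_WORDS = 9  # assumption: "more than eight words" means at least nine


def detect_record_language(
    text: str,
    word_language: Callable[[str], Optional[str]],  # hypothetical per-word predictor
    default_language: str = "en",
) -> str:
    """Return the predominant language code for one record, or the default."""
    # Naive whitespace split; languages without spaces need their own tokenizer.
    words = text.split()
    if len(words) < MIN_WORDS:
        return default_language  # too short to detect
    # Count one vote per word whose language the predictor could identify.
    votes = Counter(
        lang for lang in map(word_language, words) if lang is not None
    )
    if not votes:
        return default_language  # nothing could be detected
    return votes.most_common(1)[0][0]  # the predominant language wins


def needs_translation(detected: str, translate_to: str) -> bool:
    """A record is skipped when it is already in the Translate To language."""
    return detected != translate_to
```

With a predictor that labels eight words English, three Chinese, and two Spanish, detect_record_language returns en, and needs_translation("en", "en") is False, so that record would not be translated.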

What languages are supported?

Language Detection can detect the following 54 languages.

  • Afrikaans - language code: af
  • Arabic - language code: ar
  • Bulgarian - language code: bg
  • Bangla - language code: bn
  • Czech - language code: cs
  • Welsh - language code: cy
  • Danish - language code: da
  • German - language code: de
  • Greek - language code: el
  • English - language code: en
  • Spanish - language code: es
  • Estonian - language code: et
  • Persian - language code: fa
  • Finnish - language code: fi
  • French - language code: fr
  • Gujarati - language code: gu
  • Hebrew - language code: he
  • Hindi - language code: hi
  • Croatian - language code: hr
  • Hungarian - language code: hu
  • Indonesian - language code: id
  • Italian - language code: it
  • Japanese - language code: ja
  • Kannada - language code: kn
  • Korean - language code: ko
  • Lithuanian - language code: lt
  • Latvian - language code: lv
  • Macedonian - language code: mk
  • Malayalam - language code: ml
  • Marathi - language code: mr
  • Nepali - language code: ne
  • Dutch - language code: nl
  • Norwegian - language code: no
  • Punjabi - language code: pa
  • Polish - language code: pl
  • Portuguese - language code: pt
  • Romanian - language code: ro
  • Russian - language code: ru
  • Slovak - language code: sk
  • Slovenian - language code: sl
  • Somali - language code: so
  • Albanian - language code: sq
  • Swedish - language code: sv
  • Swahili - language code: sw
  • Tamil - language code: ta
  • Telugu - language code: te
  • Thai - language code: th
  • Tagalog - language code: tl
  • Turkish - language code: tr
  • Ukrainian - language code: uk
  • Urdu - language code: ur
  • Vietnamese - language code: vi
  • Simplified Chinese - language code: zh-cn
  • Traditional Chinese - language code: zh-tw

Why use Language Detection?

The Auto-Topic Predictive Model (Unsupervised NLU) includes a language detection feature, but that model may fail if your volume of data is too low, because its machine learning (ML) portion requires a certain amount of data in order to converge. Language Detection allows you to detect languages even with limited data.

The amount of data that the Auto-Topic Predictive Model (Unsupervised NLU) requires depends on the data that you have. For example, if your documents are long, like news articles, you need fewer documents. However, if the documents are very short, the model might still fail even with a lot of documents.

Input fields

One Text field is required. Any field that you choose is mapped as a text field. If you choose multiple fields, Language Detection concatenates all of their text and treats it as a single block of text for each record.
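
As a rough illustration of the concatenation step, the following Python sketch joins the selected fields of one record. The function name build_detection_text, the single-space separator, and the skipping of empty fields are all assumptions made for the example; the source does not say how the fields are actually joined.

```python
from typing import Mapping, Sequence


def build_detection_text(
    record: Mapping[str, object], text_fields: Sequence[str]
) -> str:
    """Join the selected fields of one record into a single block of text."""
    parts = []
    for field in text_fields:
        value = record.get(field)
        if value is None:
            continue  # assumption: missing fields are skipped
        value = str(value).strip()
        if value:
            parts.append(value)
    return " ".join(parts)  # assumption: a single space is the separator


record = {"title": "Great service", "body": "Would order again.", "rating": 5}
combined = build_detection_text(record, ["title", "body"])
```

Here combined holds the title and body as one string, which is the text the detector would then see for that record.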

Output fields

The model returns the following fields for use in widget visualizations.

  • language.code: The language code of the detected language, e.g. en, fr, ja. Codes are two letters long, except for zh-cn and zh-tw.
  • language.name: The full name of the detected language, e.g. English, French, Japanese.
  • tokenized: A list of every word detected in the corpus. Tokenizing is trivial for languages that use spaces between words. Languages that have no spaces between words, and in which multi-character words are possible, each require a custom tokenizer.
  • unigrams: A textual array of single words within the data stream. Stratifyd calculates the total number of words and the number of unique values. This field is useful in a word cloud viewed with filters on average sentiment.
  • Data: A table containing all of the original data from the data stream, plus all of the analyzed data from the model.
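
The two counts described for the unigrams field can be reproduced with a short Python sketch. The input shape (one list of unigrams per record) and the function name unigram_summary are assumptions for illustration; the actual output format may differ.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def unigram_summary(records: Iterable[List[str]]) -> Tuple[int, int, Counter]:
    """Return (total words, unique words, per-word counts) across all records."""
    counts = Counter(word for unigrams in records for word in unigrams)
    total = sum(counts.values())  # every occurrence of every word
    unique = len(counts)          # number of distinct words
    return total, unique, counts


# Hypothetical unigrams arrays for two records.
sample = [["good", "service"], ["good", "price"]]
total, unique, counts = unigram_summary(sample)
```

The per-word counts are the kind of values a word cloud would size its words by.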

To run Language Detection

Here are the steps to create a Language Detection analysis from within a dashboard.

1. Open the dashboard to which you want to add the analysis and open the data settings panel.

2. Select the deployed model you want to apply translation to or deploy a new model.

3. In the model dialog box that opens, switch to advanced settings and click the Translation tab.

4. Toggle Language Detection to on (it defaults to on).
