How does language detection work?
Language detection provides raw language data for the models and analyses listed below. Each of them runs language detection by default, although you can turn it off in the Advanced section when you create the model.
-
Topic Models (Theme Detection, Theme Summarization, Emerging Themes)
-
All of the Supervised models
-
N-Gram Generator
-
Taxonomy Analysis
-
Neural Sentiment model
Language detection tokenizes your input documents into unigrams and predicts the language of each record. If it cannot determine the language of a record, it assigns the specified default language (English unless you change it).
If you subscribe to the Translate Text feature, Language Detection works with that feature to determine the Translate From language when none is specified.
Here are some things to note about the way Language Detection works:
-
Language Detection only works on records with more than eight words.
-
If Language Detection cannot detect the language for a record, it marks the record as the language specified in the Default Language setting. If that matches your Translate To setting, it does not translate that record.
-
If a single record contains mixed languages, Language Detection marks the record as the predominant language in that record. For example, if a record has eight English words, three Chinese words, and two Spanish words, it is marked as English.
-
If Language Detection finds that the language for a record matches the Translate To language setting, it does not translate that record.
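The rules above can be sketched in a few lines of Python. This is an illustration only, not Stratifyd's implementation: the function names `label_record` and `should_translate`, and the idea of passing in one language guess per word, are assumptions made for the example. It also assumes that a record that is too short falls back to the default language, just like a record whose language cannot be detected.

```python
from collections import Counter

MIN_WORDS = 9  # detection needs more than eight words


def label_record(word_languages, default_language="en"):
    """Return one language code for a record.

    word_languages is a hypothetical list holding one language guess per
    word (None where a word could not be classified).
    """
    if len(word_languages) < MIN_WORDS:
        return default_language  # assumption: short records get the default
    counts = Counter(code for code in word_languages if code is not None)
    if not counts:
        return default_language  # nothing detected: use the default
    # The predominant language wins for mixed-language records.
    return counts.most_common(1)[0][0]


def should_translate(record_language, translate_to):
    """Skip translation when the record is already in the target language."""
    return record_language != translate_to


mixed = ["en"] * 8 + ["zh-cn"] * 3 + ["es"] * 2
print(label_record(mixed))                           # en
print(should_translate(label_record(mixed), "en"))   # False
```

With Translate To set to English, the mixed record in the example is labeled English and is therefore not translated.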
What languages are supported?
Language Detection can detect the following 54 languages.
-
Afrikaans - language code: af
-
Arabic - language code: ar
-
Bulgarian - language code: bg
-
Bangla - language code: bn
-
Czech - language code: cs
-
Welsh - language code: cy
-
Danish - language code: da
-
German - language code: de
-
Greek - language code: el
-
English - language code: en
-
Spanish - language code: es
-
Estonian - language code: et
-
Persian - language code: fa
-
Finnish - language code: fi
-
French - language code: fr
-
Gujarati - language code: gu
-
Hebrew - language code: he
-
Hindi - language code: hi
-
Croatian - language code: hr
-
Hungarian - language code: hu
-
Indonesian - language code: id
-
Italian - language code: it
-
Japanese - language code: ja
-
Kannada - language code: kn
-
Korean - language code: ko
-
Lithuanian - language code: lt
-
Latvian - language code: lv
-
Macedonian - language code: mk
-
Malayalam - language code: ml
-
Marathi - language code: mr
-
Nepali - language code: ne
-
Dutch - language code: nl
-
Norwegian - language code: no
-
Punjabi - language code: pa
-
Polish - language code: pl
-
Portuguese - language code: pt
-
Romanian - language code: ro
-
Russian - language code: ru
-
Slovak - language code: sk
-
Slovenian - language code: sl
-
Somali - language code: so
-
Albanian - language code: sq
-
Swedish - language code: sv
-
Swahili - language code: sw
-
Tamil - language code: ta
-
Telugu - language code: te
-
Thai - language code: th
-
Tagalog - language code: tl
-
Turkish - language code: tr
-
Ukrainian - language code: uk
-
Urdu - language code: ur
-
Vietnamese - language code: vi
-
Simplified Chinese - language code: zh-cn
-
Traditional Chinese - language code: zh-tw
Why use Language Detection?
Topic Models provide a language detection feature, but they may fail if your volume of data is too low. The machine learning (ML) portion of those models requires a certain amount of data in order to converge. Language Detection allows you to detect languages even with limited data.
The amount of data that Topic Models require depends on the data itself. For example, if your documents are long, like news articles, you need fewer documents. If the documents are very short, however, a Topic Model might still fail even with a lot of documents.
Input fields
One Text field is required. Any field that you choose is mapped as a text field. If you choose multiple fields, the model concatenates their text and treats it as a single body of text for each record.
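As a rough sketch of what choosing multiple fields means, the snippet below joins the chosen fields of one record into a single string. The helper name `combine_text_fields`, the sample field names, and the use of a space as the separator are assumptions for illustration; the product's actual joining behavior is not documented here.

```python
def combine_text_fields(record, text_fields):
    """Join the chosen fields of one record into a single text string."""
    # Empty or missing fields are skipped so they add no stray separators.
    parts = [str(record[name]) for name in text_fields if record.get(name)]
    return " ".join(parts)


record = {
    "title": "Great service",
    "body": "The agent solved my issue quickly.",
    "rating": 5,
}
print(combine_text_fields(record, ["title", "body"]))
# Great service The agent solved my issue quickly.
```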
Output fields
The model returns the following fields for use in widget visualizations.
-
language.code: The language code of the detected language, e.g. en, fr, ja. Most codes are two letters; Simplified Chinese and Traditional Chinese use zh-cn and zh-tw.
-
language.name: The full name of the detected language, e.g. English, French, Japanese.
-
tokenized: A list of every word detected in the corpus. Tokenizing is trivial for languages that put spaces between words. Languages that do not put spaces between words, and in which multi-character words are possible, each require a custom tokenizer.
-
unigrams: A textual array of the single words in the data stream. Stratifyd calculates the total number of words and the number of unique values. This field is useful in a word cloud viewed with filters on average sentiment.
-
Data: A table containing all of the original data from the data stream, plus all of the analyzed data from the model.
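To make the two counts mentioned for the unigrams field concrete, here is a minimal sketch that computes the total number of words and the number of unique values from a list of unigrams. The function name `unigram_stats` and the sample words are hypothetical, and this is not how Stratifyd computes the figures internally.

```python
from collections import Counter


def unigram_stats(unigrams):
    """Return (total words, unique words, per-word counts) for a list of unigrams."""
    counts = Counter(unigrams)
    return sum(counts.values()), len(counts), counts


words = ["the", "app", "crashed", "the", "app", "froze"]
total, unique, counts = unigram_stats(words)
print(total, unique)   # 6 4
print(counts["app"])   # 2
```

The per-word counts are the kind of data a word cloud would size its terms by.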
To run Language Detection
Here are the steps to create a Language Detection analysis from within a dashboard.
1. Open the dashboard to which you want to add the analysis and open the data settings panel.
2. Select the deployed model to which you want to apply translation, or deploy a new model.
3. In the model dialog box that opens, switch to advanced settings and click the Translation tab.
4. Toggle Language Detection to ON.
You can also select your desired translation engine and set a minimum confidence threshold that controls which records are sent to translation. For example, suppose English is the Translate To language and the confidence threshold is .90. A record detected as English with 96% (.96) confidence meets the threshold, so it is not sent to translation. Setting a higher threshold results in more records being sent to translation.
5. When you have finished changing settings, click Rerun.
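The confidence threshold described in step 4 can be read as the following decision rule. This is a sketch under stated assumptions, not the product's code: the function name `needs_translation` and its parameters are hypothetical, and treating a confidence exactly equal to the threshold as "confident enough" is a guess.

```python
def needs_translation(detected_code, confidence, translate_to, min_confidence=0.90):
    """Decide whether a record is sent to translation.

    A record is skipped only when it is detected as the target language
    with confidence at or above the threshold (the "at or above" part is
    an assumption).
    """
    already_in_target = (
        detected_code == translate_to and confidence >= min_confidence
    )
    return not already_in_target


# English detected at .96 with a .90 threshold: not translated.
print(needs_translation("en", 0.96, translate_to="en"))                        # False
# Raising the threshold to .98 sends the same record to translation.
print(needs_translation("en", 0.96, translate_to="en", min_confidence=0.98))   # True
# A record detected as another language is always sent.
print(needs_translation("fr", 0.96, translate_to="en"))                        # True
```

The second call shows why a higher threshold sends more records to translation: fewer detections are trusted enough to be skipped.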
Further questions?
We're here to help! Don't hesitate to contact us via chat or submit a ticket for further assistance.