Supported languages

ricelang detects 50+ labels in total (25 trained, 27 rule-based). Some come from a trained fastText classifier, which is needed for the shared scripts (Latin and the Myanmar block); the rest come from Unicode-script rules, where one script maps to one language and no ML is needed.

Trained classifier (25 labels)

These use the bundled pdsdetect.ftz model (~1.8 MB). All scored 99.2–100% on held-out validation.
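For readers who want to query a fastText model like this one directly, the following is a minimal sketch rather than ricelang's own API. It assumes the model uses fastText's default `__label__` prefix and that the file path is known to the caller; `parse_prediction` and `classify` are hypothetical helper names.

```python
LABEL_PREFIX = "__label__"  # fastText's default prefix; assumed, not confirmed for this model


def parse_prediction(labels, probs):
    """Turn fastText output such as ('__label__mya',), [0.98] into ('mya', 0.98)."""
    label = labels[0]
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    return label, float(probs[0])


def classify(model, text):
    """Return (label, probability) for the top prediction of a loaded model."""
    # fastText's predict() processes one line at a time, so flatten newlines first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return parse_prediction(labels, probs)


# Usage (requires the `fasttext` package and the model file on disk):
#   import fasttext
#   model = fasttext.load_model("pdsdetect.ftz")  # path is an assumption
#   label, prob = classify(model, "some text")
```

Passing the loaded model in, rather than loading it inside `classify`, avoids re-reading the file on every call.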

Myanmar region

| Label | Language | Notes |
|---|---|---|
| mya | Burmese (Unicode) | Myanmar script |
| zgi | Burmese (Zawgyi) | Legacy encoding, not a separate language; use cvt2uni() |
| ksw | S'gaw Karen | Myanmar script |
| pwo | Pwo Western Karen | Myanmar script |
| kvq | Geba Karen | Myanmar script (non-Roman version) |
| cnh | Hakha Chin (Lai) | Latin script |
| cfm | Falam Chin | Latin script |
| ctd | Tedim Chin | Latin script |
| eky | Eastern Kayah | Kayah Li script (also handled by Unicode rule) |
| shn | Shan (Tai Yai) | Myanmar block, Shan extension subrange |
| kac | Jingphaw (Kachin) | Latin script |
| mnw | Mon | Myanmar script |

Broader SE / South Asia

| Label | Language |
|---|---|
| eng | English |
| hin | Hindi |
| khm | Khmer |
| lao | Lao |
| msa | Malay |
| tam | Tamil |
| tgl | Tagalog |
| tha | Thai |
| vie | Vietnamese |
| zho | Chinese |

Regional & script variants

| Label | Language |
|---|---|
| ban | Balinese (Latin) |
| hnn | Hanunoo |
| sun | Sundanese (Latin) |

Script-rule freebies (27 labels)

These are detected by a Unicode-block check alone, with no ML and no training data. The check is deterministic, so it is accurate by construction whenever a script is used by only one language in the set.
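A Unicode-block check of this kind can be sketched in a few lines. This is an illustration, not ricelang's implementation: `EXAMPLE_RULES`, `label_for_char`, `detect_by_script`, and the `min_share` threshold are all hypothetical, and the table holds only a handful of blocks. The block boundaries themselves are the standard Unicode ones.

```python
from collections import Counter

# Illustrative subset only: (first code point, last code point, label).
EXAMPLE_RULES = [
    (0x0900, 0x097F, "hin"),  # Devanagari
    (0x0B80, 0x0BFF, "tam"),  # Tamil
    (0x0E00, 0x0E7F, "tha"),  # Thai
    (0x0E80, 0x0EFF, "lao"),  # Lao
    (0x10A0, 0x10FF, "kat"),  # Georgian
    (0x1780, 0x17FF, "khm"),  # Khmer
    (0xAC00, 0xD7AF, "kor"),  # Hangul Syllables
]


def label_for_char(ch):
    """Return the label whose block contains ch, or None."""
    cp = ord(ch)
    for lo, hi, label in EXAMPLE_RULES:
        if lo <= cp <= hi:
            return label
    return None


def detect_by_script(text, min_share=0.5):
    """Return a label if one block covers more than min_share of the
    non-whitespace characters; otherwise None (no rule applies)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return None
    counts = Counter(lbl for lbl in map(label_for_char, chars) if lbl)
    if not counts:
        return None
    label, hits = counts.most_common(1)[0]
    return label if hits / len(chars) > min_share else None
```

Counting all non-whitespace characters, rather than only `str.isalpha()` ones, keeps combining vowel signs (common in Thai and Khmer) in the tally.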

Already in the trained set, but caught by rule first (faster)

hin Devanagari · tam Tamil · tha Thai · lao Lao · khm Khmer · eky Kayah Li · zho CJK
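The rule-first ordering can be pictured as a small router: try the deterministic rule, and only call the classifier when no rule fires. The sketch below assumes nothing about ricelang's internals; `detect`, `thai_only_rule`, and `stub_classifier` are hypothetical stand-ins, and the stub returns a fixed label purely for demonstration.

```python
def detect(text, script_rule, classifier):
    """Return (label, source): the rule's answer if it has one,
    otherwise whatever the classifier predicts."""
    label = script_rule(text)
    if label is not None:
        return label, "rule"
    return classifier(text), "model"


def thai_only_rule(text):
    """Toy rule: fires only when every non-space character is in the Thai block."""
    chars = [ch for ch in text if not ch.isspace()]
    if chars and all(0x0E00 <= ord(ch) <= 0x0E7F for ch in chars):
        return "tha"
    return None


def stub_classifier(text):
    """Placeholder for the trained model; always answers 'eng'."""
    return "eng"
```

Because the rule runs first, text in a single-language script never reaches the model, which is where the speed gain comes from.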

New labels added via script rule

| Label | Language | Script block |
|---|---|---|
| kor | Korean | Hangul (U+AC00–D7AF) |
| jpn | Japanese | Hiragana + Katakana |
| ell | Greek | Greek + Greek Extended |
| heb | Hebrew | Hebrew + presentation forms |
| hye | Armenian | Armenian |
| kat | Georgian | Georgian |
| amh | Amharic | Ethiopic |
| sin | Sinhala | Sinhala |
| bod | Tibetan | Tibetan |
| chr | Cherokee | Cherokee |
| nqo | N'Ko | N'Ko |
| mon | Mongolian (traditional script) | Mongolian |
| vai | Vai | Vai |
| ff | Fulani (Adlam) | Adlam |
| mww | Hmong | Pahawh Hmong |
| bax | Bamum | Bamum |
| lep | Lepcha | Lepcha |
| lif | Limbu | Limbu |
| saz | Saurashtra | Saurashtra |
| bug | Buginese | Buginese |
| jav | Javanese (script) | Javanese |
| cjm | Cham | Cham |
| mni | Meetei (Manipuri) | Meetei Mayek |
| nod | Lanna (Northern Thai) | Tai Tham |
| sat | Santali | Ol Chiki |
| khb | Tai Lue | New Tai Lue |
| tdd | Tai Nüa | Tai Le |

Adding more monopoly-script languages is a one-line addition to SCRIPT_RULES in ricelang/scripts.py — see Hierarchical detector.
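The actual layout of SCRIPT_RULES is not shown on this page, so the following is only a guess at what such an addition might look like. It assumes a dict mapping each label to a single (first, last) code point pair; the real structure in ricelang/scripts.py may differ. The `guj` entry and `rule_label` helper are hypothetical.

```python
# Assumed shape: label -> (first code point, last code point).
SCRIPT_RULES = {
    "kor": (0xAC00, 0xD7AF),  # Hangul Syllables
    "kat": (0x10A0, 0x10FF),  # Georgian
}

# The one-line addition: Gujarati has its own block (U+0A80 to U+0AFF).
SCRIPT_RULES["guj"] = (0x0A80, 0x0AFF)


def rule_label(ch):
    """Look up the label whose range contains ch, or None."""
    cp = ord(ch)
    for label, (lo, hi) in SCRIPT_RULES.items():
        if lo <= cp <= hi:
            return label
    return None
```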

Languages not supported

The following are deliberately omitted because text-only character-n-gram detection cannot meaningfully distinguish them from a sibling language already in the set (the text is often literally identical):

  • Indonesian (ind): shares ~80% vocabulary with msa. "Terima kasih" is valid in both Malay and Indonesian.
  • Rakhine (rki): uses the same Myanmar script as mya, and many short phrases are interchangeable.
  • Chinese Traditional (zho_hant): many characters are identical in Traditional and Simplified, so the difference shows up only when a text happens to contain characters that differ between the two.

For these cases, caller-side metadata (region, user profile, surrounding text) is the right disambiguator, not the model.
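One way a caller might apply such metadata is a small override table keyed on the detected label and a region code. This is a sketch under stated assumptions, not part of ricelang: `REGION_OVERRIDES` and `refine_label` are hypothetical, regions are assumed to be ISO 3166-1 alpha-2 codes, and the mapping shown is illustrative rather than exhaustive.

```python
# Hypothetical mapping: (detected label, region code) -> refined label.
REGION_OVERRIDES = {
    ("msa", "ID"): "ind",       # Malay-looking text from Indonesia
    ("zho", "TW"): "zho_hant",  # Chinese text from Taiwan
    ("zho", "HK"): "zho_hant",  # Chinese text from Hong Kong
}


def refine_label(label, region=None):
    """Return a more specific label when the caller's region implies one;
    otherwise return the detector's label unchanged."""
    if region is None:
        return label
    return REGION_OVERRIDES.get((label, region.upper()), label)
```

Keeping this step outside the detector means the model's output stays reproducible, and the caller decides how much to trust its own metadata.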

Mon detection

The mnw training data comes from the Mon Wikipedia dump (135k paragraphs) rather than YouVersion, because Mon has no Bible on that platform. All other trained labels come from YouVersion Bible scrapes.