Migrating from pyidaungsu¶
ricelang is a revamp of pyidaungsu.
If you're coming from pyidaungsu 0.1.x, here's everything that changed.
TL;DR¶
# before
import pyidaungsu as pds
pds.detect("...") # returns 'karen', 'mm_uni', 'mm_zg'
# after
import ricelang as pds
pds.detect("...") # returns ISO 639-3: 'ksw', 'mya', 'zgi'
Most function calls work as-is. The package name, import path, and label
strings changed. The function names (detect, tokenize, cvt2zg,
cvt2uni) are preserved.
Install change¶
Label changes¶
| pyidaungsu | ricelang | notes |
|---|---|---|
mm_uni |
mya |
Burmese Unicode (ISO 639-3) |
mm_zg |
zgi |
Burmese Zawgyi encoding |
karen |
ksw |
Now specifically S'gaw Karen (ISO 639-3); ricelang also distinguishes pwo (Pwo) and kvq (Geba) |
mon |
mnw |
Mon (ISO 639-3) |
shan |
shn |
Shan (ISO 639-3) |
| (none) | cnh, cfm, ctd |
Hakha / Falam / Tedim Chin (new) |
| (none) | eky |
Eastern Kayah (new) |
| (none) | kac |
Jingphaw / Kachin (new) |
| (none) | eng, hin, khm, lao, msa, tam, tgl, tha, vie, zho, ban, sun, hnn |
broader regional + South Asian (new) |
| (none) | 27 script-rule freebies (kor, jpn, ell, heb, hye, kat, amh, sin, bod, …) |
new via Unicode-block rules |
API differences¶
detect() returns None for out-of-scope¶
# pyidaungsu
pds.detect("🎉") # returned some confidently-wrong label
# ricelang
rl.detect("🎉") # None
Callers that relied on detect() always returning a string can pass
fallback="..." to keep that behavior:
Tokenization¶
tokenize() signature is unchanged for form="syllable" and form="word".
Legacy lang codes "mm", "karen", "mon", "shan" still work.
New in ricelang:
rl.tokenize(text, form="bpe") # multilingual BPE (default)
rl.tokenize(text, form="bpe", lang="mya") # per-language BPE
cvt2zgi is now an alias¶
cvt2zgi was documented in pyidaungsu's README but not actually exported.
ricelang exports it as an alias for cvt2zg so any documentation copy-paste
works.
What else is new¶
- CLI:
pip install ricelangregisters aricelangcommand. See CLI reference. - Hierarchical detector: a Unicode-script rule runs before the trained classifier. See How it works.
- BPE tokenizers: 24 per-language + 1 multilingual. See Tokenization.
- Demo server: a FastAPI app under
demo/exposes every endpoint with a form UI. - Reproducible training: scripts under
scripts/let you rebuild every bundled model from scratch using the corpus repo. See Training. uv-native:pyproject.toml, nosetup.py,uv.lockchecked in.
What's gone¶
download_model()— no longer needed. Models ship with the package.- The "syllable + word level" output dictionary structure — replaced by plain lists, same as before but cleaner.