Skip to content

Migrating from pyidaungsu

ricelang is a revamp of pyidaungsu. If you're coming from pyidaungsu 0.1.x, here's everything that changed.

TL;DR

# before
import pyidaungsu as pds
pds.detect("...")   # returns 'karen', 'mm_uni', 'mm_zg'

# after
import ricelang as pds
pds.detect("...")   # returns ISO 639-3: 'ksw', 'mya', 'zgi'

Most function calls work as-is. The package name, import path, and label strings changed. The function names (detect, tokenize, cvt2zg, cvt2uni) are preserved.

Install change

pip uninstall pyidaungsu
pip install ricelang

Label changes

pyidaungsu ricelang notes
mm_uni mya Burmese Unicode (ISO 639-3)
mm_zg zgi Burmese Zawgyi encoding
karen ksw Now specifically S'gaw Karen (ISO 639-3); ricelang also distinguishes pwo (Pwo) and kvq (Geba)
mon mnw Mon (ISO 639-3)
shan shn Shan (ISO 639-3)
(none) cnh, cfm, ctd Hakha / Falam / Tedim Chin (new)
(none) eky Eastern Kayah (new)
(none) kac Jingphaw / Kachin (new)
(none) eng, hin, khm, lao, msa, tam, tgl, tha, vie, zho, ban, sun, hnn broader regional + South Asian (new)
(none) 27 script-rule freebies (kor, jpn, ell, heb, hye, kat, amh, sin, bod, …) new via Unicode-block rules

API differences

detect() returns None for out-of-scope

# pyidaungsu
pds.detect("🎉")   # returned some confidently-wrong label

# ricelang
rl.detect("🎉")    # None

Callers that relied on detect() always returning a string can pass fallback="..." to keep that behavior:

rl.detect(text, fallback="unknown")

Tokenization

tokenize() signature is unchanged for form="syllable" and form="word". Legacy lang codes "mm", "karen", "mon", "shan" still work.

New in ricelang:

rl.tokenize(text, form="bpe")               # multilingual BPE (default)
rl.tokenize(text, form="bpe", lang="mya")   # per-language BPE

cvt2zgi is now an alias

cvt2zgi was documented in pyidaungsu's README but not actually exported. ricelang exports it as an alias for cvt2zg so any documentation copy-paste works.

What else is new

  • CLI: pip install ricelang registers a ricelang command. See CLI reference.
  • Hierarchical detector: a Unicode-script rule runs before the trained classifier. See How it works.
  • BPE tokenizers: 24 per-language + 1 multilingual. See Tokenization.
  • Demo server: a FastAPI app under demo/ exposes every endpoint with a form UI.
  • Reproducible training: scripts under scripts/ let you rebuild every bundled model from scratch using the corpus repo. See Training.
  • uv-native: pyproject.toml, no setup.py, uv.lock checked in.

What's gone

  • download_model() — no longer needed. Models ship with the package.
  • The "syllable + word level" output dictionary structure — replaced by plain lists, same as before but cleaner.