Python API reference¶
Everything is at the top level — import ricelang as rl exposes the full
public surface.
Detection¶
ricelang.detect¶
ricelang.detect.detect ¶
Return an ISO 639-3 language label for text.
Strategy
- If a monopoly Unicode-script rule fires (Korean, Japanese, Greek, Hebrew, Thai, Tamil, Hindi, ..., or any of the SE-Asia-specific scripts like Javanese, Cham, Meetei), return that label directly.
- If the text is in Latin or the Myanmar block (shared scripts used by multiple supported languages), defer to the trained fastText classifier.
- If neither — text contains no characters in any supported script
— return
fallback(defaultNone). Examples: emoji-only, garbage like "lkj qwerty 123", text in a script with no rule.
A few labels also marked as "trained" are intentionally captured by
the script rule first because their script is monopolistic in our
label set (e.g. hin for Devanagari, zho for CJK without
kana/Hangul, tha for Thai block, eky for Kayah Li). This
means the classifier never runs for those, which is faster and more
robust on very short input.
ricelang.predict¶
ricelang.detect.predict ¶
predict(
text: str | Sequence[str],
k: int = 1,
threshold: float = 0.0,
on_unicode_error: str = "strict",
)
Return raw fastText predictions for text (string or list of strings).
This always calls the ML classifier. For script-aware dispatch (which
catches Korean / Japanese / Greek / etc. without the model), use
:func:detect.
Tokenization¶
ricelang.tokenize¶
ricelang.tokenize.tokenize ¶
Tokenize text into syllables (default), words, or BPE subwords.
form="syllable"— regex-based syllable split.langmust be one of"mm","karen","mon","shan".form="word"— CRF-based word segmentation (Burmese only).form="bpe"— Byte-Pair Encoding subwords.langselects the bundled tokenizer:"multi"(default for BPE) uses the multilingual BPE; otherwise pass an ISO 639-3 code with a bundled model (mya,ksw,pwo,kvq,cnh,cfm,ctd,eky,shn).
Encoding conversion¶
ricelang.cvt2zg¶
ricelang.cvt2uni¶
ricelang.cvt2zgi¶
Alias for cvt2zg. Preserved for backwards compatibility
with the original pyidaungsu API.
Module metadata¶
ricelang.__version__¶
The installed library version, e.g. "0.4.3".
Lower-level utilities¶
For most use cases the top-level functions above are enough. These are exposed for callers that need finer control:
ricelang.scripts.script_detect¶
ricelang.scripts.script_detect ¶
Return a label, "__shared_latin", "__shared_mymr", or None.
- Concrete ISO label (e.g.
"kor","hin"): the text is in a script used by only one language we care about. No ML needed. "__shared_latin"/"__shared_mymr": the text is dominantly in a shared script and needs the ML classifier to disambiguate.None: no supported script dominates (e.g. emoji-only, pure punctuation, or text in some other script not in our table).
The Unicode-rule first stage of detect(). Returns:
- An ISO 639-3 label if a monopoly-script rule fires
"__shared_latin"or"__shared_mymr"if the text is in a shared script (caller should run the ML classifier)Noneif no supported script dominates the text
You won't normally call this directly — rl.detect() handles the dispatch.