Python API reference¶

Everything is at the top level — import ricelang as rl exposes the full public surface.

Detection¶

`ricelang.detect`¶

ricelang.detect.detect ¶

detect(
    text: str, fallback: Optional[str] = None
) -> Optional[str]

Return an ISO 639-3 language label for text.

Strategy

If a monopoly Unicode-script rule fires (Korean, Japanese, Greek, Hebrew, Thai, Tamil, Hindi, ..., or any of the SE-Asia-specific scripts like Javanese, Cham, Meetei), return that label directly.
If the text is in Latin or the Myanmar block (shared scripts used by multiple supported languages), defer to the trained fastText classifier.
If neither — text contains no characters in any supported script — return fallback (default None). Examples: emoji-only, garbage like "lkj qwerty 123", text in a script with no rule.

A few labels also marked as "trained" are intentionally captured by the script rule first because their script is monopolistic in our label set (e.g. hin for Devanagari, zho for CJK without kana/Hangul, tha for Thai block, eky for Kayah Li). This means the classifier never runs for those, which is faster and more robust on very short input.

`ricelang.predict`¶

ricelang.detect.predict ¶

predict(
    text: str | Sequence[str],
    k: int = 1,
    threshold: float = 0.0,
    on_unicode_error: str = "strict",
)

Return raw fastText predictions for text (string or list of strings).

This always calls the ML classifier. For script-aware dispatch (which catches Korean / Japanese / Greek / etc. without the model), use :func:detect.

Tokenization¶

`ricelang.tokenize`¶

ricelang.tokenize.tokenize ¶

tokenize(
    text: str, lang: str = "mm", form: Form = "syllable"
) -> list[str]

Tokenize text into syllables (default), words, or BPE subwords.

form="syllable" — regex-based syllable split. lang must be one of "mm", "karen", "mon", "shan".
form="word" — CRF-based word segmentation (Burmese only).
form="bpe" — Byte-Pair Encoding subwords. lang selects the bundled tokenizer: "multi" (default for BPE) uses the multilingual BPE; otherwise pass an ISO 639-3 code with a bundled model (mya, ksw, pwo, kvq, cnh, cfm, ctd, eky, shn).

Encoding conversion¶

`ricelang.cvt2zg`¶

ricelang.convert.cvt2zg ¶

cvt2zg(text: str) -> str

Convert Unicode Burmese to Zawgyi.

`ricelang.cvt2uni`¶

ricelang.convert.cvt2uni ¶

cvt2uni(text: str) -> str

Convert Zawgyi Burmese to Unicode.

`ricelang.cvt2zgi`¶

Alias for cvt2zg. Preserved for backwards compatibility with the original pyidaungsu API.

Module metadata¶

`ricelang.version`¶

The installed library version, e.g. "0.4.3".

Lower-level utilities¶

For most use cases the top-level functions above are enough. These are exposed for callers that need finer control:

`ricelang.scripts.script_detect`¶

ricelang.scripts.script_detect ¶

script_detect(text: str) -> Optional[str]

Return a label, "__shared_latin", "__shared_mymr", or None.

Concrete ISO label (e.g. "kor", "hin"): the text is in a script used by only one language we care about. No ML needed.
"__shared_latin" / "__shared_mymr": the text is dominantly in a shared script and needs the ML classifier to disambiguate.
None: no supported script dominates (e.g. emoji-only, pure punctuation, or text in some other script not in our table).

The Unicode-rule first stage of detect(). Returns:

An ISO 639-3 label if a monopoly-script rule fires
"__shared_latin" or "__shared_mymr" if the text is in a shared script (caller should run the ML classifier)
None if no supported script dominates the text

You won't normally call this directly — rl.detect() handles the dispatch.