Python API reference

Everything is available at the top level: import ricelang as rl exposes the full public surface.

Detection

ricelang.detect

ricelang.detect.detect

detect(
    text: str, fallback: Optional[str] = None
) -> Optional[str]

Return an ISO 639-3 language label for text.

Strategy
  • If a monopoly Unicode-script rule fires (Korean, Japanese, Greek, Hebrew, Thai, Tamil, Hindi, ..., or any of the SE-Asia-specific scripts like Javanese, Cham, Meetei), return that label directly.
  • If the text is in Latin or the Myanmar block (shared scripts used by multiple supported languages), defer to the trained fastText classifier.
  • If neither applies (the text contains no characters in any supported script), return fallback (default None). Examples: emoji-only input, garbage like "lkj qwerty 123", or text in a script with no rule.

A few labels also marked as "trained" are intentionally captured by the script rule first because their script is monopolistic in our label set (e.g. hin for Devanagari, zho for CJK without kana/Hangul, tha for Thai block, eky for Kayah Li). This means the classifier never runs for those, which is faster and more robust on very short input.
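The three branches of the strategy above can be sketched as a small dispatcher. This is an illustrative outline only, not the library's implementation: detect_sketch, toy_rule, and toy_classifier are hypothetical names, and the toy rule covers just Hangul and ASCII letters.

```python
from typing import Callable, Optional

# Sentinel values documented for the script-rule stage.
SHARED = {"__shared_latin", "__shared_mymr"}


def detect_sketch(
    text: str,
    script_rule: Callable[[str], Optional[str]],
    classify: Callable[[str], str],
    fallback: Optional[str] = None,
) -> Optional[str]:
    """Three-branch dispatch mirroring the documented strategy."""
    result = script_rule(text)
    if result is None:
        # No supported script: hand back the caller's fallback.
        return fallback
    if result in SHARED:
        # Shared script: only the classifier can disambiguate.
        return classify(text)
    # Monopoly script: the rule's label is final and the classifier never runs.
    return result


# Hypothetical stand-ins, for illustration only.
def toy_rule(text: str) -> Optional[str]:
    if any("\uac00" <= ch <= "\ud7a3" for ch in text):  # Hangul syllables
        return "kor"
    if any(ch.isascii() and ch.isalpha() for ch in text):
        return "__shared_latin"
    return None


def toy_classifier(text: str) -> str:
    return "eng"  # a real classifier would inspect the text
```

Passing the rule and the classifier in as callables keeps the sketch testable without a trained model; the real detect() bundles both stages.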

ricelang.predict

ricelang.detect.predict

predict(
    text: str | Sequence[str],
    k: int = 1,
    threshold: float = 0.0,
    on_unicode_error: str = "strict",
)

Return raw fastText predictions for text (string or list of strings).

This always calls the ML classifier. For script-aware dispatch (which catches Korean / Japanese / Greek / etc. without the model), use detect().
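The return type is not stated here. Assuming predict() follows the fastText Python binding for a single string, that is, a (labels, probabilities) pair with labels carrying fastText's default "__label__" prefix, a post-processing helper might look like the sketch below. top_labels is a hypothetical name and the input in the usage note is made up for illustration.

```python
from typing import List, Sequence, Tuple

# fastText's default label prefix; assumed unchanged by the bundled model.
LABEL_PREFIX = "__label__"


def top_labels(
    raw: Tuple[Sequence[str], Sequence[float]]
) -> List[Tuple[str, float]]:
    """Turn a fastText-style (labels, probs) pair into (code, prob) tuples."""
    labels, probs = raw
    cleaned = []
    for label, prob in zip(labels, probs):
        if label.startswith(LABEL_PREFIX):
            label = label[len(LABEL_PREFIX):]
        cleaned.append((label, float(prob)))
    return cleaned
```

For example, top_labels((("__label__mya", "__label__shn"), [0.9, 0.1])) yields [("mya", 0.9), ("shn", 0.1)]. If the library returns a different shape, the helper would need adjusting.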

Tokenization

ricelang.tokenize

ricelang.tokenize.tokenize

tokenize(
    text: str, lang: str = "mm", form: Form = "syllable"
) -> list[str]

Tokenize text into syllables (default), words, or BPE subwords.

  • form="syllable" — regex-based syllable split. lang must be one of "mm", "karen", "mon", "shan".
  • form="word" — CRF-based word segmentation (Burmese only).
  • form="bpe" — Byte-Pair Encoding subwords. lang selects the bundled tokenizer: "multi" (default for BPE) uses the multilingual BPE; otherwise pass an ISO 639-3 code with a bundled model (mya, ksw, pwo, kvq, cnh, cfm, ctd, eky, shn).
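The lang/form combinations listed above can be captured in a small lookup, which is handy for validating user input before calling tokenize(). This is a sketch built only from the list above: is_documented_combo and DOCUMENTED_LANGS are hypothetical names, and how the library itself treats other combinations (including the signature default "mm" with form="bpe") is not specified here.

```python
from typing import Dict, FrozenSet, Optional

# Allowed lang values per form, as listed in the reference text.
# None means the text does not enumerate the accepted values.
DOCUMENTED_LANGS: Dict[str, Optional[FrozenSet[str]]] = {
    "syllable": frozenset({"mm", "karen", "mon", "shan"}),
    "word": None,  # "Burmese only"; the accepted lang value is not stated
    "bpe": frozenset(
        {"multi", "mya", "ksw", "pwo", "kvq", "cnh", "cfm", "ctd", "eky", "shn"}
    ),
}


def is_documented_combo(lang: str, form: str) -> bool:
    """Return True if (lang, form) matches a combination listed in the docs."""
    if form not in DOCUMENTED_LANGS:
        raise ValueError(f"unknown form: {form!r}")
    allowed = DOCUMENTED_LANGS[form]
    return allowed is None or lang in allowed
```

Note that the syllable tokenizer uses short names ("mm", "shan") while the BPE tokenizers use ISO 639-3 codes ("mya", "shn"), so the same language is spelled differently depending on form.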

Encoding conversion

ricelang.cvt2zg

ricelang.convert.cvt2zg

cvt2zg(text: str) -> str

Convert Unicode Burmese to Zawgyi.

ricelang.cvt2uni

ricelang.convert.cvt2uni

cvt2uni(text: str) -> str

Convert Zawgyi Burmese to Unicode.

ricelang.cvt2zgi

Alias for cvt2zg. Preserved for backwards compatibility with the original pyidaungsu API.

Module metadata

ricelang.__version__

The installed library version, e.g. "0.4.3".

Lower-level utilities

For most use cases the top-level functions above are enough. These are exposed for callers that need finer control:

ricelang.scripts.script_detect

script_detect(text: str) -> Optional[str]

Run the Unicode-rule first stage of detect(). Returns one of:

  • A concrete ISO 639-3 label (e.g. "kor", "hin"): a monopoly-script rule fired, meaning the text is in a script used by only one supported language. No ML needed.
  • "__shared_latin" / "__shared_mymr": the text is dominantly in a shared script, and the caller should run the ML classifier to disambiguate.
  • None: no supported script dominates (e.g. emoji-only, pure punctuation, or text in a script not in the rule table).

You won't normally call this directly — rl.detect() handles the dispatch.
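To make the idea of a dominant script concrete, here is a toy version of such a rule. It is a sketch under loud assumptions: script_detect_sketch and RANGES are hypothetical, the table covers only four scripts, and "dominant" is taken to mean "most frequent among recognised characters", which may differ from the library's actual criterion.

```python
from collections import Counter
from typing import Optional

# Illustrative code-point ranges only; the library's real table is larger.
RANGES = [
    (0xAC00, 0xD7A3, "kor"),             # Hangul syllables
    (0x0E00, 0x0E7F, "tha"),             # Thai block
    (0x1000, 0x109F, "__shared_mymr"),   # Myanmar block
    (0x0041, 0x005A, "__shared_latin"),  # A-Z
    (0x0061, 0x007A, "__shared_latin"),  # a-z
]


def script_detect_sketch(text: str) -> Optional[str]:
    """Return the tag of the most frequent recognised script, or None."""
    counts: Counter = Counter()
    for ch in text:
        cp = ord(ch)
        for lo, hi, tag in RANGES:
            if lo <= cp <= hi:
                counts[tag] += 1
                break
    if not counts:
        # Nothing matched any range: emoji, digits, punctuation, etc.
        return None
    return counts.most_common(1)[0][0]
```

Characters outside every range (digits, spaces, emoji) are simply ignored, so mixed input such as Korean text with a couple of Latin letters still resolves to the script that contributes the most characters.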