Language detection¶
import ricelang as rl
rl.detect("ထမင်းစားပြီးပြီလား") # 'mya'
rl.detect("안녕하세요") # 'kor' (via script-rule, no ML)
rl.detect("Pathian nih van") # 'cnh'
rl.detect("🎉") # None (out of scope)
rl.detect(text, fallback=None)¶
Returns an ISO 639-3 label, or fallback if no rule fires and the text isn't
in a supported shared script.
rl.detect("hello") # might return 'eng' if Latin-script ML fires
rl.detect("🎉🎉🎉") # None
rl.detect("🎉🎉🎉", fallback="unknown") # 'unknown'
The function consults the hierarchical detector:
- Unicode-script rule runs first. For monopoly scripts (Hangul, Hiragana, Greek, Hebrew, Thai, Tamil, Devanagari, …), the rule alone returns the label — no ML invoked.
- Trained classifier runs only when text is in a shared script (Latin or the Myanmar block).
- fallback is returned when no rule matches and the text isn't in a supported shared script.
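The routing order described above can be sketched in plain Python. This is a minimal illustration, not ricelang's actual implementation: the rule table, the `script_of` and `detect_sketch` names, and the injected `classify` callable are all hypothetical, and the script is approximated by the first word of each character's Unicode name.

```python
import unicodedata
from collections import Counter

# Hypothetical rule table for this sketch only: script name -> ISO 639-3 label.
# ricelang's real table is not shown in this document.
MONOPOLY_SCRIPTS = {"HANGUL": "kor", "HIRAGANA": "jpn", "THAI": "tha"}
# Scripts shared by several languages, where a classifier has to decide.
SHARED_SCRIPTS = {"LATIN", "MYANMAR"}


def script_of(ch):
    """Approximate a character's script by the first word of its Unicode name."""
    if not ch.isalpha():  # skips emoji, digits, punctuation, combining marks
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else None


def detect_sketch(text, classify, fallback=None):
    """Script rule first; call classify(text) only for shared scripts."""
    counts = Counter(s for s in map(script_of, text) if s)
    if not counts:
        return fallback  # no letters at all, e.g. emoji-only input
    dominant = counts.most_common(1)[0][0]
    if dominant in MONOPOLY_SCRIPTS:
        return MONOPOLY_SCRIPTS[dominant]  # rule alone decides, no ML call
    if dominant in SHARED_SCRIPTS:
        return classify(text)  # stand-in for the trained classifier
    return fallback  # unsupported script such as Cyrillic
```

Passing the classifier in as a callable keeps the sketch runnable without a model and makes it easy to check that the monopoly-script path never touches it.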
rl.predict(text, k=1, threshold=0.0)¶
Returns the raw fastText top-k as (labels_tuple, probs_array). Unlike
detect(), this always calls the ML model — no script-rule shortcut.
labels, probs = rl.predict("Pathian nih van", k=5)
# labels = ('__label__cnh', '__label__cfm', '__label__ctd', '__label__ksw', '__label__eng')
# probs = array([1.0000, 0.0000, 0.0000, 0.0000, 0.0000])
# Strip the __label__ prefix (str.removeprefix needs Python 3.9+)
clean = [(label.removeprefix("__label__"), float(p)) for label, p in zip(labels, probs)]
Use predict() when you need:
- Top-k candidates with probabilities
- A confidence score to decide whether to trust the answer
- Batch prediction (text can be a list of strings)
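One way to turn the probabilities into a trust decision is a small gate on the top-1 score and its lead over the runner-up. The helper below is a sketch under assumptions: `pick_confident` is a hypothetical name, the two thresholds are illustrative values rather than recommendations from ricelang, and the inputs are assumed to have the (labels, probs) shape shown above for a single string.

```python
def pick_confident(labels, probs, min_prob=0.8, min_margin=0.2):
    """Return the top-1 label without its prefix, or None if it looks shaky.

    min_prob and min_margin are illustrative defaults; tune them on your data.
    """
    if len(labels) == 0:
        return None
    top = float(probs[0])
    runner_up = float(probs[1]) if len(probs) > 1 else 0.0
    # Reject low-confidence answers and near-ties between the top two labels.
    if top < min_prob or top - runner_up < min_margin:
        return None
    return labels[0].removeprefix("__label__")  # Python 3.9+
```

Typical use would be `pick_confident(*rl.predict(text, k=2))`, falling back to whatever "unknown" handling the pipeline already has when the result is None.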
Output reliability tips¶
Short text¶
Short strings (under ~10 characters in a shared script, that is, Latin or the Myanmar block) are inherently ambiguous because they may not contain any discriminating n-grams. The model still returns a label, so check the top-1 probability before trusting it on short input.
labels, probs = rl.predict("Ka dam", k=2)
# probably ('__label__cnh', '__label__cfm') with probs around (0.6, 0.4): genuine ambiguity
Mixed-script input¶
ricelang routes by the dominant script. A mixed-language string such as "Hello [Burmese text]" is routed to whichever script makes up more than 30% of the text, and the other language is ignored.
If you have mixed-language documents, split into sentences first (sentence segmentation is on the roadmap).
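Until sentence segmentation is available, a cruder option is to split on script boundaries and detect each run separately. Note that this swaps sentence splitting for script-run splitting, so it only helps when the languages use different scripts. The `split_by_script` helper is hypothetical and, as a simplification, takes the first word of each character's Unicode name as its script.

```python
import unicodedata


def _script(ch):
    """First word of the Unicode name for letters, None for everything else."""
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "UNKNOWN").split(" ", 1)[0]


def split_by_script(text):
    """Split text into (script, chunk) runs.

    Spaces, digits, punctuation and combining marks stay with the run
    they appear in; text with no letters at all yields an empty list.
    """
    runs = []
    current, buf = None, []
    for ch in text:
        s = _script(ch)
        if s is not None and current is not None and s != current:
            # Script changed: close the previous run before starting a new one.
            runs.append((current, "".join(buf).strip()))
            buf = []
        if s is not None:
            current = s
        buf.append(ch)
    if buf and current is not None:
        runs.append((current, "".join(buf).strip()))
    return runs
```

Each chunk can then be passed to `rl.detect()` on its own, so that a Latin greeting and a Burmese sentence in the same string each get a label.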
Out-of-scope text¶
For text in scripts ricelang doesn't support (Cyrillic, Arabic, etc.) or
non-textual input (emoji, garbage), detect() returns None. This is
intentional: a silent wrong prediction breaks downstream pipelines worse
than an explicit "I don't know."
rl.detect("Привет") # None (no Cyrillic rule)
rl.detect("مرحبا") # None (no Arabic rule)
rl.detect("😀😀") # None
If you need worldwide language coverage, run fastText's lid.176
as an upstream router and call ricelang only for Southeast and South Asian text.
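The upstream-router arrangement might look like the sketch below. Both detectors are passed in as callables so the example stays runnable without model files. In a real setup `upstream_detect` could wrap a fastText model loaded from lid.176.bin and `regional_detect` could be `rl.detect`. The `make_router` name and the set of codes to forward are assumptions you would need to choose for your own data.

```python
def make_router(upstream_detect, regional_detect, forward_codes):
    """Build a detector that defers to regional_detect for selected labels.

    upstream_detect: callable(text) -> coarse language code
    regional_detect: callable(text) -> finer label, or None when unsure
    forward_codes:   upstream codes that should be re-checked regionally
    """
    def detect(text):
        coarse = upstream_detect(text)
        if coarse in forward_codes:
            fine = regional_detect(text)
            if fine is not None:
                return fine
        # Note: the two detectors may use different code schemes
        # (lid.176 mostly emits two-letter codes, ricelang ISO 639-3).
        return coarse

    return detect
```

Because the result can come from either detector, callers that compare labels will probably want to normalise both code schemes to one standard first.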
Performance¶
- Cold start: ~80ms (fastText model + a few BPE tokenizers lazy-load on first call)
- Per-call: ~0.1ms for script-rule paths, ~1ms for ML paths
- Batched (via predict() with a list): much higher throughput
- No GPU, no network, no shared state, so it is safe in worker pools
Sibling-language ambiguity (not supported)¶
A few languages are deliberately not in ricelang because text-only detection can't disambiguate them from a sibling already in the set:
- Indonesian (ind) collides with Malay (msa): "Terima kasih" is both
- Rakhine (rki) collides with Burmese (mya) on short phrases
- Chinese Traditional (zho_hant) collides with Simplified (zho)
See Supported languages for the rationale.