Tokenization¶

ricelang exposes three tokenization modes via the same tokenize() function:

`form`	Algorithm	Languages
`"syllable"` (default)	Regex-based syllable split	Burmese (`mm`), Karen (`karen`), Mon (`mon`), Shan (`shan`)
`"word"`	CRF word segmentation	Burmese only
`"bpe"`	Byte-pair encoding subwords	Per-language model (24 langs) + multilingual (`multi`)

Syllable (default)¶

Splits on regex rules that respect each language's consonant + virama patterns. Cheap, fast, good for text statistics and downstream ML inputs.

import ricelang as rl

rl.tokenize("Alan Turingကို Artificial Intelligence")
# ['Alan', 'Turing', 'ကို', 'Artificial', 'Intelligence']

rl.tokenize("ဖေဖေနဲ့မေမေ၏ကျေးဇူး")
# ['ဖေ', 'ဖေ', 'နဲ့', 'မေ', 'မေ', '၏', 'ကျေး', 'ဇူး']

rl.tokenize("လူၤစံယၤ အခီၣ်စ့ၣ်တကပၤ", lang="karen")
# ['လူၤ', 'စံ', 'ယၤ', 'အ', 'ခီၣ်', 'စ့ၣ်', 'တ', 'က', 'ပၤ']

Word (Burmese only, CRF-based)¶

Segments Burmese into linguistic words using a trained CRF model (tokenizer.crfsuite, ~1.4 MB). Slower than syllable but much closer to what a human reader would produce.

rl.tokenize("ဖေဖေနဲ့မေမေ၏ကျေးဇူးတရားမှာကြီးမားလှပေသည်", form="word")
# ['ဖေဖေ', 'နဲ့', 'မေမေ', '၏', 'ကျေးဇူးတရား', 'မှာ',
#  'ကြီးမား', 'လှ', 'ပေ', 'သည်']

BPE subwords¶

Byte-pair encoding via HuggingFace tokenizers. Useful as input to downstream LLM / embedding pipelines that expect subword units.

Two flavors are bundled:

multi (default): 32k-vocab tokenizer trained on the full corpus. Handles every supported script in one model. Best for code-switching or unknown-language input.
Per-language (24 of them): 16k-vocab tokenizer trained on one language's corpus. Tighter splits on the target language at the cost of not handling other scripts well.

# multilingual (default for form="bpe")
rl.tokenize("Pathian nih van", form="bpe")
# ['Pathian', 'nih', 'van']

# per-language Burmese
rl.tokenize("ဖေဖေနဲ့မေမေ၏ကျေးဇူး", form="bpe", lang="mya")
# ['ဖေ', 'ဖေ', 'နဲ့', 'မေ', 'မေ', '၏', 'ကျေးဇူး']

# per-language Thai
rl.tokenize("สวัสดีครับ ขอบคุณมากครับ", form="bpe", lang="tha")

Available per-language BPE models¶

mya, ksw, pwo, kvq, cnh, cfm, ctd, eky, shn, eng, hin, khm, lao, msa, tam, tgl, tha, vie, zho, ban, hnn, kac, mnw, sun.

(No zgi BPE — Zawgyi text should be normalized to Unicode via cvt2uni() before tokenizing.)

Unknown lang silently falls back to multi.

Retraining BPEs¶

Both the per-language and multilingual BPEs can be retrained from your own corpus — see Training BPEs.

Choosing the right form¶

Use case	Recommended
Quick word count / text statistics	`syllable`
Search index / phrase matching for Burmese	`word`
Feeding into an LLM / embedding model	`bpe` (multi or per-language)
Building a non-Burmese SE Asian word-level pipeline	`syllable` + custom rules