Skip to content

Tokenization

ricelang exposes three tokenization modes via the same tokenize() function:

form Algorithm Languages
"syllable" (default) Regex-based syllable split Burmese (mm), Karen (karen), Mon (mon), Shan (shan)
"word" CRF word segmentation Burmese only
"bpe" Byte-pair encoding subwords Per-language model (24 langs) + multilingual (multi)

Syllable (default)

Splits on regex rules that respect each language's consonant + virama patterns. Cheap, fast, good for text statistics and downstream ML inputs.

import ricelang as rl

rl.tokenize("Alan Turingကို Artificial Intelligence")
# ['Alan', 'Turing', 'ကို', 'Artificial', 'Intelligence']

rl.tokenize("ဖေဖေနဲ့မေမေ၏ကျေးဇူး")
# ['ဖေ', 'ဖေ', 'နဲ့', 'မေ', 'မေ', '၏', 'ကျေး', 'ဇူး']

rl.tokenize("လူၤစံယၤ အခီၣ်စ့ၣ်တကပၤ", lang="karen")
# ['လူၤ', 'စံ', 'ယၤ', 'အ', 'ခီၣ်', 'စ့ၣ်', 'တ', 'က', 'ပၤ']

Word (Burmese only, CRF-based)

Segments Burmese into linguistic words using a trained CRF model (tokenizer.crfsuite, ~1.4 MB). Slower than syllable but much closer to what a human reader would produce.

rl.tokenize("ဖေဖေနဲ့မေမေ၏ကျေးဇူးတရားမှာကြီးမားလှပေသည်", form="word")
# ['ဖေဖေ', 'နဲ့', 'မေမေ', '၏', 'ကျေးဇူးတရား', 'မှာ',
#  'ကြီးမား', 'လှ', 'ပေ', 'သည်']

BPE subwords

Byte-pair encoding via HuggingFace tokenizers. Useful as input to downstream LLM / embedding pipelines that expect subword units.

Two flavors are bundled:

  • multi (default): 32k-vocab tokenizer trained on the full corpus. Handles every supported script in one model. Best for code-switching or unknown-language input.
  • Per-language (24 of them): 16k-vocab tokenizer trained on one language's corpus. Tighter splits on the target language at the cost of not handling other scripts well.
# multilingual (default for form="bpe")
rl.tokenize("Pathian nih van", form="bpe")
# ['Pathian', 'nih', 'van']

# per-language Burmese
rl.tokenize("ဖေဖေနဲ့မေမေ၏ကျေးဇူး", form="bpe", lang="mya")
# ['ဖေ', 'ဖေ', 'နဲ့', 'မေ', 'မေ', '၏', 'ကျေးဇူး']

# per-language Thai
rl.tokenize("สวัสดีครับ ขอบคุณมากครับ", form="bpe", lang="tha")

Available per-language BPE models

mya, ksw, pwo, kvq, cnh, cfm, ctd, eky, shn, eng, hin, khm, lao, msa, tam, tgl, tha, vie, zho, ban, hnn, kac, mnw, sun.

(No zgi BPE — Zawgyi text should be normalized to Unicode via cvt2uni() before tokenizing.)

Unknown lang silently falls back to multi.

Retraining BPEs

Both the per-language and multilingual BPEs can be retrained from your own corpus — see Training BPEs.

Choosing the right form

Use case Recommended
Quick word count / text statistics syllable
Search index / phrase matching for Burmese word
Feeding into an LLM / embedding model bpe (multi or per-language)
Building a non-Burmese SE Asian word-level pipeline syllable + custom rules