Corpus layout¶
The training corpus lives in a separate repo:
kaunghtetsan275/corpus.
File naming convention¶
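Every file under corpus/data/ follows the pattern <iso>_<source>.txt, where <iso> is the language label and <source> names where the text came from.
For example: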
| File | Language | Source |
|---|---|---|
| mya_jsw.txt | Burmese | JW.org |
| mya_mmtimes.txt | Burmese | Myanmar Times (xlsx export → txt) |
| cnh_jsw.txt | Hakha Chin | JW.org |
| cnh_youversion.txt | Hakha Chin | YouVersion Bible v327 |
| mnw_wikipedia.txt | Mon | Mon Wikipedia dump |
| eng_youversion.txt | English | YouVersion KJV (v1) |
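The convention is simple enough to check mechanically. The sketch below splits a file name into its label and source parts. `parse_corpus_filename` is a hypothetical helper, not something the repo is known to provide, and it assumes the label itself never contains an underscore.

```python
from pathlib import Path
from typing import Tuple


def parse_corpus_filename(path: str) -> Tuple[str, str]:
    """Split '<iso>_<source>.txt' into (iso, source)."""
    p = Path(path)
    if p.suffix != ".txt":
        raise ValueError(f"expected a .txt file, got {p.name!r}")
    # Split on the first underscore only, so the source part may contain more.
    iso, sep, source = p.stem.partition("_")
    if not sep or not iso or not source:
        raise ValueError(f"{p.name!r} does not match <iso>_<source>.txt")
    return iso, source
```

For instance, `parse_corpus_filename("corpus/data/cnh_youversion.txt")` yields the pair `("cnh", "youversion")`.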
The Burmese files are all tagged mya, even though the detector emits both
mya (Unicode) and zgi (Zawgyi) labels. The zgi examples are synthesized at
corpus-build time by running cvt2zg over the mya text.
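As a rough illustration of that synthesis step, the sketch below produces a zgi-labelled copy of every mya-labelled line. It assumes the default fastText `__label__` prefix, and `to_zawgyi` is a placeholder parameter standing in for cvt2zg. The actual logic in scripts/build_corpus.py may differ.

```python
from typing import Callable, Iterable, Iterator

MYA_PREFIX = "__label__mya "  # assumption: default fastText label prefix
ZGI_PREFIX = "__label__zgi "


def synthesize_zgi(
    lines: Iterable[str], to_zawgyi: Callable[[str], str]
) -> Iterator[str]:
    """Yield a zgi-labelled, converted copy of every mya-labelled line."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(MYA_PREFIX):
            text = line[len(MYA_PREFIX):]
            # to_zawgyi is whatever Unicode-to-Zawgyi converter is available.
            yield ZGI_PREFIX + to_zawgyi(text)
```

Passing the converter in as an argument keeps the sketch independent of any particular conversion library.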
File contents¶
Every file is in fastText supervised format, with one labelled example per line.
The build script (scripts/build_corpus.py) reads every *.txt, applies
short-prefix augmentation, caps each label, and emits data/train.txt
and data/valid.txt.
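The build steps can be sketched roughly as below. This is not the contents of scripts/build_corpus.py: the function names, the cap and split values, and the reading of "short-prefix augmentation" as adding character-level prefixes of each example are all assumptions made for illustration, as is the default `__label__` prefix.

```python
import random
from collections import defaultdict
from pathlib import Path


def short_prefixes(text, lengths=(8, 16)):
    """Return short character-level prefixes of text (assumed augmentation)."""
    prefixes = (text[:n].strip() for n in lengths if n < len(text))
    return [p for p in prefixes if p]


def build_split(data_dir, out_dir, cap=1000, valid_frac=0.1, seed=0):
    """Read every *.txt, augment, cap each label, write train.txt and valid.txt."""
    by_label = defaultdict(list)
    for path in sorted(Path(data_dir).glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            label, _, text = line.partition(" ")
            text = text.strip()
            if not label.startswith("__label__") or not text:
                continue  # skip lines that are not labelled examples
            by_label[label].append(text)
            by_label[label].extend(short_prefixes(text))

    rng = random.Random(seed)
    train, valid = [], []
    for label, texts in sorted(by_label.items()):
        rng.shuffle(texts)
        texts = texts[:cap]  # cap the number of examples per label
        n_valid = int(len(texts) * valid_frac)
        valid += [f"{label} {t}" for t in texts[:n_valid]]
        train += [f"{label} {t}" for t in texts[n_valid:]]

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "train.txt").write_text("\n".join(train) + "\n", encoding="utf-8")
    (out / "valid.txt").write_text("\n".join(valid) + "\n", encoding="utf-8")
    return train, valid
```

The sketch writes its output to a directory separate from the input so that train.txt and valid.txt are not picked up as corpus files on the next run.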
Data sources used¶
| Source | Languages |
|---|---|
| YouVersion Bible scrapes | most labels — see scripts/scrape_youversion.py |
| JW.org scrapes (legacy) | mya, ksw, cnh — inherited from pyidaungsu |
| Mon Wikipedia dump | mnw (no YouVersion Bible exists) |
| Myanmar Times xlsx export | mya (non-Bible Burmese register) |
| shannews.org xlsx export | shn (Shan news) |
| Werribee Karen Bible | ksw |
| Kayah Li (origin unknown) | small eky sample |
Adding a new source¶
- Get the text into fastText format.
- Save as corpus/data/<iso>_<source>.txt.
- Run uv run python scripts/build_corpus.py to rebuild the training/validation split.
- Run uv run python scripts/train_detector.py ... to retrain.
The corpus build script auto-discovers any *.txt in the directory — no
list to update.
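The first two steps might look like the sketch below, which assumes the default fastText `__label__` prefix. `write_source` is a hypothetical helper, and the xyz label and demo source name in the usage line are placeholders.

```python
from pathlib import Path
from typing import Iterable


def write_source(texts: Iterable[str], iso: str, source: str, data_dir: Path) -> Path:
    """Write texts as labelled lines to <data_dir>/<iso>_<source>.txt."""
    out = Path(data_dir) / f"{iso}_{source}.txt"
    with out.open("w", encoding="utf-8") as fh:
        for text in texts:
            # One example per line: collapse newlines and repeated whitespace.
            flat = " ".join(text.split())
            if flat:
                fh.write(f"__label__{iso} {flat}\n")
    return out
```

A call such as `write_source(sentences, "xyz", "demo", Path("corpus/data"))` would then produce corpus/data/xyz_demo.txt, ready for the build script to discover.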
Scraping YouVersion¶
For languages with a YouVersion Bible, the scraper fetches the text and writes it in fastText format:
uv run python scripts/scrape_youversion.py \
--version 327 --label cnh \
--out ../corpus/data/cnh_youversion.txt \
--sleep 0.4
The script walks every chapter of every book (standard Protestant canon),
extracts verses from data-usfm markers in the chapter HTML, and writes
fastText-format output. --sleep 0.4 is the polite delay between
requests.
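To make the extraction step concrete, here is a minimal sketch using only the standard library. It is not the code in scripts/scrape_youversion.py. It assumes that verse elements carry a data-usfm value of the form BOOK.CHAPTER.VERSE and that the markup is well nested; the class and function names are hypothetical, and the default `__label__` prefix is assumed.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class VerseExtractor(HTMLParser):
    """Collect text found inside elements marked with a verse-level data-usfm."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.verses: Dict[str, List[str]] = {}
        self._stack: List[Optional[str]] = []  # verse ref per open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        usfm = dict(attrs).get("data-usfm")
        if usfm is not None and usfm.count(".") == 2:
            ref = usfm  # verse-level marker, e.g. MAT.1.1
        else:
            ref = self._stack[-1] if self._stack else None  # inherit from parent
        self._stack.append(ref)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        ref = self._stack[-1] if self._stack else None
        if ref and data.strip():
            self.verses.setdefault(ref, []).append(data.strip())


def chapter_to_fasttext(html: str, label: str) -> List[str]:
    """Return one labelled line per verse found in the chapter HTML."""
    parser = VerseExtractor()
    parser.feed(html)
    parser.close()
    return [f"__label__{label} {' '.join(parts)}" for parts in parser.verses.values()]
```

Real chapter pages may nest verse numbers, footnotes, or headings inside the marked elements, so a production scraper would likely need extra filtering.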
To find a version's numeric ID: visit bible.com/languages/<iso>, click
through to a chapter, and grab the number from the URL
(bible.com/bible/<ID>/MAT.1).
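If the URL is already on the clipboard, the ID can be pulled out with a regular expression. The sketch below assumes the URL shape shown above; `parse_chapter_url` is a hypothetical helper.

```python
import re
from typing import Tuple

# Matches bible.com/bible/<ID>/<BOOK>.<CHAPTER>, ignoring anything after it.
_CHAPTER_URL = re.compile(r"bible\.com/bible/(\d+)/([A-Z0-9]+)\.(\d+)")


def parse_chapter_url(url: str) -> Tuple[int, str, int]:
    """Return (version_id, book, chapter) from a bible.com chapter URL."""
    match = _CHAPTER_URL.search(url)
    if match is None:
        raise ValueError(f"not a bible.com chapter URL: {url!r}")
    return int(match.group(1)), match.group(2), int(match.group(3))
```

The first element of the returned tuple is the value to pass as --version.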