
Block C · Tokenizer

Word Tokenizer V2

A space-aware hybrid tokenizer — whole-word, subword, and byte fallback — that beats gzip-9 by 7.19 percentage points on real FineWeb-EDU text.

FROZEN · v5.0.0-β.2

Block C · Architecture

Hybrid segmentation pipeline

raw text → Input: UTF-8 string, split on whitespace boundaries
whole-word → If the full word exists in the vocab, emit a single token (95.9% learned coverage)
subword DP → Dynamic-programming segmentation across known subword entries (whole_ratio = 0.9375)
byte fallback → Unknown characters become individual byte tokens (1.26% of input bytes)
token IDs → Integer sequence over a 32,294-slot vocab
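The three fallback tiers above can be sketched as follows. This is a minimal illustration, not the frozen implementation: the vocab layout, byte-token offset, and `whole_ratio` scheduling of the real tokenizer are assumptions here.

```python
# Sketch of the whole-word -> subword-DP -> byte-fallback cascade.
# The vocab structure (flat string -> id dict) is illustrative only.

def tokenize_word(word: str, vocab: dict[str, int], byte_offset: int = 0) -> list[int]:
    # Tier 1: whole-word hit -> a single token.
    if word in vocab:
        return [vocab[word]]
    # Tier 2: DP over substrings -> fewest known-subword tokens.
    n = len(word)
    best: list[list[int] | None] = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in vocab and best[j] is not None:
                cand = best[j] + [vocab[piece]]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    if best[n] is not None:
        return best[n]
    # Tier 3: byte fallback -> one token per UTF-8 byte.
    return [byte_offset + b for b in word.encode("utf-8")]

def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    # Whitespace split happens before any vocab lookup.
    return [t for w in text.split() for t in tokenize_word(w, vocab)]
```

With a toy vocab `{"hello": 0, "wor": 1, "ld": 2}`, `tokenize("hello world", ...)` takes the whole-word path for "hello" and the DP path for "world"; a word with no known pieces falls through to raw bytes.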
Vocab slots: 32,294
whole_ratio: 0.9375
SuperBPE τ: 0.9 (arXiv:2503.13423)
Training corpus: FineWeb-EDU, 100 MB
Segmentation: DP per word
Unreachable tokens: 0 / 2,000

Block C · Results

Compression benchmarks

30.43% — real Huffman compression on 10 MB of FineWeb-EDU text (lower = better)

Method                Compression ratio   vs VRAXION
VRAXION C (Huffman)   30.43%              —
bzip2-9               29.97%              −0.46 pp (better)
lzma-9e               28.61%              −1.82 pp (better)
gzip-9                37.62%              +7.19 pp (worse)
Learned coverage: 95.90%
Byte fallback rate: 1.26%
Adversarial edge cases: 14/14
Unreachable tokens: 0
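The stdlib baselines in the table can be reproduced with Python's built-in compressors. This is a sketch under the assumption that the benchmark used the standard level-9 settings the names suggest; the exact percentages depend on the corpus (the table above used 10 MB of FineWeb-EDU text).

```python
import bz2
import lzma
import zlib

def ratio(compressed: int, raw: int) -> float:
    # Compression ratio as a percentage of the original size (lower = better).
    return 100.0 * compressed / raw

def baseline_ratios(data: bytes) -> dict[str, float]:
    # Stdlib equivalents of the gzip-9 / bzip2-9 / lzma-9e baselines.
    raw = len(data)
    return {
        "gzip-9": ratio(len(zlib.compress(data, 9)), raw),
        "bzip2-9": ratio(len(bz2.compress(data, 9)), raw),
        "lzma-9e": ratio(len(lzma.compress(data, preset=9 | lzma.PRESET_EXTREME)), raw),
    }
```

Running this over the same 10 MB slice of FineWeb-EDU should recover the baseline column; the VRAXION row additionally requires the frozen tokenizer plus a Huffman coder over the token stream.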

Block C · Artifacts

Deploy & reproduce

Champion vocab file
output/word_tokenizer_champion/champion_vocab.json
4.24 MB · 32,294 entries

Reproduce in one command

python tools/diag_word_tokenizer_champion_freeze.py
Vocab JSON
4.24 MB
Vocab slots
32,294
Edge cases pass
14/14
Version
v5.0.0-β.2

Block C · Roadmap

What comes next

The tokenizer feeds Block D (Word Embedder). Token IDs from vocab 0–32,293 index into the 32,294×64 embedding table.
Close the 0.46 pp gap behind bzip2-9: explore higher whole_ratio values and joint BPE/whole-word scheduling.
Multilingual extension: the current vocab is trained on English FineWeb-EDU only. Byte fallback handles unknowns, but subword coverage degrades for non-Latin scripts.
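The Block D handoff described above amounts to a plain row lookup: every token ID in 0–32,293 indexes the 32,294×64 table. A hedged sketch (the table here is randomly initialized for illustration; Block D's actual weights and framework are not specified in this document):

```python
import numpy as np

VOCAB_SIZE, EMBED_DIM = 32_294, 64  # Block D table shape from the roadmap

# Illustrative embedding table; real Block D weights would be loaded instead.
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((VOCAB_SIZE, EMBED_DIM), dtype=np.float32)

def embed(token_ids: list[int]) -> np.ndarray:
    # Every tokenizer output is a valid row index: 0 <= id <= 32,293,
    # guaranteed by the zero-unreachable-token result above.
    ids = np.asarray(token_ids)
    if ids.min() < 0 or ids.max() >= VOCAB_SIZE:
        raise IndexError("token id outside vocab range")
    return embedding_table[ids]  # shape: (len(token_ids), 64)
```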

Related PRs & clusters

#130 Cluster 16