
Block C · Tokenizer

Word Tokenizer V2

A space-aware hybrid tokenizer — whole-word, subword, and byte fallback — that beats gzip-9 by 7.19 percentage points on real FineWeb-EDU text.

FROZEN · v5.0.0-β.2

Block C · Architecture

Hybrid segmentation pipeline

raw text → Input: UTF-8 string, split on whitespace boundaries
whole-word → If the full word exists in the vocab, emit a single token (95.9% learned coverage)
subword DP → Dynamic-programming segmentation across known subword entries (whole_ratio = 0.9375)
byte fallback → Unknown characters become individual byte tokens (1.26% of input bytes)
token IDs → Integer sequence over a 32,294-slot vocab
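The three fallback tiers above can be sketched as follows. This is a minimal illustration, not the frozen implementation: the vocab layout, byte-token offset, and `whole_ratio` scheduling of the real tokenizer are assumptions here.

```python
# Sketch of the whole-word -> subword-DP -> byte-fallback cascade.
# The vocab structure (flat string -> id dict) is illustrative only.

def tokenize_word(word: str, vocab: dict[str, int], byte_offset: int = 0) -> list[int]:
    # Tier 1: whole-word hit -> a single token.
    if word in vocab:
        return [vocab[word]]
    # Tier 2: DP over substrings -> fewest known-subword tokens.
    n = len(word)
    best: list[list[int] | None] = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in vocab and best[j] is not None:
                cand = best[j] + [vocab[piece]]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    if best[n] is not None:
        return best[n]
    # Tier 3: byte fallback -> one token per UTF-8 byte.
    return [byte_offset + b for b in word.encode("utf-8")]

def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    # Whitespace split happens before any vocab lookup.
    return [t for w in text.split() for t in tokenize_word(w, vocab)]
```

With a toy vocab `{"hello": 0, "wor": 1, "ld": 2}`, `tokenize("hello world", ...)` takes the whole-word path for "hello" and the DP path for "world"; a word with no known pieces falls through to raw bytes.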
Vocab slots: 32,294
whole_ratio: 0.9375
SuperBPE τ: 0.9 (arXiv:2503.13423)
Training corpus: FineWeb-EDU, 100 MB
Segmentation: DP per word
Unreachable tokens: 0 / 2,000

Block C · Results

Compression benchmarks

30.43% — real Huffman compression on 10 MB of FineWeb-EDU text (lower = better)

Method                Compression ratio   vs VRAXION
VRAXION C (Huffman)   30.43%              —
bzip2-9               29.97%              −0.46 pp (better)
lzma-9e               28.61%              −1.82 pp (better)
gzip-9                37.62%              +7.19 pp (worse)
Learned coverage: 95.90%
Byte fallback rate: 1.26%
Adversarial edge cases: 14/14
Unreachable tokens: 0
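The stdlib baselines in the table can be reproduced with Python's built-in compressors. This is a sketch under the assumption that the benchmark used the standard level-9 settings the names suggest; the exact percentages depend on the corpus (the table above used 10 MB of FineWeb-EDU text).

```python
import bz2
import lzma
import zlib

def ratio(compressed: int, raw: int) -> float:
    # Compression ratio as a percentage of the original size (lower = better).
    return 100.0 * compressed / raw

def baseline_ratios(data: bytes) -> dict[str, float]:
    # Stdlib equivalents of the gzip-9 / bzip2-9 / lzma-9e baselines.
    raw = len(data)
    return {
        "gzip-9": ratio(len(zlib.compress(data, 9)), raw),
        "bzip2-9": ratio(len(bz2.compress(data, 9)), raw),
        "lzma-9e": ratio(len(lzma.compress(data, preset=9 | lzma.PRESET_EXTREME)), raw),
    }
```

Running this over the same 10 MB slice of FineWeb-EDU should recover the baseline column; the VRAXION row additionally requires the frozen tokenizer plus a Huffman coder over the token stream.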

Block C · Artifacts

Deploy & reproduce

Champion vocab file
output/word_tokenizer_champion/champion_vocab.json
4.24 MB · 32,294 entries

Reproduce in one command

python tools/diag_word_tokenizer_champion_freeze.py
Vocab JSON
4.24 MB
Vocab slots
32,294
Edge cases pass
14/14
Version
v5.0.0-β.2

Block C · Roadmap

What comes next

The tokenizer feeds Block D (Word Embedder). Token IDs from vocab 0–32,293 index into the 32,294×64 embedding table.
Close the 0.46 pp gap behind bzip2-9: explore higher whole_ratio values and joint BPE/whole-word scheduling.
Multilingual extension: the current vocab is trained on English FineWeb-EDU only. Byte fallback handles unknowns, but subword coverage degrades for non-Latin scripts.
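The Block D handoff described above amounts to a plain row lookup: every token ID in 0–32,293 indexes the 32,294×64 table. A hedged sketch (the table here is randomly initialized for illustration; Block D's actual weights and framework are not specified in this document):

```python
import numpy as np

VOCAB_SIZE, EMBED_DIM = 32_294, 64  # Block D table shape from the roadmap

# Illustrative embedding table; real Block D weights would be loaded instead.
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((VOCAB_SIZE, EMBED_DIM), dtype=np.float32)

def embed(token_ids: list[int]) -> np.ndarray:
    # Every tokenizer output is a valid row index: 0 <= id <= 32,293,
    # guaranteed by the zero-unreachable-token result above.
    ids = np.asarray(token_ids)
    if ids.min() < 0 or ids.max() >= VOCAB_SIZE:
        raise IndexError("token id outside vocab range")
    return embedding_table[ids]  # shape: (len(token_ids), 64)
```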

Related PRs & clusters

#130 Cluster 16