Block B · L1 Merger · Variant
An alternative L1 deploy candidate: linear identity autoencoder, H=120 hidden, 7-bit integer weights, 3-bit biases, one fp32 scalar α. Packs to 3,421 B natively — no Huffman decode step, yet still ~0.55% smaller than the 3,440 B packed champion. 100% lossless across all 65,536 byte pairs.
Variant · Architecture
y = α · ((x · W + b1) · Wᵀ + b2). No activation — just a linear identity map through the 7-bit weight field.
Two L0 outputs are concatenated into a 32-dim input, projected through a single shared W matrix (32×120) into a 120-dim hidden, then mirrored back. Unlike the Block B champion — which uses H=81 + C19 activation — this variant widens the hidden to H=120 and drops the activation entirely. The tradeoff: more cells, but every cell quantizes cleanly to 7-bit integer without the C19 aux parameters (c, rho) fighting the quantizer.
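The forward pass above can be sketched in a few lines of NumPy. Shapes and the single shared W follow the text; the weight, bias, and α values here are random placeholders, not the trained artifact.

```python
import numpy as np

# Shapes from the text: two 16-dim L0 outputs concatenate to a 32-dim input, H = 120.
D_IN, H = 32, 120

rng = np.random.default_rng(0)
W = rng.integers(-64, 64, size=(D_IN, H)).astype(np.int32)  # int7 range [-64, 63]
b1 = rng.integers(-4, 4, size=H).astype(np.int32)           # 3-bit biases
b2 = rng.integers(-4, 4, size=D_IN).astype(np.int32)
alpha = np.float32(2 ** -12)                                # the one fp32 scalar (placeholder)

def merge(x):
    """y = alpha * ((x @ W + b1) @ W.T + b2): linear identity map, no activation."""
    h = x @ W + b1      # project 32 -> 120 through the shared W
    y = h @ W.T + b2    # mirror back 120 -> 32 with the same (transposed) W
    return alpha * y

x = rng.integers(-8, 8, size=D_IN).astype(np.int32)  # stand-in for two L0 latents
print(merge(x).shape)  # (32,)
```

The same W serves both projections, which is what keeps the weight field at one 32×120 matrix instead of two.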
Variant · Pipeline position
This variant is not a separate layer — it is an alternative artifact for the same slot as the Block B champion. L0 reads one byte at a time; L1 (either champion or variant) takes two consecutive L0 outputs and returns one merged 32-dim latent. Downstream layers (Tokenizer → Embedder → Brain) are identical across both L1 artifacts.
| Layer | Block A · L0 | Block B · champion | Block B · variant |
|---|---|---|---|
| Input / output | 1 byte → 16-dim | 2×16 → 32-dim | 2×16 → 32-dim |
| Hidden | H = 16 | H = 81 | H = 120 |
| Activation | C19 | C19 | identity |
| Weights | binary ±1 | fp16 + Huffman pack | int7 native |
| Deploy size | 4.0 KB · int8 LUT | 3,440 B | 3,421 B |
| Decode step | none (LUT) | canonical Huffman walk | none (native) |
| Lossless | 256 / 256 | 65,536 / 65,536 | 65,536 / 65,536 |
Both L1 artifacts plug into the same pipeline slot. Downstream Tokenizer / Embedder / Brain do not know which one fed them.
Variant · Bit budget
The champion path (Block B proper) reaches 3,440 B by Huffman-coding an fp16 mirror model; the reader has to walk a canonical Huffman table to reconstruct weights before inference. This variant skips the decode entirely — the bytes on disk are the quantized weights. Smaller and simpler, at the cost of a different architecture (identity vs C19) and a larger hidden width (H=120 vs H=81). Both sit ~41% above the theoretical 2,422 B Shannon floor — the gap that arithmetic / range coding on the int7 stream could close next.
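The 3,421 B figure and the ~41% gap to the Shannon floor fall straight out of the stated bit widths; a back-of-envelope check (the real packer may lay bytes out differently, but the totals match):

```python
import math

D_IN, H = 32, 120
weight_bits = D_IN * H * 7      # one shared 32x120 matrix at 7 bits per weight
bias_bits = (H + D_IN) * 3      # b1 (120) and b2 (32) at 3 bits each
alpha_bytes = 4                 # the single fp32 scalar

total = weight_bits // 8 + math.ceil(bias_bits / 8) + alpha_bytes
print(total)                    # -> 3421
gap = total / 2422 - 1          # distance above the 2,422 B Shannon floor
print(round(100 * gap))         # -> 41
```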
Block B champion (Huffman) vs. native 7-bit variant: same lossless guarantee, same deploy slot, opposite compression strategy.
Variant · Experimental findings
The autonomous compression loop swept activation × codebook × hidden-width × single-W/dual-W over ~34 iterations. GPT ran three parallel probes on top: alpha ablation, post-hoc bias quantization ladder, and C19 aux-parameter quantization. The three findings below explain the exact shape of this variant.
Post-hoc bias quantization ladder:
- Seed 42: b1 + b2 still exact at 3-bit; 2-bit already produces 1 bad pair.
- Seed 7: b1 + b2 still exact all the way down to 1-bit.

Alpha ablation:
- Seed 42: 0.085% correct, 65,480 bad pairs.
- Seed 7: 0.069% correct, 65,491 bad pairs.
- Even fp16 α already breaks the model (314–561 bad pairs). STE/QAT on its own does not absorb the scale; the one fp32 word stays. 4 bytes well spent.
C19 aux-parameter quantization:
- b1 at int8 → 12 bad; b2 at int8 → 41 bad.
- c at int8 → 3 bad; rho at int4/5/6/7/8 all → 1 bad.
- all_aux bundled at int8 → 13 bad.

Seed 7 is strictly the easier seed on the bias floor: it tolerates 1-bit biases. The 3-bit budget above is the multi-seed-safe floor; seed 42 is the tighter constraint.
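A minimal sketch of how a post-hoc ladder of this kind can be run. The helper and toy vector are illustrative, not the repo's code; in the real probe the error metric is bad byte pairs over all 65,536 inputs, not raw snap error.

```python
import numpy as np

def quantize_symmetric(v, bits):
    """Snap a float vector onto a symmetric n-bit integer grid (per-tensor scale)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(v).max() / qmax
    return np.round(v / scale) * scale

# Ladder: walk the bias bit width down and measure the snap error at each rung.
b = np.array([0.75, -0.25, 0.5, -1.0])  # toy bias vector
for bits in (8, 4, 3, 2):
    err = float(np.abs(quantize_symmetric(b, bits) - b).max())
    print(bits, err)
```

Post-hoc here means exactly this: snap the already-trained tensor, re-test, no retraining in the loop.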
Variant · Why 7-bit
Before touching the optimizer, the bake probe measures the representation ceiling of each codebook: train the merger in float to 100%, snap the weights to the target codebook at every α in a grid, take the best snap. No polish, no QAT — just "can this codebook hold the solution at all?" Below 4 bit / weight the answer is no, and QAT cannot rescue what the codebook does not contain.
Single-W H=81 bake probe, no polish. The last 10.83pp from 7-bit to 100% is what QAT + LBFGS close.
Not representable
Below 4 bit/weight the codebook cannot hold the merger solution. Bake ceiling ≤ 29% regardless of seed, architecture, or optimizer. A pure representation-space limit.
First viable width
7-bit is where the bake clears 89% without polish — enough slack for QAT to close to 100%. Smaller widths leave too much representational debt.
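The bake probe described above can be sketched as: snap the float solution onto the codebook at every α in a grid, score each snap, keep the best. Names, the toy score, and the α grid here are illustrative; the real probe scores lossless byte-pair accuracy.

```python
import numpy as np

def snap(W, bits, alpha):
    """Snap float weights onto a `bits`-wide signed integer codebook at scale alpha."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(W / alpha), -qmax - 1, qmax) * alpha

def bake_ceiling(W_float, bits, score, alphas):
    """Best score any alpha in the grid reaches after a raw snap: no polish, no QAT."""
    return max(score(snap(W_float, bits, a)) for a in alphas)

rng = np.random.default_rng(42)
W = rng.normal(size=(32, 120))
# Toy score: fraction of weights the snap reproduces within a small tolerance.
score = lambda Wq: float(np.mean(np.abs(Wq - W) < 1e-2))
print(bake_ceiling(W, 7, score, np.geomspace(1e-3, 1.0, 64)))
```

If the best snap at a given bit width is far from 100%, no amount of QAT polish recovers it: the answer simply isn't in the codebook.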
Variant · Artifacts
Purple = float regime, cyan = scan/pin, amber = quantized regime. Each phase carries the previous phase's weights forward; no restarts.
Champion · Huffman
canonical Huffman walk → fp16 bit-unpack → fp16 → fp32 cast → MAC
one table-driven decode walk per weight load (cycles scale with Huffman code length), then a standard MAC
Variant · Native int7
read byte → int7 sign-extend → MAC against int32 accumulator → α·fp32 at the end
zero decode cycles; hot-path is pure load-and-MAC
Hypothesis only — no benchmark yet. The advantage compounds when weights churn out of cache; for hot weights kept in L1 cache, the MAC dominates and the gap collapses. One of the first things to measure after the native packer port lands.
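The native read path above can be sketched as follows, assuming one int7 weight per byte stored in the low 7 bits (the actual on-disk packing layout is not specified here):

```python
import numpy as np

def sign_extend_int7(raw):
    """Interpret the low 7 bits of each byte as a signed int7 in [-64, 63]."""
    v = (raw & 0x7F).astype(np.int32)
    return np.where(v >= 64, v - 128, v)

def mac_row(raw_row, x, alpha):
    """read byte -> int7 sign-extend -> MAC in an int32 accumulator -> one fp32 scale."""
    acc = int(np.dot(sign_extend_int7(raw_row), x.astype(np.int32)))
    return np.float32(alpha) * acc

raw = np.array([0x3F, 0x40, 0x00, 0x7F], dtype=np.uint8)
print(sign_extend_int7(raw))  # 63, -64, 0, -1
```

The hot path touches no tables: a load, a sign-extend, and an integer multiply-accumulate, with the single α·fp32 conversion deferred to the end of each row.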
Variant · Roadmap
Next step: port the native int7 packer to both codebases (Python/block_b_merger/ and Rust/src/block_b_merger/). The read path becomes a pure int7 multiply-accumulate plus one fp32 scale — no Huffman table, no canonical tree walk.