Block B · L1 Merger · Variant
An alternative L1 deploy candidate: linear identity autoencoder, H=120 hidden, 7-bit integer weights, 3-bit biases, one fp32 scalar α. Packs to 3,421 B natively — no Huffman decode step, yet still ~0.55% smaller than the 3,440 B packed champion. 100% lossless across all 65,536 byte pairs.
Variant · Architecture
y = α · ((x · W + b1) · Wᵀ + b2). No activation — just a linear identity map through the 7-bit weight field.
Two L0 outputs are concatenated into a 32-dim input, projected through a single shared W matrix (32×120) into a 120-dim hidden, then mirrored back. Unlike the Block B champion — which uses H=81 + C19 activation — this variant widens the hidden to H=120 and drops the activation entirely. The tradeoff: more cells, but every cell quantizes cleanly to 7-bit integer without the C19 aux parameters (c, rho) fighting the quantizer.
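The forward pass above can be sketched in a few lines of NumPy. Shapes and the single shared W follow the text; the weight, bias, and α values here are random placeholders, not the trained artifact.

```python
import numpy as np

# Shapes from the text: two 16-dim L0 outputs concatenate to a 32-dim input, H = 120.
D_IN, H = 32, 120

rng = np.random.default_rng(0)
W = rng.integers(-64, 64, size=(D_IN, H)).astype(np.int32)  # int7 range [-64, 63]
b1 = rng.integers(-4, 4, size=H).astype(np.int32)           # 3-bit biases
b2 = rng.integers(-4, 4, size=D_IN).astype(np.int32)
alpha = np.float32(2 ** -12)                                # the one fp32 scalar (placeholder)

def merge(x):
    """y = alpha * ((x @ W + b1) @ W.T + b2): linear identity map, no activation."""
    h = x @ W + b1      # project 32 -> 120 through the shared W
    y = h @ W.T + b2    # mirror back 120 -> 32 with the same (transposed) W
    return alpha * y

x = rng.integers(-8, 8, size=D_IN).astype(np.int32)  # stand-in for two L0 latents
print(merge(x).shape)  # (32,)
```

The same W serves both projections, which is what keeps the weight field at one 32×120 matrix instead of two.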
Variant · Pipeline position
This variant is not a separate layer — it is an alternative artifact for the same slot as the Block B champion. L0 reads one byte at a time; L1 (either champion or variant) takes two consecutive L0 outputs and returns one merged 32-dim latent. Downstream layers (Tokenizer → Embedder → Brain) are identical across both L1 artifacts.
| Layer | Block A · L0 | Block B · champion | Block B · variant |
|---|---|---|---|
| Input / output | 1 byte → 16-dim | 2×16 → 32-dim | 2×16 → 32-dim |
| Hidden | H = 16 | H = 81 | H = 120 |
| Activation | C19 | C19 | identity |
| Weights | binary ±1 | fp16 + Huffman pack | int7 native |
| Deploy size | 4.0 KB · int8 LUT | 3,440 B | 3,421 B |
| Decode step | none (LUT) | canonical Huffman walk | none (native) |
| Lossless | 256 / 256 | 65,536 / 65,536 | 65,536 / 65,536 |
Both L1 artifacts plug into the same pipeline slot. Downstream Tokenizer / Embedder / Brain do not know which one fed them.
Variant · Bit budget
The champion path (Block B proper) reaches 3,440 B by Huffman-coding an fp16 mirror model; the reader has to walk a canonical Huffman table to reconstruct weights before inference. This variant skips the decode entirely — the bytes on disk are the quantized weights. Smaller and simpler, at the cost of a different architecture (identity vs C19) and a larger hidden width (H=120 vs H=81). Both sit ~41% above the theoretical 2,422 B Shannon floor — the gap that arithmetic / range coding on the int7 stream could close next.
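The 3,421 B figure and the ~41% gap to the Shannon floor fall straight out of the stated bit widths; a back-of-envelope check (the real packer may lay bytes out differently, but the totals match):

```python
import math

D_IN, H = 32, 120
weight_bits = D_IN * H * 7      # one shared 32x120 matrix at 7 bits per weight
bias_bits = (H + D_IN) * 3      # b1 (120) and b2 (32) at 3 bits each
alpha_bytes = 4                 # the single fp32 scalar

total = weight_bits // 8 + math.ceil(bias_bits / 8) + alpha_bytes
print(total)                    # -> 3421
gap = total / 2422 - 1          # distance above the 2,422 B Shannon floor
print(round(100 * gap))         # -> 41
```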
Block B champion (Huffman) vs. native 7-bit variant: same lossless guarantee, same deploy slot, opposite compression strategy.
Variant · Experimental findings
The autonomous compression loop swept activation × codebook × hidden-width × single-W/dual-W over ~34 iterations. GPT ran three parallel probes on top: alpha ablation, post-hoc bias quantization ladder, and C19 aux-parameter quantization. The three findings below explain the exact shape of this variant.
Post-hoc bias quantization ladder:
- Seed 42: b1 + b2 still exact at 3-bit; 2-bit already produces 1 bad pair.
- Seed 7: b1 + b2 still exact all the way down to 1-bit.

Alpha ablation:
- Seed 42: 0.085% correct, 65,480 bad pairs.
- Seed 7: 0.069% correct, 65,491 bad pairs.
- Even fp16 α already breaks the model (314–561 bad pairs). STE/QAT on its own does not absorb the scale; the one fp32 word stays. 4 bytes well spent.
C19 aux-parameter quantization:
- b1 at int8 → 12 bad; b2 at int8 → 41 bad.
- c at int8 → 3 bad; rho at int4/5/6/7/8 all → 1 bad.
- all_aux bundled at int8 → 13 bad.

Seed 7 is strictly the easier seed on the bias floor: it tolerates 1-bit biases. The 3-bit budget above is the multi-seed-safe floor; seed 42 is the tighter constraint.
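A minimal sketch of how a post-hoc ladder of this kind can be run. The helper and toy vector are illustrative, not the repo's code; in the real probe the error metric is bad byte pairs over all 65,536 inputs, not raw snap error.

```python
import numpy as np

def quantize_symmetric(v, bits):
    """Snap a float vector onto a symmetric n-bit integer grid (per-tensor scale)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(v).max() / qmax
    return np.round(v / scale) * scale

# Ladder: walk the bias bit width down and measure the snap error at each rung.
b = np.array([0.75, -0.25, 0.5, -1.0])  # toy bias vector
for bits in (8, 4, 3, 2):
    err = float(np.abs(quantize_symmetric(b, bits) - b).max())
    print(bits, err)
```

Post-hoc here means exactly this: snap the already-trained tensor, re-test, no retraining in the loop.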
Variant · Why 7-bit
Before touching the optimizer, the bake probe measures the representation ceiling of each codebook: train the merger in float to 100%, snap the weights to the target codebook at every α in a grid, take the best snap. No polish, no QAT — just "can this codebook hold the solution at all?" Below 4 bit / weight the answer is no, and QAT cannot rescue what the codebook does not contain.
Single-W H=81 bake probe, no polish. The last 10.83pp from 7-bit to 100% is what QAT + LBFGS close.
Not representable
Below 4 bit/weight the codebook cannot hold the merger solution. Bake ceiling ≤ 29% regardless of seed, architecture, or optimizer. A pure representation-space limit.
First viable width
7-bit is where the bake clears 89% without polish — enough slack for QAT to close to 100%. Smaller widths leave too much representational debt.
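The bake probe described above can be sketched as: snap the float solution onto the codebook at every α in a grid, score each snap, keep the best. Names, the toy score, and the α grid here are illustrative; the real probe scores lossless byte-pair accuracy.

```python
import numpy as np

def snap(W, bits, alpha):
    """Snap float weights onto a `bits`-wide signed integer codebook at scale alpha."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(W / alpha), -qmax - 1, qmax) * alpha

def bake_ceiling(W_float, bits, score, alphas):
    """Best score any alpha in the grid reaches after a raw snap: no polish, no QAT."""
    return max(score(snap(W_float, bits, a)) for a in alphas)

rng = np.random.default_rng(42)
W = rng.normal(size=(32, 120))
# Toy score: fraction of weights the snap reproduces within a small tolerance.
score = lambda Wq: float(np.mean(np.abs(Wq - W) < 1e-2))
print(bake_ceiling(W, 7, score, np.geomspace(1e-3, 1.0, 64)))
```

If the best snap at a given bit width is far from 100%, no amount of QAT polish recovers it: the answer simply isn't in the codebook.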
Variant · Artifacts
Purple = float regime, cyan = scan/pin, amber = quantized regime. Each phase carries the previous phase's weights forward; no restarts.
Champion · Huffman
canonical Huffman walk → fp16 bit-unpack → fp16 → fp32 cast → MAC
one table-driven decode walk per weight load (cycles scale with Huffman code length), then a standard MAC
Variant · Native int7
read byte → int7 sign-extend → MAC against int32 accumulator → α·fp32 at the end
zero decode cycles; hot-path is pure load-and-MAC
Hypothesis only — no benchmark yet. The advantage compounds when weights churn out of cache; for hot weights kept in L1 cache, the MAC dominates and the gap collapses. One of the first things to measure after the native packer port lands.
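The native read path above can be sketched as follows, assuming one int7 weight per byte stored in the low 7 bits (the actual on-disk packing layout is not specified here):

```python
import numpy as np

def sign_extend_int7(raw):
    """Interpret the low 7 bits of each byte as a signed int7 in [-64, 63]."""
    v = (raw & 0x7F).astype(np.int32)
    return np.where(v >= 64, v - 128, v)

def mac_row(raw_row, x, alpha):
    """read byte -> int7 sign-extend -> MAC in an int32 accumulator -> one fp32 scale."""
    acc = int(np.dot(sign_extend_int7(raw_row), x.astype(np.int32)))
    return np.float32(alpha) * acc

raw = np.array([0x3F, 0x40, 0x00, 0x7F], dtype=np.uint8)
print(sign_extend_int7(raw))  # 63, -64, 0, -1
```

The hot path touches no tables: a load, a sign-extend, and an integer multiply-accumulate, with the single α·fp32 conversion deferred to the end of each row.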
Variant · Roadmap
Next step: port the native int7 packer to both codebases (Python/block_b_merger/ and Rust/src/block_b_merger/). The read path becomes a pure int7 multiply-accumulate plus one fp32 scale — no Huffman table, no canonical tree walk.