
Block E · Language Model

Nano Brain V1

A 2-block causal Transformer with 64-dim embeddings and 4 attention heads. Forward-pass verified at 81 ms on CPU. Awaiting its first training run.

SCAFFOLD · v5.0.0-β.2

Block E · Architecture

Causal Transformer stack

Token embedder (Block D) · 32,294 × 64 lookup
Positional encoding · learned, max_seq=256
TransformerBlock ×1 · 64-dim · 4 heads · GELU FFN 64→256→64
TransformerBlock ×2 · 64-dim · 4 heads · GELU FFN 64→256→64
Output head (tied weights) · 64 → 32,294 logits

Causal (left-to-right) attention mask. Output head shares weights with the token embedder.

Model dim · 64
Attn heads · 4
FFN hidden · 256
Depth · 2 blocks
Max sequence · 256 tokens
Attention · Causal
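The stack above can be sketched end-to-end in NumPy. This is a minimal illustrative forward pass, not the project's real modules: names like `nano_forward` and `attn_block` are hypothetical, and the weights are random, matching the scaffold's untrained state.

```python
import numpy as np

# Sketch of the Nano Brain V1 forward pass, assuming the layer layout
# listed above (64-dim, 4 heads, GELU FFN 64->256->64, tied output head).
V, D, H, L_MAX, FFN = 32294, 64, 4, 256, 256
rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attn_block(x, p):
    T = x.shape[1]
    h = layer_norm(x)
    def heads(t):  # split 64 dims into 4 heads of 16
        return t.reshape(1, T, H, D // H).transpose(0, 2, 1, 3)
    q, k, v = heads(h @ p["wq"]), heads(h @ p["wk"]), heads(h @ p["wv"])
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(D // H)
    mask = np.triu(np.ones((T, T)), k=1).astype(bool)   # causal: no peeking ahead
    out = softmax(np.where(mask, -1e9, scores)) @ v
    x = x + out.transpose(0, 2, 1, 3).reshape(1, T, D) @ p["wo"]
    # GELU FFN 64 -> 256 -> 64, with residual
    return x + gelu(layer_norm(x) @ p["w1"]) @ p["w2"]

def init_block():
    shapes = [("wq", (D, D)), ("wk", (D, D)), ("wv", (D, D)), ("wo", (D, D)),
              ("w1", (D, FFN)), ("w2", (FFN, D))]
    return {k: rng.normal(0, 0.02, s) for k, s in shapes}

embed = rng.normal(0, 0.02, (V, D))    # token embedder (Block D)
pos = rng.normal(0, 0.02, (L_MAX, D))  # learned positional encoding
blocks = [init_block(), init_block()]  # depth 2

def nano_forward(token_ids):
    x = embed[token_ids][None] + pos[:len(token_ids)][None]
    for p in blocks:
        x = attn_block(x, p)
    return layer_norm(x) @ embed.T     # tied output head -> logits

logits = nano_forward(list(range(10)))
print(logits.shape)  # (1, 10, 32294)
```

A 10-token input yields the same `[1, 10, 32294]` logit tensor reported in the verification run below.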

Block E · Scaffold baseline

Forward-pass verification

Verified forward pass
"The cat sleeps peacefully on the warm mat." → 10 tokens
Output: [1, 10, 32294] logits · 81 ms on CPU
Total parameters · 2,182,144
Params in embedder · 94.7%
Params in 2 blocks · ~100k
Forward pass (CPU) · 81 ms
f32 memory · 8.73 MB
int8 memory · 2.18 MB
Init loss · 11.15
Uniform baseline · 10.38
Init loss 11.15 vs the uniform baseline of 10.38 (ln 32,294) — expected behaviour for randomly initialised weights. The ~0.77 excess is within normal random-init range, and loss will drop sharply once real text training begins.
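The headline numbers above follow from arithmetic alone — a quick back-of-envelope check, using only the reported total and the embedder dimensions:

```python
import math

VOCAB, DIM = 32294, 64
total_params = 2_182_144              # reported total
embedder = VOCAB * DIM                # 32,294 x 64 lookup table

share = embedder / total_params       # fraction of params in the embedder
f32_mb = total_params * 4 / 1e6       # 4 bytes per f32 weight
int8_mb = total_params * 1 / 1e6      # 1 byte per int8 weight
uniform_loss = math.log(VOCAB)        # cross-entropy of a uniform guess

print(f"embedder share {share:.1%}")                  # 94.7%
print(f"f32 {f32_mb:.2f} MB, int8 {int8_mb:.2f} MB")  # 8.73 MB, 2.18 MB
print(f"uniform baseline {uniform_loss:.2f}")         # 10.38
```

All four derived values match the stat cards, which is why a random-init loss near ln(vocab) counts as a pass.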

Block E · Artifacts

Deploy & reproduce

Scaffold artifact: weights are randomly initialised, so it is useful only for shape-checking the full pipeline end-to-end before training.

Reproduce in one command

python tools/diag_nano_brain_v1.py
f32 size · 8.73 MB
int8 size · 2.18 MB
Fwd pass CPU · 81 ms
Version · v5.0.0-β.2

Block E · Roadmap

What comes next

First training run: connect Block C → D → E on FineWeb-EDU, target perplexity below uniform baseline within 1,000 gradient steps.
After training converges: evaluate generation quality on held-out FineWeb-EDU slice; measure token-level perplexity vs GPT-2 small (124 M params) as a scaling reference.
Architecture search: 2-block depth is a starting point. Run ablations on depth (1 / 2 / 4 blocks) and FFN width (128 / 256 / 512) to find the Pareto front for perplexity vs size.
Once frozen: re-quantize to int8 and benchmark CPU throughput. Target < 50 ms for a 64-token forward pass.
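For the re-quantization step, one common scheme is symmetric per-tensor int8: scale each weight tensor by its max absolute value and round. This is an illustrative sketch — the project's actual quantizer isn't specified above:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor scheme: map max|w| to 127.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# One FFN weight matrix's worth of random f32 weights (64 x 256).
w = np.random.default_rng(0).normal(0, 0.02, (64, 256)).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()  # bounded by scale / 2

print(q.nbytes, w.nbytes)  # 16384 65536 -> 4x smaller, matching the 8.73 -> 2.18 MB ratio
```

Per-tensor scaling is the simplest choice; per-channel scales usually recover a little accuracy at the cost of storing one scale per row.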

Related PRs

#132