
Block E · Language Model

Nano Brain V1

A 2-block causal Transformer with 64-dim embeddings and 4 attention heads. Forward-pass verified at 81 ms on CPU. Awaiting its first training run.

SCAFFOLD · v5.0.0-β.2

Block E · Architecture

Causal Transformer stack

Token embedder (Block D) · 32,294 × 64 lookup
Positional encoding · learned, max_seq=256
TransformerBlock ×1 · 64-dim · 4 heads · GELU FFN 64→256→64
TransformerBlock ×2 · 64-dim · 4 heads · GELU FFN 64→256→64
Output head (tied weights) · 64 → 32,294 logits

Causal (left-to-right) attention mask. Output head shares weights with the token embedder.

Model dim · 64
Attn heads · 4
FFN hidden · 256
Depth · 2 blocks
Max sequence · 256 tokens
Attention · Causal
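The stack above can be sketched end-to-end in NumPy. This is a minimal illustrative forward pass, not the project's real modules: names like `nano_forward` and `attn_block` are hypothetical, and the weights are random, matching the scaffold's untrained state.

```python
import numpy as np

# Sketch of the Nano Brain V1 forward pass, assuming the layer layout
# listed above (64-dim, 4 heads, GELU FFN 64->256->64, tied output head).
V, D, H, L_MAX, FFN = 32294, 64, 4, 256, 256
rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attn_block(x, p):
    T = x.shape[1]
    h = layer_norm(x)
    def heads(t):  # split 64 dims into 4 heads of 16
        return t.reshape(1, T, H, D // H).transpose(0, 2, 1, 3)
    q, k, v = heads(h @ p["wq"]), heads(h @ p["wk"]), heads(h @ p["wv"])
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(D // H)
    mask = np.triu(np.ones((T, T)), k=1).astype(bool)   # causal: no peeking ahead
    out = softmax(np.where(mask, -1e9, scores)) @ v
    x = x + out.transpose(0, 2, 1, 3).reshape(1, T, D) @ p["wo"]
    # GELU FFN 64 -> 256 -> 64, with residual
    return x + gelu(layer_norm(x) @ p["w1"]) @ p["w2"]

def init_block():
    shapes = [("wq", (D, D)), ("wk", (D, D)), ("wv", (D, D)), ("wo", (D, D)),
              ("w1", (D, FFN)), ("w2", (FFN, D))]
    return {k: rng.normal(0, 0.02, s) for k, s in shapes}

embed = rng.normal(0, 0.02, (V, D))    # token embedder (Block D)
pos = rng.normal(0, 0.02, (L_MAX, D))  # learned positional encoding
blocks = [init_block(), init_block()]  # depth 2

def nano_forward(token_ids):
    x = embed[token_ids][None] + pos[:len(token_ids)][None]
    for p in blocks:
        x = attn_block(x, p)
    return layer_norm(x) @ embed.T     # tied output head -> logits

logits = nano_forward(list(range(10)))
print(logits.shape)  # (1, 10, 32294)
```

A 10-token input yields the same `[1, 10, 32294]` logit tensor reported in the verification run below.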

Block E · Scaffold baseline

Forward-pass verification

Verified forward pass
"The cat sleeps peacefully on the warm mat." → 10 tokens
Output: [1, 10, 32294] logits · 81 ms on CPU
Total parameters · 2,182,144
Params in embedder · 94.7%
Params in 2 blocks · ~100k
Forward pass (CPU) · 81 ms
f32 memory · 8.73 MB
int8 memory · 2.18 MB
Init loss · 11.15
Uniform baseline · 10.38
Init loss 11.15 vs the uniform baseline of 10.38 (ln 32,294) — expected behaviour for randomly initialised weights. The ~0.77 excess is within normal random-init range, and loss will drop sharply once real text training begins.
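The headline numbers above follow from arithmetic alone — a quick back-of-envelope check, using only the reported total and the embedder dimensions:

```python
import math

VOCAB, DIM = 32294, 64
total_params = 2_182_144              # reported total
embedder = VOCAB * DIM                # 32,294 x 64 lookup table

share = embedder / total_params       # fraction of params in the embedder
f32_mb = total_params * 4 / 1e6       # 4 bytes per f32 weight
int8_mb = total_params * 1 / 1e6      # 1 byte per int8 weight
uniform_loss = math.log(VOCAB)        # cross-entropy of a uniform guess

print(f"embedder share {share:.1%}")                  # 94.7%
print(f"f32 {f32_mb:.2f} MB, int8 {int8_mb:.2f} MB")  # 8.73 MB, 2.18 MB
print(f"uniform baseline {uniform_loss:.2f}")         # 10.38
```

All four derived values match the stat cards, which is why a random-init loss near ln(vocab) counts as a pass.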

Block E · Artifacts

Deploy & reproduce

Scaffold artifact: weights are randomly initialised, so it is useful only for shape-checking the full pipeline end-to-end before training.

Reproduce in one command

python tools/diag_nano_brain_v1.py
f32 size · 8.73 MB
int8 size · 2.18 MB
Fwd pass CPU · 81 ms
Version · v5.0.0-β.2

Block E · Roadmap

What comes next

First training run: connect Block C → D → E on FineWeb-EDU, target perplexity below uniform baseline within 1,000 gradient steps.
After training converges: evaluate generation quality on held-out FineWeb-EDU slice; measure token-level perplexity vs GPT-2 small (124 M params) as a scaling reference.
Architecture search: 2-block depth is a starting point. Run ablations on depth (1 / 2 / 4 blocks) and FFN width (128 / 256 / 512) to find the Pareto front for perplexity vs size.
Once frozen: re-quantize to int8 and benchmark CPU throughput. Target < 50 ms for a 64-token forward pass.
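For the re-quantization step, one common scheme is symmetric per-tensor int8: scale each weight tensor by its max absolute value and round. This is an illustrative sketch — the project's actual quantizer isn't specified above:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor scheme: map max|w| to 127.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# One FFN weight matrix's worth of random f32 weights (64 x 256).
w = np.random.default_rng(0).normal(0, 0.02, (64, 256)).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()  # bounded by scale / 2

print(q.nbytes, w.nbytes)  # 16384 65536 -> 4x smaller, matching the 8.73 -> 2.18 MB ratio
```

Per-tensor scaling is the simplest choice; per-channel scales usually recover a little accuracy at the cost of storing one scale per row.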

Related PRs

#132