Transformer Scaling Explainer

Move the sliders and watch how latency, attention cost, and KV cache memory scale with sequence length.
Controls
Model: Layers 24, Heads 16, d_model 1024, FFN mult 4, Batch 1, Sequence length 2048
Precision and KV cache: Weight precision 16 bits, KV precision 16 bits
Hardware: Peak TFLOPs 200, Bandwidth 900 GB/s, Weight reuse factor 0.1
Optional: Vocab size 0
Snapshot at current sequence length
Params: 301.99 M
Weights: 603.98 MB
KV cache: 201.33 MB
FLOPs per token: 402.65 M
Latency per token: 0.067 ms
Bottleneck: memory
If memory dominates, quantization and caching help. If compute dominates, you need more math throughput or a smaller model.
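The snapshot above can be reproduced with a small back-of-envelope calculator. This is a sketch under stated assumptions: a standard decoder block with 4·d² attention weights and 2·mult·d² FFN weights per layer, embeddings ignored (vocab size 0), and the common ~2 FLOPs per parameter per token estimate for compute (the exact FLOP count depends on what you count as a FLOP, so it may differ from the panel's figure):

```python
# Back-of-envelope transformer scaling calculator (a sketch, not the page's
# exact code). Matches the snapshot values for params, weights, KV cache,
# and memory-bound latency at the default slider settings.

def snapshot(layers=24, d_model=1024, ffn_mult=4, batch=1, seq_len=2048,
             weight_bits=16, kv_bits=16, peak_tflops=200, bw_gb_s=900,
             reuse=0.1):
    # Parameters: attention (4 d^2) + FFN (2 * ffn_mult * d^2) per layer.
    params = layers * (4 + 2 * ffn_mult) * d_model ** 2
    weight_bytes = params * weight_bits // 8
    # KV cache: keys and values for every layer, token, and channel.
    kv_bytes = 2 * layers * batch * seq_len * d_model * kv_bits // 8
    # Compute time: ~2 FLOPs per parameter per token (assumed estimate).
    compute_ms = 2 * params / (peak_tflops * 1e12) * 1e3
    # Memory time: fraction of weights streamed per token over bandwidth.
    memory_ms = weight_bytes * reuse / (bw_gb_s * 1e9) * 1e3
    return {
        "params_M": params / 1e6,
        "weights_MB": weight_bytes / 1e6,
        "kv_cache_MB": kv_bytes / 1e6,
        "latency_ms": max(compute_ms, memory_ms),
        "bottleneck": "memory" if memory_ms > compute_ms else "compute",
    }

print(snapshot())
```

At the defaults, memory time (0.067 ms) exceeds compute time by roughly 20x, which is why the bottleneck reads "memory" and why quantization (fewer weight bits to stream) moves the needle more than extra TFLOPs.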
Latency per token vs sequence length
Latency follows the slower of compute and memory.
Attention scaling intuition
Full attention cost grows quadratically with context length, so long contexts get expensive fast. With a KV cache, each decoded token only does linear work against the cached keys and values.
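The quadratic-vs-linear gap can be sketched by counting only the score and value matmuls (weight projections excluded; the 4·n·d per-query figure is an assumption of this sketch: 2·n·d for Q·Kᵀ plus 2·n·d for scores·V):

```python
# Attention-cost sketch: full (prefill) attention vs cached decode.

def full_attention_flops(layers, n, d):
    # Every one of n query tokens attends to all n keys: ~4*n*d each.
    return layers * 4 * n * n * d

def cached_decode_flops(layers, n, d):
    # With a KV cache, one new token attends to n cached keys.
    return layers * 4 * n * d

layers, n, d = 24, 2048, 1024
full = full_attention_flops(layers, n, d)
step = cached_decode_flops(layers, n, d)
print(full // step)  # ratio is exactly n: each decode step is 2048x cheaper
```

The ratio is just the sequence length n, which is why the decode curve on the chart stays "gentle" while the full-attention curve bends upward.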
KV cache memory vs sequence length
KV cache grows linearly with sequence length, layers, batch, and d model.
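That linearity is easy to see in the formula itself, since every factor the caption lists enters once (a sketch; assumes 16-bit K and V entries, one d_model-wide row per token per layer):

```python
# KV cache size: 2 tensors (K and V) per layer, one row per token.

def kv_cache_bytes(layers, batch, seq_len, d_model, kv_bits=16):
    return 2 * layers * batch * seq_len * d_model * kv_bits // 8

per_token = kv_cache_bytes(24, 1, 1, 1024)   # bytes added per decoded token
total = kv_cache_bytes(24, 1, 2048, 1024)    # full cache at seq_len 2048
print(per_token, total)  # 98304 bytes (~96 KiB) per token; ~201.33 MB total
```

Doubling any one factor (layers, batch, sequence length, or d_model) doubles the cache, which is why the chart is a straight line in sequence length.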