Transformer Scaling Explainer

Move the sliders and watch how latency, attention cost, and KV cache memory scale with sequence length.
Controls
Model: Layers 24, Heads 16, d_model 1024, FFN mult 4, Batch 1, Sequence length 2048
Precision and KV cache: Weight precision 16 bits, KV precision 16 bits
Hardware: Peak TFLOPs 200, Bandwidth 900 GB/s, Weight reuse factor 0.1
Optional: Vocab size 0
Snapshot at current sequence length
Params: 301.99 M
Weights: 603.98 MB
KV cache: 201.33 MB
FLOPs per token: 402.65 M
Latency per token: 0.067 ms
Bottleneck: memory
If memory dominates, quantization and caching help. If compute dominates, you need more math throughput or a smaller model.
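The snapshot above can be reproduced with a small back-of-envelope calculator. This is a sketch under stated assumptions: a standard decoder block with 4·d² attention weights and 2·mult·d² FFN weights per layer, embeddings ignored (vocab size 0), and the common ~2 FLOPs per parameter per token estimate for compute (the exact FLOP count depends on what you count as a FLOP, so it may differ from the panel's figure):

```python
# Back-of-envelope transformer scaling calculator (a sketch, not the page's
# exact code). Matches the snapshot values for params, weights, KV cache,
# and memory-bound latency at the default slider settings.

def snapshot(layers=24, d_model=1024, ffn_mult=4, batch=1, seq_len=2048,
             weight_bits=16, kv_bits=16, peak_tflops=200, bw_gb_s=900,
             reuse=0.1):
    # Parameters: attention (4 d^2) + FFN (2 * ffn_mult * d^2) per layer.
    params = layers * (4 + 2 * ffn_mult) * d_model ** 2
    weight_bytes = params * weight_bits // 8
    # KV cache: keys and values for every layer, token, and channel.
    kv_bytes = 2 * layers * batch * seq_len * d_model * kv_bits // 8
    # Compute time: ~2 FLOPs per parameter per token (assumed estimate).
    compute_ms = 2 * params / (peak_tflops * 1e12) * 1e3
    # Memory time: fraction of weights streamed per token over bandwidth.
    memory_ms = weight_bytes * reuse / (bw_gb_s * 1e9) * 1e3
    return {
        "params_M": params / 1e6,
        "weights_MB": weight_bytes / 1e6,
        "kv_cache_MB": kv_bytes / 1e6,
        "latency_ms": max(compute_ms, memory_ms),
        "bottleneck": "memory" if memory_ms > compute_ms else "compute",
    }

print(snapshot())
```

At the defaults, memory time (0.067 ms) exceeds compute time by roughly 20x, which is why the bottleneck reads "memory" and why quantization (fewer weight bits to stream) moves the needle more than extra TFLOPs.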
Latency per token vs sequence length
Latency follows the slower of compute and memory.
Attention scaling intuition
Full attention cost grows quadratically with context length, so long contexts get expensive fast. With a KV cache, each decoded token only does linear work against the cached keys and values.
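The quadratic-vs-linear gap can be sketched by counting only the score and value matmuls (weight projections excluded; the 4·n·d per-query figure is an assumption of this sketch: 2·n·d for Q·Kᵀ plus 2·n·d for scores·V):

```python
# Attention-cost sketch: full (prefill) attention vs cached decode.

def full_attention_flops(layers, n, d):
    # Every one of n query tokens attends to all n keys: ~4*n*d each.
    return layers * 4 * n * n * d

def cached_decode_flops(layers, n, d):
    # With a KV cache, one new token attends to n cached keys.
    return layers * 4 * n * d

layers, n, d = 24, 2048, 1024
full = full_attention_flops(layers, n, d)
step = cached_decode_flops(layers, n, d)
print(full // step)  # ratio is exactly n: each decode step is 2048x cheaper
```

The ratio is just the sequence length n, which is why the decode curve on the chart stays "gentle" while the full-attention curve bends upward.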
KV cache memory vs sequence length
KV cache grows linearly with sequence length, layers, batch, and d model.
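That linearity is easy to see in the formula itself, since every factor the caption lists enters once (a sketch; assumes 16-bit K and V entries, one d_model-wide row per token per layer):

```python
# KV cache size: 2 tensors (K and V) per layer, one row per token.

def kv_cache_bytes(layers, batch, seq_len, d_model, kv_bits=16):
    return 2 * layers * batch * seq_len * d_model * kv_bits // 8

per_token = kv_cache_bytes(24, 1, 1, 1024)   # bytes added per decoded token
total = kv_cache_bytes(24, 1, 2048, 1024)    # full cache at seq_len 2048
print(per_token, total)  # 98304 bytes (~96 KiB) per token; ~201.33 MB total
```

Doubling any one factor (layers, batch, sequence length, or d_model) doubles the cache, which is why the chart is a straight line in sequence length.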