Latency per token vs sequence length
Per-token latency is set by the slower of two limits: compute time (FLOPs divided by peak throughput) and memory time (bytes moved divided by memory bandwidth).
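A minimal roofline-style sketch of that rule; the hardware numbers (peak FLOPs, bandwidth) and the 7B-parameter decode example are illustrative assumptions, not measurements.

```python
# Per-token latency = max(compute-bound time, memory-bound time).
# Default hardware numbers are assumed placeholders (~A100-class).
def token_latency_s(flops_per_token, bytes_per_token,
                    peak_flops=300e12, mem_bw=1.5e12):
    compute_time = flops_per_token / peak_flops  # if compute-bound
    memory_time = bytes_per_token / mem_bw       # if memory-bound
    return max(compute_time, memory_time)        # the slower limit wins

# Decode for an assumed 7B-parameter fp16 model:
# ~2 FLOPs per weight, ~2 bytes read per weight, per token.
params = 7e9
latency = token_latency_s(2 * params, 2 * params)
print(f"{latency * 1e3:.1f} ms/token")  # memory time dominates here
```

With these numbers the memory term is roughly 200x the compute term, which is why single-stream decode is typically memory-bandwidth-bound.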
Attention scaling intuition
Without caching, full attention over a length-n context costs O(n^2) work per forward pass, so long contexts are expensive. With a KV cache, each decode step attends only the new token's query against the cached keys and values, which is O(n) per token.
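A small sketch of that scaling, counting only the query-key score FLOPs (projections and softmax omitted); the sequence length and width are example values.

```python
# Score FLOPs for attention over a length-n context with width d_model.
def full_attention_flops(n, d_model):
    # Every query attends every key: n x n dot products of length d_model.
    return 2 * n * n * d_model

def cached_decode_flops(n, d_model):
    # With a KV cache, only the new token's query attends the n cached keys.
    return 2 * n * d_model

n, d = 4096, 4096
ratio = full_attention_flops(n, d) / cached_decode_flops(n, d)
print(ratio)  # → 4096.0 (the speedup factor equals n)
```

The ratio is exactly n: caching turns quadratic recomputation into linear per-token work.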
KV cache memory vs sequence length
KV cache memory grows linearly in sequence length, layer count, batch size, and d_model: 2 (K and V) x layers x batch x seq_len x d_model x bytes per element.
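That formula as a sketch; the 32-layer, d_model-4096 config is an assumed LLaMA-7B-like example, and fp16 (2 bytes per element) is assumed.

```python
# KV cache bytes: 2 tensors (K and V) per layer, each of shape
# [batch, seq_len, d_model], at the given element precision.
def kv_cache_bytes(seq_len, n_layers, batch, d_model, bytes_per_elem=2):
    return 2 * n_layers * batch * seq_len * d_model * bytes_per_elem

# Assumed example config: 32 layers, d_model 4096, fp16, batch 1.
gib = kv_cache_bytes(seq_len=4096, n_layers=32, batch=1, d_model=4096) / 2**30
print(f"{gib:.1f} GiB")  # → 2.0 GiB at 4k context
```

Doubling any single factor (context length, batch, layers, or width) doubles the cache, which is why long-context, large-batch serving is memory-capacity-bound.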