A Beginner's Guide to Understanding KV Caches in Autoregressive LLMs

Table of contents
  1. KV cache
  2. Notation
  3. KV-cache state during decode
  4. Growth of the KV cache
  5. Prefix caching
  6. KV-cache quantization
  7. Summary
  8. References

1. KV cache

A key-value, or KV, cache is the per-layer storage of attention keys and values computed for tokens that an autoregressive model has already processed. It is used during inference, when the model generates one new token at a time.

At a decoder layer, the current token produces a query, a key, and a value. The query is used only for the current attention calculation. The key and value remain useful because every later token in the same layer may attend to them. The cache therefore retains keys and values, not past queries.

The cache avoids repeated prefix computation. Without it, generating each new token would require recomputing the keys and values of all preceding tokens at every decoder layer. With it, the model computes the key and value only for the newly appended token and reuses the stored prefix states.

A KV cache trades memory for avoided recomputation. It does not eliminate attention over the prefix: the current query still scores against stored keys and aggregates stored values.

1.1 Why the cache remains valid

Causal attention prevents a token from depending on later tokens. Therefore, appending later tokens does not modify the hidden state of an earlier token at a fixed layer. The key and value derived from that earlier hidden state also do not change, so they can be reused exactly during later decode steps.

2. Notation

All symbols are defined before their first use in an equation. The article first describes one sequence, then extends the memory formulas to a batch of active requests.

Symbol                              Definition
$t$                                 Current token position in a sequence.
$u$                                 Source-token position.
$T$                                 Current context length.
$T_i$                               Context length of active sequence $i$, used when request lengths differ.
$\ell$                              Decoder-layer index.
$L$                                 Total number of decoder layers.
$q_{\ell,t}$                        Query vector for token $t$ at layer $\ell$.
$k_{\ell,t}$                        Key vector for token $t$ at layer $\ell$.
$v_{\ell,t}$                        Value vector for token $t$ at layer $\ell$.
$K_{\ell,<t}$, $V_{\ell,<t}$        Cached keys and values at layer $\ell$ for positions $1, \dots, t-1$.
$K_{\ell,\le t}$, $V_{\ell,\le t}$  Cached keys and values at layer $\ell$ after position $t$ is appended.
$o_{\ell,t}$                        Attention output for token $t$ at layer $\ell$.
$B$                                 Number of active sequences or request branches.
$h$                                 Number of query heads.
$h_{kv}$                            Number of key/value heads.
$d$                                 Dimension of one query, key, or value head.
$H_{kv}$                            Total key/value width; $H_{kv} = h_{kv} d$.
$s_K$, $s_V$                        Bytes per stored key or value scalar.
$s$                                 Common bytes per scalar when keys and values use the same representation. For BF16 or FP16, $s = 2$.
$P$                                 Length of a reusable prefix.
$U$                                 Length of a new suffix; $U = T - P$.
$b_K$, $b_V$                        Stored bit widths for a quantized key or value scalar.
$M_{\mathrm{meta}}$                 Memory used by quantization scales, zero points, and other metadata.

For ordinary multi-head attention, $h_{kv} = h$. In grouped-query attention, $1 < h_{kv} < h$; in multi-query attention, $h_{kv} = 1$. KV-cache storage scales with $h_{kv}$, not directly with the number of query heads $h$.

3. KV-cache state during decode

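Before token $t$ is processed at layer $\ell$, the cache contains keys and values for all earlier positions:

$$K_{\ell,<t} = [k_{\ell,1}, \dots, k_{\ell,t-1}], \qquad V_{\ell,<t} = [v_{\ell,1}, \dots, v_{\ell,t-1}]$$

After computing the current key and value, the cache contains positions $1, \dots, t$:

$$K_{\ell,\le t} = [k_{\ell,1}, \dots, k_{\ell,t}], \qquad V_{\ell,\le t} = [v_{\ell,1}, \dots, v_{\ell,t}]$$

The current query is scored against the updated key cache, and the resulting weights aggregate the updated value cache. Treating each cache as a matrix with one row per position, and writing the formula for a single head:

$$o_{\ell,t} = \mathrm{softmax}\!\left(\frac{q_{\ell,t}\, K_{\ell,\le t}^{\top}}{\sqrt{d}}\right) V_{\ell,\le t}$$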

The cache is maintained independently at every decoder layer. A model with L layers has L key caches and L value caches.
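The update and attention steps above can be sketched for a single head of a single layer. This is a minimal illustration in plain Python, not an excerpt from any inference engine: the projection matrices `W_q`, `W_k`, `W_v`, the dictionary-based cache, and all helper names are assumptions made for the example. It compares a cached decode step against a baseline that rebuilds every key and value from scratch.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(x, W):
    """Project vector x (length n) with W, given as n rows of length m."""
    return [dot(x, col) for col in zip(*W)]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Score q against every key, then mix the values with the softmax weights."""
    d = len(q)
    weights = softmax([dot(q, k) / math.sqrt(d) for k in keys])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]

def decode_step(x_t, W_q, W_k, W_v, cache):
    """Compute q, k, v for the new token only, append k and v, attend over the cache."""
    q, k, v = matvec(x_t, W_q), matvec(x_t, W_k), matvec(x_t, W_v)
    cache["K"].append(k)
    cache["V"].append(v)
    return attend(q, cache["K"], cache["V"])

def recompute_step(X, t, W_q, W_k, W_v):
    """Baseline without a cache: rebuild every key and value up to position t."""
    keys = [matvec(x, W_k) for x in X[: t + 1]]
    values = [matvec(x, W_v) for x in X[: t + 1]]
    return attend(matvec(X[t], W_q), keys, values)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d_model, d, T = 6, 4, 5  # toy sizes chosen for the example
W_q, W_k, W_v = (rand_matrix(d_model, d) for _ in range(3))
X = rand_matrix(T, d_model)  # layer inputs for T tokens

cache = {"K": [], "V": []}
cached_outputs = [decode_step(x, W_q, W_k, W_v, cache) for x in X]
baseline_outputs = [recompute_step(X, t, W_q, W_k, W_v) for t in range(T)]
```

Both paths produce the same attention outputs, but the cached path projects each token to a key and value once, while the baseline repeats that projection for the whole prefix at every step. The attention sum over the prefix is still performed in both cases.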

4. Growth of the KV cache

4.1 Per-layer cache size

For one layer and one sequence of length $T$, each key tensor and each value tensor has shape $h_{kv} \times T \times d$. The combined key/value storage at that layer is:

$$M_{\mathrm{KV,layer}} = T\, h_{kv}\, d\, (s_K + s_V)$$

When keys and values use the same precision, $s_K = s_V = s$:

$$M_{\mathrm{KV,layer}} = 2\, T\, h_{kv}\, d\, s = 2\, T\, H_{kv}\, s$$

4.2 Total cache size

For $B$ active sequences, each with context length $T$, the total KV-cache memory across $L$ decoder layers is:

$$M_{\mathrm{KV,total}} = L\, B\, T\, h_{kv}\, d\, (s_K + s_V)$$

With equal key/value precision:

$$M_{\mathrm{KV,total}} = 2\, L\, B\, T\, h_{kv}\, d\, s = 2\, L\, B\, T\, H_{kv}\, s$$

For requests with unequal context lengths, replace $B T$ with $\sum_{i=1}^{B} T_i$:

$$M_{\mathrm{KV,total}} = L\, h_{kv}\, d\, (s_K + s_V) \sum_{i=1}^{B} T_i$$

4.3 Incremental growth

Appending one token to one active sequence adds:

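$$\Delta M_{\mathrm{one\ token}} = L\, h_{kv}\, d\, (s_K + s_V)$$

When keys and values share a precision $s$, this becomes:

$$\Delta M_{\mathrm{one\ token}} = 2\, L\, h_{kv}\, d\, s$$

For BF16 or FP16 storage, $s = 2$, so each appended token adds $4\, L\, h_{kv}\, d$ bytes.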

4.4 Example

Consider $L = 32$ layers, $h_{kv} = 8$, $d = 128$, $B = 1$, $T = 8192$, and BF16 storage, so $s = 2$ bytes. The KV cache is:

$$M_{\mathrm{KV,total}} = 2 \times 32 \times 1 \times 8192 \times 8 \times 128 \times 2 = 1{,}073{,}741{,}824 \text{ bytes} = 1 \text{ GiB}$$

If the same model used $h_{kv} = 32$ rather than $h_{kv} = 8$, the cache would require 4 GiB under the same conditions. This is one reason grouped-query attention is valuable in inference.
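The formulas in this section are easy to check with a few lines of arithmetic. The sketch below is illustrative only: the function name `kv_cache_bytes` and its parameter names are made up for this example. It takes a list of per-request context lengths, so it covers both the equal-length and the unequal-length cases.

```python
def kv_cache_bytes(num_layers, context_lengths, num_kv_heads, head_dim,
                   key_bytes=2, value_bytes=2):
    """Total KV-cache bytes: L * h_kv * d * (s_K + s_V) * sum of context lengths.

    context_lengths holds one entry per active sequence; the defaults of
    2 bytes per scalar correspond to BF16 or FP16 storage.
    """
    total_tokens = sum(context_lengths)
    return num_layers * total_tokens * num_kv_heads * head_dim * (key_bytes + value_bytes)

GIB = 1024 ** 3

gqa_bytes = kv_cache_bytes(32, [8192], num_kv_heads=8, head_dim=128)    # 1 GiB
mha_bytes = kv_cache_bytes(32, [8192], num_kv_heads=32, head_dim=128)   # 4 GiB
per_token_bytes = kv_cache_bytes(32, [1], num_kv_heads=8, head_dim=128) # 128 KiB
mixed_bytes = kv_cache_bytes(32, [8192, 2048, 512], num_kv_heads=8, head_dim=128)

print(gqa_bytes / GIB, mha_bytes / GIB, per_token_bytes / 1024, mixed_bytes / GIB)
```

The per-token figure is the useful one for capacity planning: with this configuration, every additional token in any active request costs 128 KiB of cache, regardless of which request it belongs to.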

4.5 Decode bandwidth

During decode, the KV cache has a resident HBM footprint and incurs per-token HBM read traffic. A lower-bound payload estimate for one decode step across all layers is:

$$R_{\mathrm{KV,decode}} \approx L\, B\, T\, h_{kv}\, d\, (s_K + s_V)$$

This is a payload estimate, not a hardware-independent prediction of exact HBM traffic. Actual traffic depends on GQA reuse, tiling, GPU SRAM capacity, cache layout, and the attention kernel. Nevertheless, it explains why small-batch decode is frequently constrained by GPU HBM bandwidth rather than floating-point throughput.
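To see why bandwidth matters, the payload estimate can be divided by a memory bandwidth to get a floor on the time spent reading the cache in one decode step. The sketch below is a rough illustration under stated assumptions: the 2 TB/s figure is a hypothetical round number, not a specification of any particular GPU, and the function names are invented for the example. It ignores model-weight reads and every other source of traffic.

```python
def decode_read_bytes(num_layers, batch, context_len, num_kv_heads, head_dim,
                      key_bytes=2, value_bytes=2):
    """Payload estimate for one decode step: every cached key and value is read once."""
    return num_layers * batch * context_len * num_kv_heads * head_dim * (key_bytes + value_bytes)

def min_step_seconds(read_bytes, hbm_bytes_per_second):
    """Time needed just to stream read_bytes at the given bandwidth."""
    return read_bytes / hbm_bytes_per_second

ASSUMED_HBM_BANDWIDTH = 2e12  # hypothetical 2 TB/s, chosen only for illustration

payload = decode_read_bytes(32, 1, 8192, num_kv_heads=8, head_dim=128)
floor_seconds = min_step_seconds(payload, ASSUMED_HBM_BANDWIDTH)
print(f"{payload} bytes per step, at least {floor_seconds * 1e3:.3f} ms")
```

Under these assumptions, reading the 1 GiB cache takes about 0.54 ms per generated token. The floor grows linearly with context length and batch size, while the arithmetic per byte read stays small, which is the pattern that makes decode bandwidth-sensitive.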

5. Prefix caching

Prefix caching reuses KV-cache states across requests that start with the same compatible token prefix. It targets redundant prefill; it does not alter ordinary within-request decode reuse.

Suppose a request has a reusable prefix of length P and a new suffix of length U, where T = P + U. If a cache entry exists for the prefix, the serving system reuses its KV blocks and computes the sequence state only for the suffix.

5.1 What work is saved

Without a cache hit, the leading exact-attention prefill term is $O(B h T^2 d)$. With a reusable prefix, only the $U$ suffix query rows must be computed. They still score against all $T$ cached and new keys, giving a leading suffix-attention term:

$$O(B h U T d)$$

Prefix reuse therefore skips prefill computation for the shared prefix. It does not remove the prefix from the attention context; suffix queries still attend to cached prefix keys and values.
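Dividing the two leading terms gives a rough sense of the attention work that remains after a cache hit. The numbers below are an illustrative example chosen for this guide, and the ratio covers only the leading attention term; projections and feed-forward layers are not included in it.

```latex
\frac{B\,h\,U\,T\,d}{B\,h\,T^{2}\,d} = \frac{U}{T} = 1 - \frac{P}{T}

% Illustrative values: T = 8192, P = 7168, U = T - P = 1024
\frac{U}{T} = \frac{1024}{8192} = \frac{1}{8}
```

With seven eighths of the context served from the cache, the leading attention term for prefill drops to one eighth of its uncached value.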

5.2 Cache-key compatibility

A prefix-cache hit is valid only when the cached KV states are exactly the states required by the new request. For example, two requests can share a cache entry when they use the same model and begin with the same token sequence. They cannot safely share it when the prefix differs after tokenization, the model or adapter differs, or a setting changes how keys and values are computed. Serving systems therefore store a cache key with each prefix block. The key identifies the computation that produced the block, such as the token IDs, model version, adapter, and relevant positional-encoding or multimodal settings.
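One way to picture a cache key is as an immutable record that holds everything the cached states depend on. The sketch below is a simplified illustration, not the scheme of any particular serving system: the class `PrefixCacheKey`, its field names, and the `store` and `lookup` helpers are all hypothetical, and a real engine would typically hash these fields rather than store whole token sequences.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class PrefixCacheKey:
    """Hypothetical key: any field that changes K/V computation must be part of it."""
    token_ids: Tuple[int, ...]
    model_version: str
    adapter: Optional[str] = None
    positional_setting: Optional[str] = None

prefix_cache = {}  # maps PrefixCacheKey to an opaque handle for stored KV blocks

def store(key, kv_handle):
    prefix_cache[key] = kv_handle

def lookup(key):
    """Return the stored handle on an exact match, otherwise None."""
    return prefix_cache.get(key)

system_prompt = (101, 7, 7, 42)
store(PrefixCacheKey(system_prompt, "model-v1"), "kv-blocks-0")

same_request = lookup(PrefixCacheKey(system_prompt, "model-v1"))
other_adapter = lookup(PrefixCacheKey(system_prompt, "model-v1", adapter="adapter-a"))
other_tokens = lookup(PrefixCacheKey((101, 7, 8, 42), "model-v1"))
```

Only the first lookup hits. Changing the adapter or a single token ID produces a different key, so the cache cannot return states that were computed under different conditions.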

5.3 Block-based sharing

Serving engines commonly retain cache entries in fixed-size token blocks. A matching request can reference immutable prefix blocks instead of recomputing or copying them. Block sharing reduces redundant prefill computation and, while the blocks remain resident, avoids duplicate physical storage. A partially filled trailing block may not be reusable, depending on the engine's block policy.
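Block-based sharing can be sketched with chained hashes, where each full block is hashed together with the hash of the block before it, so that a block matches only if its entire preceding prefix matches too. This is a minimal sketch under assumptions, not the implementation of any specific engine: the block size of 4, the `context_tag` string standing in for model and adapter identity, and the function names are all hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical and deliberately tiny, for demonstration only

def block_hashes(token_ids, context_tag, block_size=BLOCK_SIZE):
    """Hash each full block together with the hash of the block before it."""
    hashes = []
    parent = context_tag.encode()
    full_length = len(token_ids) - len(token_ids) % block_size  # drop the partial tail
    for start in range(0, full_length, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes

def count_reusable_tokens(token_ids, context_tag, resident_hashes):
    """Count leading tokens covered by resident blocks; stop at the first miss."""
    matched_blocks = 0
    for block_hash in block_hashes(token_ids, context_tag):
        if block_hash not in resident_hashes:
            break
        matched_blocks += 1
    return matched_blocks * BLOCK_SIZE

# A 10-token request leaves two full blocks resident; its 2-token tail is not stored.
resident = set(block_hashes([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], "model-v1"))

full_hit = count_reusable_tokens([1, 2, 3, 4, 5, 6, 7, 8, 42, 43, 44, 45], "model-v1", resident)
partial_hit = count_reusable_tokens([1, 2, 3, 4, 9, 9, 9, 9], "model-v1", resident)
other_model = count_reusable_tokens([1, 2, 3, 4, 5, 6, 7, 8], "model-v2", resident)
```

Here `full_hit` is 8, `partial_hit` is 4, and `other_model` is 0. The trailing two tokens of the original request were never stored because they did not fill a block, which is the partial-block case described above.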

Prefix caching is not free memory reduction. Retained blocks consume cache capacity. Its net value depends on shared-prefix length, hit rate, request locality, and eviction policy.

6. KV-cache quantization

KV-cache quantization stores keys and values at fewer bits than BF16 or FP16. It reduces the cache's memory footprint and can reduce decode bandwidth when the attention kernel consumes the low-precision cache directly.

6.1 Quantization model

For a real-valued cache element $x$, an affine quantizer stores an integer $q$ together with a scale $a$ and zero point $z$. In this section $q$ denotes the quantized code, not a query vector. The scalar model is:

$$q = \mathrm{clip}\!\left(\mathrm{round}(x / a) + z,\; q_{\min},\; q_{\max}\right)$$

$$\hat{x} = a\,(q - z)$$

Here $\hat{x}$ denotes the reconstructed value. The scale and zero point can be assigned per tensor, per layer, per head, per channel, per token, or per group. Finer granularity generally reduces error, but increases metadata and may complicate kernels.
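The scalar model can be tried on a small group of values. The sketch below is one simple way to choose the scale and zero point, fitting them to the minimum and maximum of the group; that min/max rule, the unsigned code range, and the function names are assumptions made for illustration, and published methods use more careful choices.

```python
def fit_affine(values, bits):
    """Pick a scale and zero point so that [min, max] of the group maps onto [qmin, qmax]."""
    qmin, qmax = 0, 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for a constant group
    zero_point = round(qmin - lo / scale)
    return scale, zero_point, qmin, qmax

def quantize(x, scale, zero_point, qmin, qmax):
    """q = clip(round(x / a) + z, qmin, qmax)"""
    return max(qmin, min(qmax, round(x / scale) + zero_point))

def dequantize(q, scale, zero_point):
    """x_hat = a * (q - z)"""
    return scale * (q - zero_point)

# One quantization group; in practice this could be a channel, a token, or a block.
group = [-1.0, -0.25, 0.0, 0.4, 2.0]
scale, zero_point, qmin, qmax = fit_affine(group, bits=4)

codes = [quantize(x, scale, zero_point, qmin, qmax) for x in group]
reconstructed = [dequantize(q, scale, zero_point) for q in codes]
max_error = max(abs(x - r) for x, r in zip(group, reconstructed))
```

For values inside the fitted range, the reconstruction error stays within half a quantization step, `scale / 2`. A single outlier in the group widens the range, enlarges the step, and raises the error for every other value, which is one reason granularity matters.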

6.2 Quantized memory

If keys use $b_K$ bits and values use $b_V$ bits, the low-precision payload in bytes is:

$$M_{\mathrm{KV,payload}} = \frac{L\, B\, T\, h_{kv}\, d\, (b_K + b_V)}{8}$$

The total cache footprint includes quantization metadata:

$$M_{\mathrm{KV,quant}} = M_{\mathrm{KV,payload}} + M_{\mathrm{meta}}$$

When both keys and values use the same bit width $b$, the payload-only compression factor relative to BF16 is:

$$\mathrm{Compression}_{\mathrm{payload}} = \frac{16}{b}$$

Stored K/V format            Payload bytes per key/value scalar pair   Payload-only reduction versus BF16
BF16 or FP16, 16 bits each   4 bytes                                   1×
FP8 or INT8, 8 bits each     2 bytes                                   2×
INT4, 4 bits each            1 byte                                    4×
INT2, 2 bits each            0.5 byte                                  8×
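Metadata makes the effective compression smaller than the payload-only figure. The sketch below estimates both under assumptions that are purely illustrative: one scale and one zero point per group of 64 scalars, each stored in 2 bytes, applied to the model configuration from Section 4.4. The function and parameter names are hypothetical.

```python
def quantized_kv_bytes(num_layers, batch, context_len, num_kv_heads, head_dim,
                       key_bits, value_bits, group_size, meta_bytes_per_group):
    """Return (payload_bytes, metadata_bytes) for a group-quantized KV cache."""
    scalars_per_tensor = num_layers * batch * context_len * num_kv_heads * head_dim
    payload = scalars_per_tensor * (key_bits + value_bits) // 8
    groups = 2 * scalars_per_tensor // group_size  # keys and values both carry metadata
    return payload, groups * meta_bytes_per_group

MIB = 1024 ** 2
bf16_bytes = 2 * 32 * 1 * 8192 * 8 * 128 * 2  # 1 GiB baseline from Section 4.4

payload, meta = quantized_kv_bytes(
    32, 1, 8192, num_kv_heads=8, head_dim=128,
    key_bits=4, value_bits=4,
    group_size=64,           # assumed group size
    meta_bytes_per_group=4,  # assumed: 2-byte scale plus 2-byte zero point
)
payload_only_factor = bf16_bytes / payload
effective_factor = bf16_bytes / (payload + meta)
print(payload / MIB, meta / MIB, payload_only_factor, effective_factor)
```

With these assumptions, the INT4 payload is 256 MiB and the metadata adds 32 MiB, so the effective reduction is about 3.56× rather than the payload-only 4×. Smaller groups push the effective factor down further.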

6.3 Accuracy and kernel effects

Quantization does not affect all cache components identically. The distributions and functional roles of keys and values differ, so a practical design may use different bit widths or grouping rules for them. KIVI and KVQuant study low-bit KV-cache quantization with asymmetric or distribution-aware design choices [3], [4].

A smaller cache does not guarantee proportional speedup. The kernel must load quantized payloads, access scales and other metadata, and perform reconstruction or low-precision computation. Performance improves when this work is fused into the attention kernel and the reduction in GPU HBM traffic exceeds the added arithmetic and metadata cost.

Condition                                                                                Expected effect
Decode is GPU-HBM-bandwidth-bound and the kernel directly consumes quantized KV blocks   Reduced cache traffic can improve throughput or concurrency.
The cache is expanded to full precision in GPU HBM before attention                      Most of the intended bandwidth saving can be lost.
Quantization groups are very small                                                       Accuracy may improve, but metadata and kernel overhead can increase.
Bit width is too low for the model and group scheme                                      Quantization error can reduce model quality.

6.4 Quantization and prefix caching

Prefix caching and KV-cache quantization are complementary. Prefix caching reduces the number of prefix states that must be computed. Quantization reduces the bytes required to retain each state. A serving engine can cache and share quantized blocks when their model, format, and quantization parameters are compatible.

7. Summary

A KV cache stores the keys and values of already processed tokens at every decoder layer. It avoids recomputing those keys and values during autoregressive decode, but grows linearly with layer count, active sequences, context length, key/value heads, head dimension, and bytes per scalar:

$$M_{\mathrm{KV,total}} = L\, B\, T\, h_{kv}\, d\, (s_K + s_V)$$

Prefix caching reuses KV states for an exact compatible prefix. It reduces redundant prefill work, but the suffix still attends to the cached prefix. KV-cache quantization reduces stored bytes per key/value element:

$$M_{\mathrm{KV,quant}} = \frac{L\, B\, T\, h_{kv}\, d\, (b_K + b_V)}{8} + M_{\mathrm{meta}}$$

The practical design problem is to balance cache capacity, GPU-HBM bandwidth, prefix-cache hit rate, quantization error, metadata overhead, and the efficiency of the attention kernel.

8. References

  1. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, pp. 5998–6008, 2017. Available: NeurIPS proceedings.
  2. W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with PagedAttention,” in Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, pp. 611–626, 2023, doi: 10.1145/3600006.3613165.
  3. Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu, “KIVI: A tuning-free asymmetric 2bit quantization for KV cache,” in Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 235, pp. 32332–32344, 2024. Available: PMLR.
  4. C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y. S. Shao, K. Keutzer, and A. Gholami, “KVQuant: Towards 10 million context length LLM inference with KV cache quantization,” in Advances in Neural Information Processing Systems, vol. 37, 2024. Available: NeurIPS proceedings.
  5. vLLM Project, “Automatic Prefix Caching,” vLLM Documentation, accessed Jun. 20, 2026. Available: vLLM documentation.