A Beginner's Guide to Understanding KV Caches in Autoregressive LLMs

Table of contents

1. KV cache
2. Notation
3. KV-cache state during decode
4. Growth of the KV cache
5. Prefix caching
6. KV-cache quantization
7. Summary
References

1. KV cache

A key-value, or KV, cache is the per-layer storage of attention keys and values computed for tokens that an autoregressive model has already processed. It is used during inference, when the model generates one new token at a time.

At a decoder layer, the current token produces a query, a key, and a value. The query is used only for the current attention calculation. The key and value remain useful because every later token in the same layer may attend to them. The cache therefore retains keys and values, not past queries.

The cache avoids repeated prefix computation. Without it, generating each new token would require recomputing the keys and values of all preceding tokens at every decoder layer. With it, the model computes the key and value only for the newly appended token and reuses the stored prefix states.

A KV cache trades memory for avoided recomputation. It does not eliminate attention over the prefix: the current query still scores against stored keys and aggregates stored values.

1.1 Why the cache remains valid

Causal attention prevents a token from depending on later tokens. Therefore, appending later tokens does not modify the hidden state of an earlier token at a fixed layer. The key and value derived from that earlier hidden state also do not change, so they can be reused exactly during later decode steps.

2. Notation

All symbols are defined before their first use in an equation. The article first describes one sequence, then extends the memory formulas to a batch of active requests.

Symbol	Definition
t	Current token position in a sequence.
u	Source-token position.
T	Current context length.
T_i	Context length of active sequence i, used when request lengths differ.
ℓ	Decoder-layer index.
L	Total number of decoder layers.
q_ℓ,t	Query vector for token t at layer ℓ.
k_ℓ,t	Key vector for token t at layer ℓ.
v_ℓ,t	Value vector for token t at layer ℓ.
K_ℓ,<t, V_ℓ,<t	Cached keys and values at layer ℓ for positions 1,…,t−1.
K_ℓ,≤t, V_ℓ,≤t	Cached keys and values at layer ℓ after position t is appended.
o_ℓ,t	Attention output for token t at layer ℓ.
B	Number of active sequences or request branches.
h	Number of query heads.
h_kv	Number of key/value heads.
d	Dimension of one query, key, or value head.
H_kv	Total key/value width; H_kv = h_kv · d.
s_K, s_V	Bytes per stored key or value scalar.
s	Common bytes per scalar when keys and values use the same representation. For BF16 or FP16, s = 2.
P	Length of a reusable prefix.
U	Length of a new suffix; U = T − P.
b_K, b_V	Stored bit widths for a quantized key or value scalar.
M_meta	Memory used by quantization scales, zero points, and other metadata.

For ordinary multi-head attention, h_kv = h. In grouped-query attention, 1 < h_kv < h; in multi-query attention, h_kv = 1. KV-cache storage scales with h_kv, not directly with the number of query heads h.
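The relationship between h and h_kv can be illustrated with a small sketch. It assumes contiguous grouping, in which query heads are divided into h_kv equal groups and each group reads one key/value head; real models may lay out heads differently, and the function name kv_head_for_query_head is hypothetical.

```python
def kv_head_for_query_head(query_head: int, h: int, h_kv: int) -> int:
    """Map a query-head index to the key/value head it reads from.

    Assumption: contiguous grouping, i.e. query heads are split into
    h_kv equal groups of h // h_kv heads that share one KV head.
    """
    if h % h_kv != 0:
        raise ValueError("h must be divisible by h_kv")
    group_size = h // h_kv
    return query_head // group_size


h = 8
for name, h_kv in [("MHA", 8), ("GQA", 2), ("MQA", 1)]:
    mapping = [kv_head_for_query_head(j, h, h_kv) for j in range(h)]
    print(f"{name}: h_kv={h_kv}, query head -> KV head {mapping}")
```

Only the distinct key/value heads are stored, which is why the cache scales with h_kv.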

3. KV-cache state during decode

Before token t is processed at layer ℓ, the cache contains keys and values for all earlier positions:

K_ℓ,<t = [k_ℓ,1, …, k_ℓ,t−1]^⊤
V_ℓ,<t = [v_ℓ,1, …, v_ℓ,t−1]^⊤

After computing the current key and value, the cache contains positions 1,…,t:

K_ℓ,≤t = [k_ℓ,1, …, k_ℓ,t]^⊤
V_ℓ,≤t = [v_ℓ,1, …, v_ℓ,t]^⊤

The current query is scored against the updated key cache, and the resulting weights aggregate the updated value cache:

o_ℓ,t = softmax(q_ℓ,t K_ℓ,≤t^⊤ / √d) V_ℓ,≤t

The cache is maintained independently at every decoder layer. A model with L layers has L key caches and L value caches.
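The append-then-attend step can be sketched for one layer, one head, and one sequence in plain Python. This is a toy illustration, not a production kernel: the class LayerKVCache, the function attend, and the hand-written query, key, and value vectors are all assumptions made for the example. A real layer would obtain them by projecting hidden states.

```python
import math


def attend(q, keys, values):
    """Compute softmax(q K^T / sqrt(d)) V for one query and one head."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    width = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]


class LayerKVCache:
    """Keys and values for one layer, one head, one sequence."""

    def __init__(self):
        self.keys = []
        self.values = []

    def decode_step(self, q, k, v):
        # Append the current token's key and value, then attend over 1..t.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)


# Toy (q, k, v) triples, one per token position.
tokens = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
    ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0]),
]

cache = LayerKVCache()
cached_outputs = [cache.decode_step(q, k, v) for q, k, v in tokens]

# Reference: rebuild the full key and value lists at every position.
recomputed = [
    attend(
        tokens[t][0],
        [k for _, k, _ in tokens[: t + 1]],
        [v for _, _, v in tokens[: t + 1]],
    )
    for t in range(len(tokens))
]
print(cached_outputs == recomputed)
```

The cached path and the rebuild-everything path produce the same outputs; the cached path simply never regenerates earlier keys and values.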

4. Growth of the KV cache

4.1 Per-layer cache size

For one layer and one sequence of length T, each key tensor and each value tensor has shape h_kv × T × d. The combined key/value storage at that layer is:

M_KV,layer = T h_kv d (s_K + s_V)

When keys and values use the same precision, s_K = s_V = s:

M_KV,layer = 2T h_kv d s = 2T H_kv s

4.2 Total cache size

For B active sequences, each with context length T, the total KV-cache memory across L decoder layers is:

M_KV,total = L B T h_kv d (s_K + s_V)

With equal key/value precision:

M_KV,total = 2 L B T h_kv d s = 2 L B T H_kv s

For requests with unequal context lengths, replace BT with Σ_{i=1}^{B} T_i:

M_KV,total = L h_kv d (s_K + s_V) Σ_{i=1}^{B} T_i
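As a minimal sketch, the unequal-length formula translates directly into a function that sums the per-request context lengths. The function name kv_cache_bytes and the three request lengths are illustrative assumptions; the defaults s_k = s_v = 2 correspond to BF16 or FP16 storage.

```python
def kv_cache_bytes(num_layers, context_lengths, h_kv, d, s_k=2, s_v=2):
    """Total KV-cache bytes: L * h_kv * d * (s_K + s_V) * sum_i T_i."""
    return num_layers * h_kv * d * (s_k + s_v) * sum(context_lengths)


# Three hypothetical active requests with different context lengths.
lengths = [512, 2048, 8192]
total = kv_cache_bytes(32, lengths, h_kv=8, d=128)
print(f"{total} bytes = {total / 2**20:.0f} MiB")
```

Passing a list of B equal lengths reproduces the BT form of the formula.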

4.3 Incremental growth

Appending one token to one active sequence adds:

ΔM_{one token} = L h_kv d (s_K + s_V)

For BF16 or FP16 keys and values, s_K = s_V = s = 2 bytes, so this is:

ΔM_{one token} = 2L h_kv d s = 4L h_kv d bytes
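The per-token increment is useful for rough capacity planning. The sketch below computes it and then asks how many cached tokens fit in a memory budget. The 8 GiB budget and the function names are hypothetical, and the estimate ignores weights, activations, and allocator overhead.

```python
def bytes_per_token(num_layers, h_kv, d, s_k=2, s_v=2):
    """KV-cache bytes added when one token is appended to one sequence."""
    return num_layers * h_kv * d * (s_k + s_v)


def max_cached_tokens(budget_bytes, num_layers, h_kv, d, s_k=2, s_v=2):
    """Whole tokens that fit in the budget, summed over all sequences."""
    return budget_bytes // bytes_per_token(num_layers, h_kv, d, s_k, s_v)


per_token = bytes_per_token(32, h_kv=8, d=128)
budget = 8 * 2**30  # hypothetical 8 GiB reserved for the KV cache
capacity = max_cached_tokens(budget, 32, h_kv=8, d=128)
print(f"{per_token // 1024} KiB per token, {capacity} tokens in budget")
```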

4.4 Example

Consider L = 32 layers, h_kv = 8, d = 128, B = 1, T = 8192, and BF16 storage, so s = 2 bytes. The KV cache is:

M_KV,total = 2 × 32 × 1 × 8192 × 8 × 128 × 2 = 1,073,741,824 bytes = 1 GiB

If the same model used h_kv = 32 rather than h_kv = 8, the cache would require 4 GiB under the same conditions. This is one reason grouped-query attention is valuable in inference.
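The arithmetic in this example can be checked with a few lines of Python. The function name kv_cache_gib is illustrative, and the h_kv = 1 row is an added multi-query data point that the example does not state.

```python
def kv_cache_gib(num_layers, batch, context_len, h_kv, d, bytes_per_scalar):
    """Evaluate 2 * L * B * T * h_kv * d * s and convert to GiB."""
    total_bytes = (
        2 * num_layers * batch * context_len * h_kv * d * bytes_per_scalar
    )
    return total_bytes / 2**30


sizes = {}
for h_kv in (32, 8, 1):
    sizes[h_kv] = kv_cache_gib(32, 1, 8192, h_kv, 128, 2)
    print(f"h_kv={h_kv:2d}: {sizes[h_kv]:.3f} GiB")
```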

4.5 Decode bandwidth

During decode, the KV cache has a resident HBM footprint and incurs per-token HBM read traffic. A lower-bound payload estimate for one decode step across all layers is:

R_KV,decode ≈ L B T h_kv d (s_K + s_V)

This is a payload estimate, not a hardware-independent prediction of exact HBM traffic. Actual traffic depends on GQA reuse, tiling, GPU SRAM capacity, cache layout, and the attention kernel. Nevertheless, it explains why small-batch decode is frequently constrained by GPU HBM bandwidth rather than floating-point throughput.
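Dividing the payload estimate by memory bandwidth gives a rough floor on the time per decode step. The sketch below does this under a loudly hypothetical bandwidth of 2 TB/s. It counts KV-cache reads only, so it ignores weight reads and every other source of traffic; the constant and the function names are assumptions, not measurements.

```python
def decode_read_bytes(num_layers, batch, context_len, h_kv, d, s_k=2, s_v=2):
    """Payload estimate: L * B * T * h_kv * d * (s_K + s_V) bytes per step."""
    return num_layers * batch * context_len * h_kv * d * (s_k + s_v)


def min_step_seconds(read_bytes, hbm_bytes_per_second):
    """Time to stream the payload once at the given bandwidth."""
    return read_bytes / hbm_bytes_per_second


ASSUMED_HBM_BANDWIDTH = 2.0e12  # hypothetical 2 TB/s, not a measured value

payload = decode_read_bytes(32, 1, 8192, h_kv=8, d=128)
step = min_step_seconds(payload, ASSUMED_HBM_BANDWIDTH)
print(f"{payload} bytes per step, at least {step * 1e3:.3f} ms from KV reads")
```

Because the payload grows with B and T while the arithmetic per byte stays small, the floor rises with context length even when compute units are idle.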

5. Prefix caching

Prefix caching reuses KV-cache states across requests that start with the same compatible token prefix. It targets redundant prefill; it does not alter ordinary within-request decode reuse.

Suppose a request has a reusable prefix of length P and a new suffix of length U, where T = P + U. If a cache entry exists for the prefix, the serving system reuses its KV blocks and computes the sequence state only for the suffix.

5.1 What work is saved

Without a cache hit, the leading exact-attention prefill term is O(BhT²d). With a reusable prefix, only suffix query rows must be created. They still score against all T cached and new keys, giving a leading suffix-attention term:

O(BhUTd)

Prefix reuse therefore skips prefill computation for the shared prefix. It does not remove the prefix from the attention context; suffix queries still attend to cached prefix keys and values.
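Comparing the two leading terms shows how much attention work remains after a prefix hit. In this sketch the shared factors B, h, and d cancel, so the ratio reduces to U/T. The lengths used are hypothetical, and the result covers only the leading attention term, not projections or feed-forward layers.

```python
def prefill_attention_ratio(prefix_len, suffix_len):
    """Ratio of the suffix-only term (U * T) to the full term (T * T)."""
    total = prefix_len + suffix_len
    return (suffix_len * total) / (total * total)


# Hypothetical request: 6144 reusable prefix tokens, 2048 new suffix tokens.
ratio = prefill_attention_ratio(prefix_len=6144, suffix_len=2048)
print(f"remaining attention work: {ratio:.2%} of a full prefill")
```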

5.2 Cache-key compatibility

A prefix-cache hit is valid only when the cached KV states are exactly the states required by the new request. For example, two requests can share a cache entry when they use the same model and begin with the same token sequence. They cannot safely share it when the prefix differs after tokenization, the model or adapter differs, or a setting changes how keys and values are computed. Serving systems therefore store a cache key with each prefix block. The key identifies the computation that produced the block, such as the token IDs, model version, adapter, and relevant positional-encoding or multimodal settings.
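One way to picture such a cache key is a hash over everything that determines the stored keys and values. The sketch below is a hypothetical scheme, not the format of any particular serving engine: the function prefix_cache_key, its field names, and the example model and adapter strings are all invented for illustration.

```python
import hashlib
import json


def prefix_cache_key(token_ids, model_version, adapter=None, extra=None):
    """Hash the inputs that determine a prefix's KV states (hypothetical)."""
    payload = {
        "tokens": list(token_ids),
        "model": model_version,
        "adapter": adapter,
        "extra": extra or {},  # e.g. positional-encoding settings
    }
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


base = prefix_cache_key([101, 7, 42], "model-v1")
same = prefix_cache_key([101, 7, 42], "model-v1")
other_adapter = prefix_cache_key([101, 7, 42], "model-v1", adapter="lora-a")
other_tokens = prefix_cache_key([101, 7, 43], "model-v1")

print(base == same, base == other_adapter, base == other_tokens)
```

Identical inputs yield the same key, while a different adapter or a single changed token ID yields a different key and therefore a cache miss.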

5.3 Block-based sharing

Serving engines commonly retain cache entries in fixed-size token blocks. A matching request can reference immutable prefix blocks instead of recomputing or copying them. Block sharing reduces redundant prefill computation and, while the blocks remain resident, avoids duplicate physical storage. A partially filled trailing block may not be reusable, depending on the engine's block policy.
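Block-level reuse can be sketched by hashing each full block together with the hash of the block before it, so that a block matches only when the entire preceding prefix matches. This is an illustrative scheme under assumed parameters: the block size of 4 tokens, the function names, and the policy of skipping a partial trailing block are choices made for the example.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical block size in tokens


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block, chaining in the previous block's hash.

    A partially filled trailing block is skipped.
    """
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        data = parent + b"|" + ",".join(map(str, block)).encode("utf-8")
        parent = hashlib.sha256(data).digest()
        hashes.append(parent)
    return hashes


def reusable_prefix_tokens(token_ids, resident, block_size=BLOCK_SIZE):
    """Count leading tokens covered by blocks already resident."""
    hits = 0
    for block_hash in block_hashes(token_ids, block_size):
        if block_hash not in resident:
            break
        hits += 1
    return hits * block_size


first = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # two full blocks plus two tokens
resident = set(block_hashes(first))

second = [1, 2, 3, 4, 5, 6, 7, 8, 20, 21, 22, 23]  # shares the first 8 tokens
third = [9, 2, 3, 4, 5, 6, 7, 8]  # differs at the first token

reused = reusable_prefix_tokens(second, resident)
missed = reusable_prefix_tokens(third, resident)
print(reused, missed)
```

The third request shares tokens 5 to 8 with the first, yet reuses nothing, because chaining ties each block to the prefix that precedes it.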

Prefix caching is not free memory reduction. Retained blocks consume cache capacity. Its net value depends on shared-prefix length, hit rate, request locality, and eviction policy.

6. KV-cache quantization

KV-cache quantization stores keys and values at fewer bits than BF16 or FP16. It reduces the cache's memory footprint and can reduce decode bandwidth when the attention kernel consumes the low-precision cache directly.

6.1 Quantization model

For a real-valued cache element x, an affine quantizer stores an integer q together with a scale a and zero point z. The scalar model is:

q = clip(round(x / a) + z, q_min, q_max)

x̂ = a(q − z)

Here x̂ denotes the reconstructed value. The scale and zero point can be assigned per tensor, per layer, per head, per channel, per token, or per group. Finer granularity generally reduces error, but increases metadata and may complicate kernels.
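The scalar model can be exercised on a small group of values. The sketch below assumes an unsigned integer range and a min/max fit for the scale and zero point, which is one common choice rather than the only one; the helper fit_affine and the sample values are illustrative. Python's built-in round uses round-half-to-even, which may differ from a given kernel's rounding mode.

```python
def quantize(x, scale, zero_point, q_min, q_max):
    """q = clip(round(x / a) + z, q_min, q_max)."""
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))


def dequantize(q, scale, zero_point):
    """x_hat = a * (q - z)."""
    return scale * (q - zero_point)


def fit_affine(values, bits):
    """Assumed min/max fit: map [min, max] onto the unsigned integer range."""
    q_min, q_max = 0, 2**bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (q_max - q_min)
    if scale == 0.0:
        scale = 1.0  # constant group: avoid division by zero
    zero_point = round(q_min - lo / scale)
    return scale, zero_point, q_min, q_max


group = [-0.8, -0.1, 0.0, 0.35, 1.2]  # one hypothetical quantization group
scale, zero_point, q_min, q_max = fit_affine(group, bits=4)

codes = [quantize(x, scale, zero_point, q_min, q_max) for x in group]
recon = [dequantize(q, scale, zero_point) for q in codes]
max_err = max(abs(x - r) for x, r in zip(group, recon))

print(codes)
print(f"scale={scale:.4f}, max reconstruction error={max_err:.4f}")
```

When no value is clipped, the reconstruction error of each element is at most half the scale, which is why finer groups with smaller ranges tend to reduce error.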

6.2 Quantized memory

If keys use b_K bits and values use b_V bits, the low-precision payload is:

M_KV,payload = L B T h_kv d (b_K + b_V) / 8

The total cache footprint includes quantization metadata:

M_KV,quant = M_KV,payload + M_meta

When both keys and values use the same bit width b, the payload-only compression factor relative to BF16 is:

Compression_payload = 16 / b

Stored K/V format	Payload bytes per key/value scalar pair	Payload-only reduction versus BF16
BF16 or FP16, 16 bits each	4 bytes	1×
FP8 or INT8, 8 bits each	2 bytes	2×
INT4, 4 bits each	1 byte	4×
INT2, 2 bits each	0.5 byte	8×
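Metadata lowers the effective compression below the payload-only figures in the table. The sketch below estimates both terms under an assumed layout: one scale and one zero point per group of 32 scalars, each stored in 2 bytes. The group size, the metadata size, and the function name are hypothetical, and the function assumes the scalar count divides evenly into groups.

```python
def quantized_kv_bytes(num_layers, batch, context_len, h_kv, d,
                       b_k, b_v, group_size, meta_bytes_per_group):
    """Return (payload_bytes, metadata_bytes) for a quantized KV cache."""
    scalars = num_layers * batch * context_len * h_kv * d  # per K and per V
    payload = scalars * (b_k + b_v) // 8
    groups = 2 * (scalars // group_size)  # key groups plus value groups
    meta = groups * meta_bytes_per_group
    return payload, meta


# INT4 keys and values; assumed 2-byte scale + 2-byte zero point per group.
payload, meta = quantized_kv_bytes(
    32, 1, 8192, h_kv=8, d=128,
    b_k=4, b_v=4, group_size=32, meta_bytes_per_group=4,
)
bf16_bytes = 2 * 32 * 1 * 8192 * 8 * 128 * 2

print(f"payload {payload / 2**20:.0f} MiB, metadata {meta / 2**20:.0f} MiB")
print(f"payload-only reduction {bf16_bytes / payload:.1f}x")
print(f"reduction with metadata {bf16_bytes / (payload + meta):.1f}x")
```

Under these assumptions the metadata adds one bit per scalar on average, so the 4× payload-only reduction for INT4 falls to 3.2×.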

6.3 Accuracy and kernel effects

Quantization does not affect all cache components identically. The distributions and functional roles of keys and values differ, so a practical design may use different bit widths or grouping rules for them. KIVI and KVQuant study low-bit KV-cache quantization with asymmetric or distribution-aware design choices [3], [4].

A smaller cache does not guarantee proportional speedup. The kernel must load quantized payloads, access scales and other metadata, and perform reconstruction or low-precision computation. Performance improves when this work is fused into the attention kernel and the reduction in GPU HBM traffic exceeds the added arithmetic and metadata cost.

Condition	Expected effect
Decode is GPU-HBM-bandwidth-bound and the kernel directly consumes quantized KV blocks	Reduced cache traffic can improve throughput or concurrency.
The cache is expanded to full precision in GPU HBM before attention	Most of the intended bandwidth saving can be lost.
Quantization groups are very small	Accuracy may improve, but metadata and kernel overhead can increase.
Bit width is too low for the model and group scheme	Quantization error can reduce model quality.

6.4 Quantization and prefix caching

Prefix caching and KV-cache quantization are complementary. Prefix caching reduces the number of prefix states that must be computed. Quantization reduces the bytes required to retain each state. A serving engine can cache and share quantized blocks when their model, format, and quantization parameters are compatible.

7. Summary

A KV cache stores the keys and values of already processed tokens at every decoder layer. It avoids recomputing those keys and values during autoregressive decode, but grows linearly with layer count, active sequences, context length, key/value heads, head dimension, and bytes per scalar:

M_KV,total = L B T h_kv d (s_K + s_V)

Prefix caching reuses KV states for an exact compatible prefix. It reduces redundant prefill work, but the suffix still attends to the cached prefix. KV-cache quantization reduces stored bytes per key/value element:

M_KV,quant = L B T h_kv d (b_K + b_V) / 8 + M_meta

The practical design problem is to balance cache capacity, GPU-HBM bandwidth, prefix-cache hit rate, quantization error, metadata overhead, and the efficiency of the attention kernel.

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, pp. 5998–6008, 2017. Available: NeurIPS proceedings.
[2] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with PagedAttention,” in Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, pp. 611–626, 2023, doi: 10.1145/3600006.3613165.
[3] Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu, “KIVI: A tuning-free asymmetric 2bit quantization for KV cache,” in Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 235, pp. 32332–32344, 2024. Available: PMLR.
[4] C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y. S. Shao, K. Keutzer, and A. Gholami, “KVQuant: Towards 10 million context length LLM inference with KV cache quantization,” in Advances in Neural Information Processing Systems, vol. 37, 2024. Available: NeurIPS proceedings.
[5] vLLM Project, “Automatic Prefix Caching,” vLLM Documentation, accessed Jun. 20, 2026. Available: vLLM documentation.