A key-value, or KV, cache is the per-layer storage of attention keys and values computed for tokens that an autoregressive model has already processed. It is used during inference, when the model generates one new token at a time.
At a decoder layer, the current token produces a query, a key, and a value. The query is used only for the current attention calculation. The key and value remain useful because every later token in the same layer may attend to them. The cache therefore retains keys and values, not past queries.
The cache avoids repeated prefix computation. Without it, generating each new token would require recomputing the keys and values of all preceding tokens at every decoder layer. With it, the model computes the key and value only for the newly appended token and reuses the stored prefix states.
Causal attention prevents a token from depending on later tokens. Therefore, appending later tokens does not modify the hidden state of an earlier token at a fixed layer. The key and value derived from that earlier hidden state also do not change, so they can be reused exactly during later decode steps.
All symbols are defined before their first use in an equation. The article first describes one sequence, then extends the memory formulas to a batch of active requests.
| Symbol | Definition |
|---|---|
| $t$ | Current token position in a sequence. |
| $u$ | Source-token position. |
| $T$ | Current context length. |
| $T_i$ | Context length of active sequence $i$, used when request lengths differ. |
| $\ell$ | Decoder-layer index. |
| $L$ | Total number of decoder layers. |
| $q_{\ell,t}$ | Query vector for token $t$ at layer $\ell$. |
| $k_{\ell,t}$ | Key vector for token $t$ at layer $\ell$. |
| $v_{\ell,t}$ | Value vector for token $t$ at layer $\ell$. |
| $K_{\ell,<t}$, $V_{\ell,<t}$ | Cached keys and values at layer $\ell$ for positions $1,\dots,t-1$. |
| $K_{\ell,\le t}$, $V_{\ell,\le t}$ | Cached keys and values at layer $\ell$ after position $t$ is appended. |
| $o_{\ell,t}$ | Attention output for token $t$ at layer $\ell$. |
| $B$ | Number of active sequences or request branches. |
| $h$ | Number of query heads. |
| $h_{kv}$ | Number of key/value heads. |
| $d$ | Dimension of one query, key, or value head. |
| $H_{kv}$ | Total key/value width; $H_{kv} = h_{kv}\,d$. |
| $s_K$, $s_V$ | Bytes per stored key or value scalar. |
| $s$ | Common bytes per scalar when keys and values use the same representation. For BF16 or FP16, $s = 2$. |
| $P$ | Length of a reusable prefix. |
| $U$ | Length of a new suffix; $U = T - P$. |
| $b_K$, $b_V$ | Stored bit widths for a quantized key or value scalar. |
| $M_{\text{meta}}$ | Memory used by quantization scales, zero points, and other metadata. |
For ordinary multi-head attention, hkv = h. In grouped-query attention, 1 < hkv < h; in multi-query attention, hkv = 1. KV-cache storage scales with hkv, not directly with the number of query heads h.
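The relationship between query heads and key/value heads can be sketched in a few lines. The function name `kv_head_for_query_head` is hypothetical, and the contiguous grouping (consecutive query heads share one key/value head) is an assumption; it is a common layout, but an implementation may order heads differently.

```python
def kv_head_for_query_head(query_head: int, h: int, h_kv: int) -> int:
    """Map a query-head index to the key/value head it reads from.

    Assumes contiguous groups: query heads 0..g-1 share KV head 0, and so on,
    where g = h // h_kv. This layout is an illustrative assumption.
    """
    if h % h_kv != 0:
        raise ValueError("h must be a multiple of h_kv")
    group_size = h // h_kv
    return query_head // group_size


# Grouped-query attention with 32 query heads and 8 key/value heads.
mapping = [kv_head_for_query_head(i, h=32, h_kv=8) for i in range(32)]

# Multi-query attention: every query head reads the single key/value head.
mqa_mapping = [kv_head_for_query_head(i, h=32, h_kv=1) for i in range(32)]
```

Only the distinct key/value heads are stored, so the cache for this example holds 8 heads per layer rather than 32.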
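Before token t is processed at layer ℓ, the cache contains keys and values for all earlier positions:

$$K_{\ell,<t} = \left[k_{\ell,1}, \dots, k_{\ell,t-1}\right], \qquad V_{\ell,<t} = \left[v_{\ell,1}, \dots, v_{\ell,t-1}\right]$$

After computing the current key and value, the cache contains positions 1,…,t:

$$K_{\ell,\le t} = \left[K_{\ell,<t},\; k_{\ell,t}\right], \qquad V_{\ell,\le t} = \left[V_{\ell,<t},\; v_{\ell,t}\right]$$

The current query is scored against the updated key cache, and the resulting weights aggregate the updated value cache. For a single head, with the softmax taken over source positions u:

$$o_{\ell,t} = \sum_{u=1}^{t} \operatorname{softmax}_{u}\!\left(\frac{q_{\ell,t}^{\top} k_{\ell,u}}{\sqrt{d}}\right) v_{\ell,u}$$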
The cache is maintained independently at every decoder layer. A model with L layers has L key caches and L value caches.
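The append-then-attend update can be checked with a toy single-head, single-layer sketch in plain Python. The projection matrices `Wq`, `Wk`, `Wv`, the helper names, and the random inputs are all illustrative assumptions; a real model also feeds each layer with the previous layer's outputs, which causal attention likewise leaves unchanged for earlier tokens.

```python
import math
import random


def project(x, W):
    """Multiply row vector x by matrix W (given as a list of rows)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]


def attend(q, keys, values):
    """Softmax attention of one query over all supplied keys and values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for j, vj in enumerate(v):
            out[j] += (w / total) * vj
    return out


def random_matrix(n):
    return [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]


random.seed(0)
d = 4
Wq, Wk, Wv = random_matrix(d), random_matrix(d), random_matrix(d)
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(6)]  # toy layer inputs

# Cached decode: compute one new key and value per step and reuse the rest.
k_cache, v_cache, cached_outputs = [], [], []
for x in tokens:
    k_cache.append(project(x, Wk))
    v_cache.append(project(x, Wv))
    cached_outputs.append(attend(project(x, Wq), k_cache, v_cache))

# Reference without a cache: recompute every prefix key and value at each step.
recomputed_outputs = []
for t in range(1, len(tokens) + 1):
    keys = [project(x, Wk) for x in tokens[:t]]
    vals = [project(x, Wv) for x in tokens[:t]]
    recomputed_outputs.append(attend(project(tokens[t - 1], Wq), keys, vals))

max_diff = max(
    abs(a - b)
    for c, r in zip(cached_outputs, recomputed_outputs)
    for a, b in zip(c, r)
)
```

Both paths produce the same outputs, but the cached loop performs one key and one value projection per step, while the reference loop performs t of each at step t.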
For one layer and one sequence of length T, each key tensor and each value tensor has shape hkv × T × d. The combined key/value storage at that layer is:

$$M_{\text{layer}} = h_{kv}\,T\,d\,(s_K + s_V)$$

When keys and values use the same precision, sK = sV = s:

$$M_{\text{layer}} = 2\,h_{kv}\,T\,d\,s$$
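For B active sequences, each with context length T, the total KV-cache memory across L decoder layers is:

$$M_{\text{KV}} = L\,B\,T\,h_{kv}\,d\,(s_K + s_V)$$

With equal key/value precision:

$$M_{\text{KV}} = 2\,L\,B\,T\,h_{kv}\,d\,s$$

For requests with unequal context lengths, replace $BT$ with $\sum_{i=1}^{B} T_i$:

$$M_{\text{KV}} = L\,h_{kv}\,d\,(s_K + s_V)\sum_{i=1}^{B} T_i$$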
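Appending one token to one active sequence adds:

$$\Delta M_{\text{token}} = L\,h_{kv}\,d\,(s_K + s_V)$$

For BF16 or FP16 keys and values, this is:

$$\Delta M_{\text{token}} = 4\,L\,h_{kv}\,d \ \text{bytes}$$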
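Consider L = 32 layers, hkv = 8, d = 128, B = 1, T = 8192, and BF16 storage, so s = 2 bytes. The KV cache is:

$$M_{\text{KV}} = 2 \cdot 32 \cdot 1 \cdot 8192 \cdot 8 \cdot 128 \cdot 2 \ \text{bytes} = 2^{30} \ \text{bytes} = 1 \ \text{GiB}$$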
If the same model used hkv = 32 rather than hkv = 8, the cache would require 4 GiB under the same conditions. This is one reason grouped-query attention is valuable in inference.
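The arithmetic above is easy to reproduce. The following is a minimal sketch; the function name `kv_cache_bytes` and its parameter names are illustrative, and it takes a list of context lengths so that it also covers requests of unequal length.

```python
def kv_cache_bytes(num_layers, context_lengths, h_kv, d, s_k=2, s_v=2):
    """KV-cache bytes across all layers for the given active sequences.

    context_lengths holds one context length per active sequence.
    s_k and s_v are bytes per stored key and value scalar (2 for BF16/FP16).
    """
    total_tokens = sum(context_lengths)
    return num_layers * total_tokens * h_kv * d * (s_k + s_v)


GIB = 1024 ** 3

gqa = kv_cache_bytes(32, [8192], h_kv=8, d=128)    # grouped-query example
mha = kv_cache_bytes(32, [8192], h_kv=32, d=128)   # same model with h_kv = h = 32
per_token = kv_cache_bytes(32, [1], h_kv=8, d=128) # growth per appended token

print(gqa / GIB, mha / GIB, per_token / 1024)  # 1.0 4.0 128.0
```

The last value shows that each appended token in the grouped-query example adds 128 KiB of cache across the 32 layers.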
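During decode, the KV cache has a resident HBM footprint and incurs per-token HBM read traffic. A lower-bound payload estimate of the bytes read in one decode step across all layers, written here as $R_{\text{step}}$ and assuming every cached key and value is read once, is:

$$R_{\text{step}} \approx L\,B\,T\,h_{kv}\,d\,(s_K + s_V)$$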
This is a payload estimate, not a hardware-independent prediction of exact HBM traffic. Actual traffic depends on GQA reuse, tiling, GPU SRAM capacity, cache layout, and the attention kernel. Nevertheless, it explains why small-batch decode is frequently constrained by GPU HBM bandwidth rather than floating-point throughput.
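To see why bandwidth dominates, the payload can be divided by a memory bandwidth to obtain a floor on decode-step time. This is a rough sketch: the bandwidth figure below is a placeholder assumption, not the specification of any particular GPU, and the estimate counts only KV-cache reads, ignoring model weights and activations.

```python
def decode_step_floor_seconds(cache_bytes, hbm_bytes_per_second):
    """Lower bound on one decode step if the whole KV cache is read once."""
    return cache_bytes / hbm_bytes_per_second


cache_bytes = 2 ** 30        # the 1 GiB example: L=32, h_kv=8, d=128, T=8192, BF16
assumed_bandwidth = 2.0e12   # hypothetical HBM bandwidth in bytes per second

floor_s = decode_step_floor_seconds(cache_bytes, assumed_bandwidth)
print(f"{floor_s * 1e3:.3f} ms per step from KV reads alone")  # 0.537 ms
```

The floor grows linearly with context length, while the arithmetic per generated token stays small, which is the pattern the paragraph above describes.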
Prefix caching reuses KV-cache states across requests that start with the same compatible token prefix. It targets redundant prefill; it does not alter ordinary within-request decode reuse.
Suppose a request has a reusable prefix of length P and a new suffix of length U, where T = P + U. If a cache entry exists for the prefix, the serving system reuses its KV blocks and computes the sequence state only for the suffix.
Without a cache hit, the leading exact-attention prefill term is $O(B\,h\,T^{2}\,d)$. With a reusable prefix, only the suffix query rows must be computed. They still score against all T cached and new keys, giving a leading suffix-attention term:

$$O(B\,h\,U\,T\,d)$$
Prefix reuse therefore skips prefill computation for the shared prefix. It does not remove the prefix from the attention context; suffix queries still attend to cached prefix keys and values.
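A small count of causal query-key pairs makes the saving concrete. This sketch counts pairs for one head of one sequence at one layer; the lengths are illustrative and the helper name `causal_pairs` is hypothetical.

```python
def causal_pairs(first_query, last_query):
    """Query-key pairs scored when positions first_query..last_query (1-indexed)
    each attend to every position up to and including themselves."""
    return sum(range(first_query, last_query + 1))


T, P = 8192, 6144   # illustrative context and reusable-prefix lengths
U = T - P           # suffix length

full_prefill = causal_pairs(1, T)      # no cache hit: all T query rows
suffix_only = causal_pairs(P + 1, T)   # cache hit: only the U suffix rows
skipped = causal_pairs(1, P)           # work avoided by reusing the prefix

fraction_remaining = suffix_only / full_prefill
```

The suffix work stays below the U·T bound from the leading term, and it remains larger than U·P because every suffix query still attends to the entire cached prefix.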
A prefix-cache hit is valid only when the cached KV states are exactly the states required by the new request. For example, two requests can share a cache entry when they use the same model and begin with the same token sequence. They cannot safely share it when the prefix differs after tokenization, the model or adapter differs, or a setting changes how keys and values are computed. Serving systems therefore store a cache key with each prefix block. The key identifies the computation that produced the block, such as the token IDs, model version, adapter, and relevant positional-encoding or multimodal settings.
Serving engines commonly retain cache entries in fixed-size token blocks. A matching request can reference immutable prefix blocks instead of recomputing or copying them. Block sharing reduces redundant prefill computation and, while the blocks remain resident, avoids duplicate physical storage. A partially filled trailing block may not be reusable, depending on the engine's block policy.
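One way to picture block-level cache keys is a chained hash: each full block's key covers the compatibility context, all earlier blocks, and its own token IDs. The sketch below is a simplified illustration, not the scheme of any specific serving engine; `BLOCK_SIZE`, `PrefixCache`, and the context tuple are hypothetical, and string placeholders stand in for real KV blocks.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; an arbitrary value for illustration


def block_keys(token_ids, context, block_size=BLOCK_SIZE):
    """Return one chained hash per full block; a partial trailing block gets none."""
    parent = hashlib.sha256(repr(context).encode()).hexdigest()
    keys = []
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = tuple(token_ids[start:start + block_size])
        parent = hashlib.sha256((parent + repr(block)).encode()).hexdigest()
        keys.append(parent)
    return keys


class PrefixCache:
    def __init__(self):
        self.blocks = {}  # cache key -> placeholder for an immutable KV block

    def insert(self, token_ids, context):
        for key in block_keys(token_ids, context):
            self.blocks.setdefault(key, f"kv-block-{len(self.blocks)}")

    def match(self, token_ids, context):
        """Number of leading tokens whose KV blocks are already cached."""
        hits = 0
        for key in block_keys(token_ids, context):
            if key not in self.blocks:
                break
            hits += 1
        return hits * BLOCK_SIZE


cache = PrefixCache()
ctx = ("model-v1", "no-adapter")  # hypothetical compatibility context
cache.insert([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], ctx)

same_prefix = cache.match([1, 2, 3, 4, 5, 6, 7, 8, 99, 100], ctx)
diverges_early = cache.match([1, 2, 3, 4, 9, 9, 9, 9], ctx)
other_model = cache.match([1, 2, 3, 4, 5, 6, 7, 8], ("model-v2", "no-adapter"))
```

Because each key chains over its predecessors, a block can match only when every earlier token and the context also match, which mirrors the exact-prefix validity condition described earlier.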
KV-cache quantization stores keys and values at fewer bits than BF16 or FP16. It reduces the cache's memory footprint and can reduce decode memory traffic when the attention kernel consumes the low-precision cache directly.
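For a real-valued cache element x, an affine quantizer stores an integer q, limited to a range [q_min, q_max] set by the bit width, together with a scale a and zero point z. The scalar model is:

$$q = \operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{x}{a}\right) + z,\; q_{\min},\; q_{\max}\right), \qquad \hat{x} = a\,(q - z)$$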
Here x̂ denotes the reconstructed value. The scale and zero point can be assigned per tensor, per layer, per head, per channel, per token, or per group. Finer granularity generally reduces error, but increases metadata and may complicate kernels.
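A per-group affine quantizer can be sketched in plain Python. This is a minimal illustration under stated assumptions: an unsigned integer range, a scale taken from the group's minimum and maximum, and an integer zero point. Production kernels choose ranges, rounding, and grouping differently, and the function names are hypothetical.

```python
def quantize_group(xs, bits):
    """Quantize one group of floats to unsigned integers in [0, 2**bits - 1]."""
    q_max = (1 << bits) - 1
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / q_max if hi > lo else 1.0
    zero_point = round(-lo / scale)
    qs = [min(q_max, max(0, round(x / scale) + zero_point)) for x in xs]
    return qs, scale, zero_point


def dequantize_group(qs, scale, zero_point):
    """Reconstruct approximate values from stored integers."""
    return [scale * (q - zero_point) for q in qs]


group = [-1.0, -0.25, 0.0, 0.5, 2.0]   # toy cache elements sharing one scale
qs, scale, zero_point = quantize_group(group, bits=4)
restored = dequantize_group(qs, scale, zero_point)
max_error = max(abs(x - r) for x, r in zip(group, restored))
```

With this scheme the reconstruction error of each element is at most half the scale, so a group with a narrower value range, or a larger bit width, reconstructs more accurately.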
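If keys use bK bits and values use bV bits, the low-precision payload in bytes is:

$$M_{\text{payload}} = L\,B\,T\,h_{kv}\,d\,\frac{b_K + b_V}{8}$$

The total cache footprint includes quantization metadata:

$$M_{\text{total}} = M_{\text{payload}} + M_{\text{meta}}$$

When both keys and values use the same bit width b, the payload-only compression factor relative to BF16 is:

$$\frac{16 + 16}{b + b} = \frac{16}{b}$$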
| Stored K/V format | Payload bytes per key/value scalar pair | Payload-only reduction versus BF16 |
|---|---|---|
| BF16 or FP16, 16 bits each | 4 bytes | 1× |
| FP8 or INT8, 8 bits each | 2 bytes | 2× |
| INT4, 4 bits each | 1 byte | 4× |
| INT2, 2 bits each | 0.5 byte | 8× |
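The table reports payload only. A short calculation shows how metadata lowers the effective reduction. The group size and metadata bytes per group below are assumptions chosen for illustration (for example, a 2-byte scale plus a 2-byte zero point per group of 32 elements); actual formats vary.

```python
def quantized_kv_bytes(num_layers, total_tokens, h_kv, d,
                       bits_k, bits_v, group_size, meta_bytes_per_group):
    """Return (payload_bytes, metadata_bytes) for a group-quantized KV cache."""
    elements_per_tensor = num_layers * total_tokens * h_kv * d  # keys or values
    payload = elements_per_tensor * (bits_k + bits_v) // 8
    groups = 2 * elements_per_tensor // group_size              # keys plus values
    metadata = groups * meta_bytes_per_group
    return payload, metadata


# Same model as the 1 GiB BF16 example: L=32, T=8192, h_kv=8, d=128.
bf16_bytes = 32 * 8192 * 8 * 128 * 4

payload, metadata = quantized_kv_bytes(
    32, 8192, h_kv=8, d=128,
    bits_k=4, bits_v=4,
    group_size=32,            # assumed elements per quantization group
    meta_bytes_per_group=4,   # assumed scale and zero-point storage
)

payload_only_factor = bf16_bytes / payload
effective_factor = bf16_bytes / (payload + metadata)
print(payload_only_factor, effective_factor)  # 4.0 3.2
```

Under these assumptions the INT4 cache is 4× smaller by payload but 3.2× smaller once metadata is included, and smaller groups would push the effective factor lower.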
Quantization does not affect all cache components identically. The distributions and functional roles of keys and values differ, so a practical design may use different bit widths or grouping rules for them. KIVI and KVQuant study low-bit KV-cache quantization with asymmetric or distribution-aware design choices [3], [4].
A smaller cache does not guarantee proportional speedup. The kernel must load quantized payloads, access scales and other metadata, and perform reconstruction or low-precision computation. Performance improves when this work is fused into the attention kernel and the reduction in GPU HBM traffic exceeds the added arithmetic and metadata cost.
| Condition | Expected effect |
|---|---|
| Decode is GPU-HBM-bandwidth-bound and the kernel directly consumes quantized KV blocks | Reduced cache traffic can improve throughput or concurrency. |
| The cache is expanded to full precision in GPU HBM before attention | Most of the intended bandwidth saving can be lost. |
| Quantization groups are very small | Accuracy may improve, but metadata and kernel overhead can increase. |
| Bit width is too low for the model and group scheme | Quantization error can reduce model quality. |
Prefix caching and KV-cache quantization are complementary. Prefix caching reduces the number of prefix states that must be computed. Quantization reduces the bytes required to retain each state. A serving engine can cache and share quantized blocks when their model, format, and quantization parameters are compatible.
A KV cache stores the keys and values of already processed tokens at every decoder layer. It avoids recomputing those keys and values during autoregressive decode, but grows linearly with layer count, active sequences, context length, key/value heads, head dimension, and bytes per scalar:

$$M_{\text{KV}} = 2\,L\,B\,T\,h_{kv}\,d\,s$$
Prefix caching reuses KV states for an exact compatible prefix. It reduces redundant prefill work, but the suffix still attends to the cached prefix. KV-cache quantization reduces stored bytes per key/value element:

$$M_{\text{total}} = L\,B\,T\,h_{kv}\,d\,\frac{b_K + b_V}{8} + M_{\text{meta}}$$
The practical design problem is to balance cache capacity, GPU-HBM bandwidth, prefix-cache hit rate, quantization error, metadata overhead, and the efficiency of the attention kernel.