Self-attention is a sequence-to-sequence operation. It takes a sequence of token representations as input and produces a new sequence of token representations as output. The output representation at each position is computed by reading from other positions in the same sequence.
In an autoregressive decoder, the reading direction is causal. The representation at position \(t\) may read from positions \(1,\dots,t\), but not from positions \(t+1,\dots,T\). Therefore, self-attention is the mechanism by which a token representation can use information from the prefix.
The word attention refers to the weights used for this reading operation. For a fixed target position, the layer assigns one weight to each allowed source position. The output at the target position is a weighted average of the source value vectors. These weights are not fixed by distance. They are computed from the current hidden representations.
Autoregressive LLMs predict the next token from a prefix. The prefix may contain syntax, entities, instructions, definitions, retrieved documents, code, or previous dialogue turns. The model needs a mechanism that can select which prefix positions are relevant to the current computation.
Self-attention provides explicit content-addressed access to the prefix. A target position forms a query. Prefix positions provide keys and values. The query is compared with keys to decide which values should contribute to the target representation.
This design gives strong modeling flexibility, but it is expensive. In prefill, all prompt positions are processed together, so all query positions interact with all allowed key positions. In decode, one new query is processed at a time, but it must read keys and values stored for the entire prefix. Therefore, attention creates two different systems bottlenecks:
Long context is useful for code repositories, long documents, clinical records, legal records, conversation memory, agent traces, and retrieval-augmented generation. These workloads make the cost of exact self-attention a central runtime and memory problem.
| Question | Mechanism | Precise interpretation |
|---|---|---|
| What is the baseline sequence operator? | Exact causal self-attention | Softmax-normalized content-dependent aggregation over the causal prefix. |
| Which costs are implementation artifacts? | FlashAttention | The full score and probability matrices need not be written to GPU HBM. |
| Can softmax attention be replaced by an associative operator? | Linear attention | Feature-map attention avoids explicit \(T\times T\) pairwise score matrices. |
| Can long-range mixing be performed without query-key attention? | Hyena-style long convolution | Structured long convolutions provide subquadratic sequence mixing with different inductive bias. |
| Can the prefix be represented by a compact recurrent state? | Mamba-style selective SSM | Selective state-space recurrence targets linear-time sequence modeling and bounded decode state. |
| Can exact attention be used only where it is most valuable? | Hybrid sequence models | Exact attention can be combined with local, linear, convolutional, or state-space operators. |
| Term | Definition |
|---|---|
| Token | A discrete input unit processed by the model. A text sequence of length \(T\) contains \(T\) tokens. |
| Position | The index of a token in the sequence. Position \(t\) denotes the \(t\)-th token. |
| Hidden representation | A vector representation of a token at some layer of the model. |
| Query | A vector produced from the target position. It represents what the target position asks for. |
| Key | A vector produced from a source position. It represents what the source position offers for matching. |
| Value | A vector produced from a source position. It is the information aggregated after attention weights are computed. |
| Attention weight | A normalized scalar weight assigned by a target position to a source position. |
| Causal mask | A mask that prevents a position from reading future positions. |
| Prefill | The phase in which the model processes the full prompt and builds the initial KV cache. |
| Decode | The phase in which the model generates new tokens one at a time using the KV cache. |
| KV cache | The stored keys and values from previous tokens used during autoregressive decode. |
| GPU HBM | High-bandwidth memory attached to the GPU. It is large relative to SRAM but slower to access. |
| GPU SRAM | On-chip scratchpad/cache memory used by GPU kernels. It is small but has high bandwidth and low latency. |
| Sequence operator | A layer component that mixes information across token positions. |
| Symbol | Definition |
|---|---|
| \(B\) | Batch size. |
| \(T\) | Current context length. |
| \(T_0\) | Initial prompt length before decode. |
| \(T_q\) | Query length. In prefill, \(T_q=T\). In decode, \(T_q=1\). |
| \(G\) | Number of generated decode tokens. |
| \(L\) | Number of decoder layers. |
| \(H\) | Residual-stream hidden dimension. |
| \(h\) | Number of query heads. |
| \(h_{kv}\) | Number of key/value heads. For MHA, \(h_{kv}=h\). For GQA/MQA, \(h_{kv} |
| \(d\) | Per-head dimension. |
| \(H_q\) | Total query projection dimension, \(H_q=hd\). |
| \(H_{kv}\) | Total key/value projection dimension, \(H_{kv}=h_{kv}d\). |
| \(X\) | Input hidden-state matrix for one layer. |
| \(Q,K,V\) | Query, key, and value matrices or tensors. |
| \(S\) | Attention score matrix or tensor before softmax. |
| \(P\) | Attention probability matrix or tensor after softmax. |
| \(O\) | Attention output before the output projection. |
| \(W_Q,W_K,W_V,W_O\) | Projection matrices. |
| \(M_{\mathrm{causal}}\) | Causal mask. |
| \(s\) | Bytes per scalar element. For BF16 or FP16, \(s=2\). |
| \(r\) | Feature dimension used by a linear-attention feature map. |
| \(w\) | Window size for local or sliding-window attention. |
| \(n_s\) | State dimension used by a state-space sequence model. |
| \(c_s\) | Per-token cost of a non-attention sequence mixer in high-level estimates. |
The cost model reports leading-order sequence-mixing terms. Wall-clock runtime also depends on tensor-core utilization, kernel fusion, memory layout, GPU occupancy, parallelization, and communication overhead.
| Method | Preserves exact softmax attention? | Prefill sequence-mixing work | Extra score/probability memory | Decode per-token work | Decode state or cache | Primary systems bottleneck |
|---|---|---|---|---|---|---|
| Naive causal attention | Yes | \(O(BhT^2d)\) | \(O(BhT^2)\) | \(O(BhTd)\) | \(O(LBTH_{kv})\) KV cache | Quadratic HBM traffic and activation memory in prefill; KV-cache bandwidth in decode. |
| FlashAttention | Yes | \(O(BhT^2d)\) | Avoids materialized \(S,P\) | \(O(BhTd)\) | Same KV cache as exact attention | Quadratic compute remains; HBM traffic is reduced by SRAM tiling and online softmax. |
| Linear attention | No | \(O(BhTrd)\) | No \(T\times T\) attention matrix | \(O(Bhrd)\) | Feature-map recurrent state \(O(Bh(rd+r))\) | Feature-map expressivity, numerical stability, and approximation quality. |
| Sliding-window attention | Exact only within the window | \(O(BhTwd)\) | \(O(BhTw)\) | \(O(Bhwd)\) | Windowed KV cache \(O(LBwh_{kv}d)\) | Restricted receptive field unless combined with global or sparse paths. |
| Hyena-style long convolution | No | Often subquadratic, e.g. \(O(BHT\log T)\) depending on implementation | No softmax attention matrix | Implementation-dependent streaming state | Convolutional or recurrent representation | Balancing long-range mixing with content adaptivity. |
| Mamba-style selective SSM | No | Designed for linear sequence scaling, e.g. \(O(BTHn_s)\) | No softmax attention matrix | \(O(BHn_s)\) | Selective recurrent state \(O(LBHn_s)\) | State capacity and implicit retrieval from compressed history. |
| Hybrid sequence model | Partially | Depends on allocation across operators | Depends on exact-attention fraction | Mixed KV cache and recurrent states | Architecture-dependent | Operator allocation, kernel fusion, normalization, and quality tradeoffs. |
Consider one sequence and one attention head. The hidden-state matrix is
\[ X=[x_1,\dots,x_T]^\top\in\mathbb{R}^{T\times H}. \]
The query, key, and value matrices are
\[ Q=XW_Q,\qquad K=XW_K,\qquad V=XW_V, \]
where
\[ Q,K,V\in\mathbb{R}^{T\times d}. \]
For position \(t\), the output is
\[ o_t = \sum_{u\leq t}\alpha_{t,u}v_u. \]
The attention weight from target position \(t\) to source position \(u\) is
\[ \alpha_{t,u} = \frac{\exp(q_t^\top k_u/\sqrt d)} {\sum_{j\leq t}\exp(q_t^\top k_j/\sqrt d)}. \]
The same operation can be written as
\[ O = \mathrm{softmax} \left( \frac{QK^\top}{\sqrt d} + M_{\mathrm{causal}} \right)V. \]
The causal mask is
\[ M_{\mathrm{causal}}(t,u)= \begin{cases} 0, & u\leq t,\\ -\infty, & u>t. \end{cases} \]
With batch size \(B\), query heads \(h\), and key/value heads \(h_{kv}\),
\[ Q\in\mathbb{R}^{B\times h\times T_q\times d}, \]
\[ K,V\in\mathbb{R}^{B\times h_{kv}\times T\times d}. \]
For MHA, \(h_{kv}=h\). For GQA or MQA, \(h_{kv}
In prefill, \(T_q=T\). The score tensor has shape
\[
S\in\mathbb{R}^{B\times h\times T\times T}.
\]
Thus the prefill attention core costs
\[
O(BhT^2d).
\]
When \(H=hd\), this is \(O(BT^2H)\). Dense QKV and output projections scale as \(O(BTH^2)\). Therefore, for
moderate \(T\), projections and MLPs may dominate; for sufficiently long \(T\), the quadratic attention core becomes
dominant.
A naive implementation materializes
\[
S,P\in\mathbb{R}^{B\times h\times T\times T}.
\]
The two tensors require
\[
2BhT^2s
\]
bytes. In contrast, the QKV tensors require
\[
BT(h+2h_{kv})ds
\]
bytes. For MHA, this becomes \(3BTHs\). Therefore, score/probability storage scales quadratically in \(T\), while
QKV storage scales linearly in \(T\).
In decode, \(T_q=1\). For each generated token,
\[
Q_{\mathrm{new}}\in\mathbb{R}^{B\times h\times 1\times d},
\]
while
\[
K_{\mathrm{cache}},V_{\mathrm{cache}}
\in\mathbb{R}^{B\times h_{kv}\times T\times d}.
\]
The attention core costs
\[
O(BhTd)
\]
per generated token. For \(G\) generated tokens after a prompt of length \(T_0\), the cumulative decode attention
work is
\[
O\left(
Bhd\sum_{g=1}^{G}(T_0+g)
\right)
=
O\left(Bhd(GT_0+G^2)\right).
\]
The KV cache for one layer is
\[
K,V\in\mathbb{R}^{B\times h_{kv}\times T\times d}.
\]
The memory per layer is
\[
\mathrm{Mem}_{\mathrm{KV,layer}}
=
2BTh_{kv}ds
=
2BTH_{kv}s.
\]
Across \(L\) layers,
\[
\mathrm{Mem}_{\mathrm{KV,total}}
=
2LBTH_{kv}s.
\]
At each decode step, the model streams the relevant cached keys and values. The approximate read volume across all
layers is
\[
2LBTh_{kv}ds.
\]
This is why decode is often memory-bandwidth-bound even though the arithmetic cost is linear in \(T\) per generated
token.
FlashAttention computes exact causal softmax attention:
\[
O
=
\mathrm{softmax}
\left(
\frac{QK^\top}{\sqrt d}
+
M_{\mathrm{causal}}
\right)V.
\]
It changes the execution schedule, not the sequence operator. A naive implementation writes the \(S\) and \(P\)
tensors to GPU HBM. FlashAttention tiles \(Q\), \(K\), and \(V\), moves tiles through GPU SRAM, and maintains the
softmax state without materializing \(S\) and \(P\) in HBM.
For a row \(i\), the attention output is
\[
o_i
=
\frac{\sum_j \exp(s_{ij})v_j}{\sum_j \exp(s_{ij})}.
\]
For numerical stability, define
\[
m_i=\max_j s_{ij}.
\]
Then
\[
o_i
=
\frac{\sum_j \exp(s_{ij}-m_i)v_j}{\sum_j \exp(s_{ij}-m_i)}.
\]
FlashAttention processes \(K,V\) blocks sequentially for a block of queries. It updates the row maximum,
denominator, and partial numerator in SRAM. The final output is exact, but the quadratic score and probability
matrices are not written to HBM.
FlashAttention reduces IO cost and activation-memory pressure. It does not remove the quadratic number of query-key
interactions.
Linear attention replaces the softmax attention kernel with a feature-map similarity:
\[
\mathrm{sim}(q,k)=\phi(q)^\top\phi(k),
\]
where
\[
\phi(q),\phi(k)\in\mathbb{R}^{r}.
\]
For causal linear attention,
\[
o_t
=
\frac{
\phi(q_t)^\top
\left(
\sum_{u\leq t}\phi(k_u)v_u^\top
\right)
}{
\phi(q_t)^\top
\left(
\sum_{u\leq t}\phi(k_u)
\right)
}.
\]
Define the recurrent states
\[
A_t=
\sum_{u\leq t}\phi(k_u)v_u^\top
\in\mathbb{R}^{r\times d},
\]
and
\[
b_t=
\sum_{u\leq t}\phi(k_u)
\in\mathbb{R}^{r}.
\]
Then
\[
o_t
=
\frac{\phi(q_t)^\top A_t}{\phi(q_t)^\top b_t}.
\]
For each token and each head, updating \(A_t\) costs \(O(rd)\), updating \(b_t\) costs \(O(r)\), and producing
\(o_t\) costs \(O(rd)\). The leading cost over \(T\) tokens is
\[
O(Trd)
\]
per head, or
\[
O(BhTrd)
\]
with batch and heads.
The recurrent state per batch element and per head is
\[
A_t\in\mathbb{R}^{r\times d},
\qquad
b_t\in\mathbb{R}^{r}.
\]
This avoids a \(T\times T\) attention matrix, but it changes the sequence operator. The quality and stability depend
on the feature map \(\phi\), feature dimension \(r\), normalization, and training dynamics.
Hyena is an attention alternative, not an attention implementation. It uses long convolutions and data-controlled
gating to obtain subquadratic sequence mixing.
A causal long convolution can be written as
\[
y_t
=
\sum_{u\leq t} k_{t-u}x_u.
\]
Hyena-style architectures combine such long filters with multiplicative gates:
\[
Y
=
G(X)\odot \mathrm{LongConv}(X).
\]
The convolution supplies long-range mixing. The gate supplies input dependence. Depending on the filter
parameterization and implementation, the sequence cost can be subquadratic, for example through FFT-based or
structured convolution methods.
Mamba is a selective state-space sequence model. It is not an attention mechanism. Its relevance here is that it
targets long-context sequence modeling with a recurrent state rather than an explicit KV cache.
A simplified state-space recurrence is
\[
x_{t+1}=A x_t+B u_t,
\]
\[
y_t=C x_t+D u_t.
\]
Selective state-space models make parts of the recurrence input-dependent. A schematic form is
\[
x_{t+1}=A(u_t)x_t+B(u_t)u_t,
\]
\[
y_t=C(u_t)x_t.
\]
This expression is schematic. It shows the modeling distinction: the update depends on the current token, allowing
the state to selectively propagate or forget information.
Hybrid models combine exact attention with lower-cost sequence mixers. The premise is that full pairwise
content-addressing may not be required in every layer, every head, or every token range.
Suppose a model has \(L\) sequence-mixing layers:
\[
L=L_a+L_l+L_s,
\]
where \(L_a\) layers use exact attention, \(L_l\) layers use linear or local attention, and \(L_s\) layers use
state-space or convolutional mixers.
A representative prefill cost is
\[
O(L_aBhT^2d)
+
O(L_lBhTrd)
+
O(L_sBTc_s).
\]
This expression makes the allocation problem explicit: exact attention layers preserve pairwise token interaction,
while cheaper layers reduce the average sequence-mixing cost.
Self-attention is content-dependent prefix aggregation. Each position computes weights over allowed source positions
and uses those weights to aggregate source values. In autoregressive LLMs, the allowed source positions are the
current token and the previous tokens.
Exact causal attention has prefill sequence-mixing cost
\[
O(BhT^2d).
\]
A naive implementation may also materialize score and probability tensors of size
\[
O(BhT^2).
\]
In decode, exact attention has per-token cost
\[
O(BhTd),
\]
and a total KV-cache memory footprint
\[
2LBTH_{kv}s.
\]
FlashAttention preserves exact softmax attention and reduces HBM traffic by using SRAM-resident tiles and online
softmax. Linear attention replaces softmax attention with an associative feature-map formulation. Hyena and Mamba
replace attention with long-convolution or selective state-space sequence mixers. Hybrid models combine these
mechanisms to choose how much exact pairwise interaction is used.
8. Prefill and decode
8.1 Prefill runtime
Operation
Expression
Leading work
Score computation
\(QK^\top\)
\(O(BhT^2d)\)
Causal softmax
\(\mathrm{softmax}(S+M_{\mathrm{causal}})\)
\(O(BhT^2)\)
Value aggregation
\(PV\)
\(O(BhT^2d)\)
8.2 Prefill activation memory
8.3 Decode runtime
8.4 KV-cache memory and bandwidth
9. FlashAttention
9.1 Exact operator, IO-aware execution
9.2 Online softmax
9.3 Runtime and memory implications
Quantity
Naive exact attention
FlashAttention
Output
Exact causal softmax attention
Exact causal softmax attention
Attention-core work
\(O(BhT^2d)\)
\(O(BhT^2d)\)
Materialized \(S,P\)
\(O(BhT^2)\) elements in HBM
Avoided
Memory-system effect
High HBM traffic
Improved SRAM reuse and reduced HBM transactions
10. Linear attention
10.1 Feature-map formulation
10.2 Runtime and memory
11. Hyena-style long-convolution sequence mixing
12. Mamba-style selective state-space sequence modeling
13. Hybrid sequence models
14. Summary
References