A Beginner's Guide to Understanding Self-Attention in Autoregressive LLMs

Table of contents
  1. Self-attention
  2. Why autoregressive LLMs use self-attention
  3. Questions addressed in this article
  4. Terminology
  5. Notation and cost model
  6. Runtime and memory comparison
  7. Exact causal self-attention
  8. Prefill and decode
  9. FlashAttention
  10. Linear attention
  11. Hyena-style long-convolution sequence mixing
  12. Mamba-style selective state-space sequence modeling
  13. Hybrid sequence models
  14. Summary
  15. References

1. Self-attention

Self-attention is a sequence-to-sequence operation. It takes a sequence of token representations as input and produces a new sequence of token representations as output. The output representation at each position is computed by reading from other positions in the same sequence.

In an autoregressive decoder, the reading direction is causal. The representation at position \(t\) may read from positions \(1,\dots,t\), but not from positions \(t+1,\dots,T\). Therefore, self-attention is the mechanism by which a token representation can use information from the prefix.

The word attention refers to the weights used for this reading operation. For a fixed target position, the layer assigns one weight to each allowed source position. The output at the target position is a weighted average of the source value vectors. These weights are not fixed by distance. They are computed from the current hidden representations.

Self-attention is content-dependent prefix aggregation: each position computes weights over allowed source positions and uses those weights to aggregate source values.

2. Why autoregressive LLMs use self-attention

Autoregressive LLMs predict the next token from a prefix. The prefix may contain syntax, entities, instructions, definitions, retrieved documents, code, or previous dialogue turns. The model needs a mechanism that can select which prefix positions are relevant to the current computation.

Self-attention provides explicit content-addressed access to the prefix. A target position forms a query. Prefix positions provide keys and values. The query is compared with keys to decide which values should contribute to the target representation.

This design gives strong modeling flexibility, but it is expensive. In prefill, all prompt positions are processed together, so all query positions interact with all allowed key positions. In decode, one new query is processed at a time, but it must read keys and values stored for the entire prefix. Therefore, attention creates two different systems bottlenecks:

Long context is useful for code repositories, long documents, clinical records, legal records, conversation memory, agent traces, and retrieval-augmented generation. These workloads make the cost of exact self-attention a central runtime and memory problem.

3. Questions addressed in this article

Question Mechanism Precise interpretation
What is the baseline sequence operator? Exact causal self-attention Softmax-normalized content-dependent aggregation over the causal prefix.
Which costs are implementation artifacts? FlashAttention The full score and probability matrices need not be written to GPU HBM.
Can softmax attention be replaced by an associative operator? Linear attention Feature-map attention avoids explicit \(T\times T\) pairwise score matrices.
Can long-range mixing be performed without query-key attention? Hyena-style long convolution Structured long convolutions provide subquadratic sequence mixing with different inductive bias.
Can the prefix be represented by a compact recurrent state? Mamba-style selective SSM Selective state-space recurrence targets linear-time sequence modeling and bounded decode state.
Can exact attention be used only where it is most valuable? Hybrid sequence models Exact attention can be combined with local, linear, convolutional, or state-space operators.

4. Terminology

TermDefinition
Token A discrete input unit processed by the model. A text sequence of length \(T\) contains \(T\) tokens.
Position The index of a token in the sequence. Position \(t\) denotes the \(t\)-th token.
Hidden representation A vector representation of a token at some layer of the model.
Query A vector produced from the target position. It represents what the target position asks for.
Key A vector produced from a source position. It represents what the source position offers for matching.
Value A vector produced from a source position. It is the information aggregated after attention weights are computed.
Attention weight A normalized scalar weight assigned by a target position to a source position.
Causal mask A mask that prevents a position from reading future positions.
Prefill The phase in which the model processes the full prompt and builds the initial KV cache.
Decode The phase in which the model generates new tokens one at a time using the KV cache.
KV cache The stored keys and values from previous tokens used during autoregressive decode.
GPU HBM High-bandwidth memory attached to the GPU. It is large relative to SRAM but slower to access.
GPU SRAM On-chip scratchpad/cache memory used by GPU kernels. It is small but has high bandwidth and low latency.
Sequence operator A layer component that mixes information across token positions.

5. Notation and cost model

SymbolDefinition
\(B\)Batch size.
\(T\)Current context length.
\(T_0\)Initial prompt length before decode.
\(T_q\)Query length. In prefill, \(T_q=T\). In decode, \(T_q=1\).
\(G\)Number of generated decode tokens.
\(L\)Number of decoder layers.
\(H\)Residual-stream hidden dimension.
\(h\)Number of query heads.
\(h_{kv}\)Number of key/value heads. For MHA, \(h_{kv}=h\). For GQA/MQA, \(h_{kv}
\(d\)Per-head dimension.
\(H_q\)Total query projection dimension, \(H_q=hd\).
\(H_{kv}\)Total key/value projection dimension, \(H_{kv}=h_{kv}d\).
\(X\)Input hidden-state matrix for one layer.
\(Q,K,V\)Query, key, and value matrices or tensors.
\(S\)Attention score matrix or tensor before softmax.
\(P\)Attention probability matrix or tensor after softmax.
\(O\)Attention output before the output projection.
\(W_Q,W_K,W_V,W_O\)Projection matrices.
\(M_{\mathrm{causal}}\)Causal mask.
\(s\)Bytes per scalar element. For BF16 or FP16, \(s=2\).
\(r\)Feature dimension used by a linear-attention feature map.
\(w\)Window size for local or sliding-window attention.
\(n_s\)State dimension used by a state-space sequence model.
\(c_s\)Per-token cost of a non-attention sequence mixer in high-level estimates.

The cost model reports leading-order sequence-mixing terms. Wall-clock runtime also depends on tensor-core utilization, kernel fusion, memory layout, GPU occupancy, parallelization, and communication overhead.

6. Runtime and memory comparison

Method Preserves exact softmax attention? Prefill sequence-mixing work Extra score/probability memory Decode per-token work Decode state or cache Primary systems bottleneck
Naive causal attention Yes \(O(BhT^2d)\) \(O(BhT^2)\) \(O(BhTd)\) \(O(LBTH_{kv})\) KV cache Quadratic HBM traffic and activation memory in prefill; KV-cache bandwidth in decode.
FlashAttention Yes \(O(BhT^2d)\) Avoids materialized \(S,P\) \(O(BhTd)\) Same KV cache as exact attention Quadratic compute remains; HBM traffic is reduced by SRAM tiling and online softmax.
Linear attention No \(O(BhTrd)\) No \(T\times T\) attention matrix \(O(Bhrd)\) Feature-map recurrent state \(O(Bh(rd+r))\) Feature-map expressivity, numerical stability, and approximation quality.
Sliding-window attention Exact only within the window \(O(BhTwd)\) \(O(BhTw)\) \(O(Bhwd)\) Windowed KV cache \(O(LBwh_{kv}d)\) Restricted receptive field unless combined with global or sparse paths.
Hyena-style long convolution No Often subquadratic, e.g. \(O(BHT\log T)\) depending on implementation No softmax attention matrix Implementation-dependent streaming state Convolutional or recurrent representation Balancing long-range mixing with content adaptivity.
Mamba-style selective SSM No Designed for linear sequence scaling, e.g. \(O(BTHn_s)\) No softmax attention matrix \(O(BHn_s)\) Selective recurrent state \(O(LBHn_s)\) State capacity and implicit retrieval from compressed history.
Hybrid sequence model Partially Depends on allocation across operators Depends on exact-attention fraction Mixed KV cache and recurrent states Architecture-dependent Operator allocation, kernel fusion, normalization, and quality tradeoffs.
The table compares sequence-mixing costs, not the full decoder block. QKV projections, output projections, MLPs, normalization, communication, and sampling are outside this table unless explicitly stated.

7. Exact causal self-attention

7.1 Scalar definition

Consider one sequence and one attention head. The hidden-state matrix is

\[ X=[x_1,\dots,x_T]^\top\in\mathbb{R}^{T\times H}. \]

The query, key, and value matrices are

\[ Q=XW_Q,\qquad K=XW_K,\qquad V=XW_V, \]

where

\[ Q,K,V\in\mathbb{R}^{T\times d}. \]

For position \(t\), the output is

\[ o_t = \sum_{u\leq t}\alpha_{t,u}v_u. \]

The attention weight from target position \(t\) to source position \(u\) is

\[ \alpha_{t,u} = \frac{\exp(q_t^\top k_u/\sqrt d)} {\sum_{j\leq t}\exp(q_t^\top k_j/\sqrt d)}. \]

7.2 Matrix form

The same operation can be written as

\[ O = \mathrm{softmax} \left( \frac{QK^\top}{\sqrt d} + M_{\mathrm{causal}} \right)V. \]

The causal mask is

\[ M_{\mathrm{causal}}(t,u)= \begin{cases} 0, & u\leq t,\\ -\infty, & u>t. \end{cases} \]

7.3 Batch and head dimensions

With batch size \(B\), query heads \(h\), and key/value heads \(h_{kv}\),

\[ Q\in\mathbb{R}^{B\times h\times T_q\times d}, \]

\[ K,V\in\mathbb{R}^{B\times h_{kv}\times T\times d}. \]

For MHA, \(h_{kv}=h\). For GQA or MQA, \(h_{kv}

8. Prefill and decode

8.1 Prefill runtime

In prefill, \(T_q=T\). The score tensor has shape

\[ S\in\mathbb{R}^{B\times h\times T\times T}. \]

Operation Expression Leading work
Score computation \(QK^\top\) \(O(BhT^2d)\)
Causal softmax \(\mathrm{softmax}(S+M_{\mathrm{causal}})\) \(O(BhT^2)\)
Value aggregation \(PV\) \(O(BhT^2d)\)

Thus the prefill attention core costs

\[ O(BhT^2d). \]

When \(H=hd\), this is \(O(BT^2H)\). Dense QKV and output projections scale as \(O(BTH^2)\). Therefore, for moderate \(T\), projections and MLPs may dominate; for sufficiently long \(T\), the quadratic attention core becomes dominant.

8.2 Prefill activation memory

A naive implementation materializes

\[ S,P\in\mathbb{R}^{B\times h\times T\times T}. \]

The two tensors require

\[ 2BhT^2s \]

bytes. In contrast, the QKV tensors require

\[ BT(h+2h_{kv})ds \]

bytes. For MHA, this becomes \(3BTHs\). Therefore, score/probability storage scales quadratically in \(T\), while QKV storage scales linearly in \(T\).

8.3 Decode runtime

In decode, \(T_q=1\). For each generated token,

\[ Q_{\mathrm{new}}\in\mathbb{R}^{B\times h\times 1\times d}, \]

while

\[ K_{\mathrm{cache}},V_{\mathrm{cache}} \in\mathbb{R}^{B\times h_{kv}\times T\times d}. \]

The attention core costs

\[ O(BhTd) \]

per generated token. For \(G\) generated tokens after a prompt of length \(T_0\), the cumulative decode attention work is

\[ O\left( Bhd\sum_{g=1}^{G}(T_0+g) \right) = O\left(Bhd(GT_0+G^2)\right). \]

8.4 KV-cache memory and bandwidth

The KV cache for one layer is

\[ K,V\in\mathbb{R}^{B\times h_{kv}\times T\times d}. \]

The memory per layer is

\[ \mathrm{Mem}_{\mathrm{KV,layer}} = 2BTh_{kv}ds = 2BTH_{kv}s. \]

Across \(L\) layers,

\[ \mathrm{Mem}_{\mathrm{KV,total}} = 2LBTH_{kv}s. \]

At each decode step, the model streams the relevant cached keys and values. The approximate read volume across all layers is

\[ 2LBTh_{kv}ds. \]

This is why decode is often memory-bandwidth-bound even though the arithmetic cost is linear in \(T\) per generated token.

9. FlashAttention

9.1 Exact operator, IO-aware execution

FlashAttention computes exact causal softmax attention:

\[ O = \mathrm{softmax} \left( \frac{QK^\top}{\sqrt d} + M_{\mathrm{causal}} \right)V. \]

It changes the execution schedule, not the sequence operator. A naive implementation writes the \(S\) and \(P\) tensors to GPU HBM. FlashAttention tiles \(Q\), \(K\), and \(V\), moves tiles through GPU SRAM, and maintains the softmax state without materializing \(S\) and \(P\) in HBM.

9.2 Online softmax

For a row \(i\), the attention output is

\[ o_i = \frac{\sum_j \exp(s_{ij})v_j}{\sum_j \exp(s_{ij})}. \]

For numerical stability, define

\[ m_i=\max_j s_{ij}. \]

Then

\[ o_i = \frac{\sum_j \exp(s_{ij}-m_i)v_j}{\sum_j \exp(s_{ij}-m_i)}. \]

FlashAttention processes \(K,V\) blocks sequentially for a block of queries. It updates the row maximum, denominator, and partial numerator in SRAM. The final output is exact, but the quadratic score and probability matrices are not written to HBM.

9.3 Runtime and memory implications

Quantity Naive exact attention FlashAttention
Output Exact causal softmax attention Exact causal softmax attention
Attention-core work \(O(BhT^2d)\) \(O(BhT^2d)\)
Materialized \(S,P\) \(O(BhT^2)\) elements in HBM Avoided
Memory-system effect High HBM traffic Improved SRAM reuse and reduced HBM transactions

FlashAttention reduces IO cost and activation-memory pressure. It does not remove the quadratic number of query-key interactions.

10. Linear attention

10.1 Feature-map formulation

Linear attention replaces the softmax attention kernel with a feature-map similarity:

\[ \mathrm{sim}(q,k)=\phi(q)^\top\phi(k), \]

where

\[ \phi(q),\phi(k)\in\mathbb{R}^{r}. \]

For causal linear attention,

\[ o_t = \frac{ \phi(q_t)^\top \left( \sum_{u\leq t}\phi(k_u)v_u^\top \right) }{ \phi(q_t)^\top \left( \sum_{u\leq t}\phi(k_u) \right) }. \]

Define the recurrent states

\[ A_t= \sum_{u\leq t}\phi(k_u)v_u^\top \in\mathbb{R}^{r\times d}, \]

and

\[ b_t= \sum_{u\leq t}\phi(k_u) \in\mathbb{R}^{r}. \]

Then

\[ o_t = \frac{\phi(q_t)^\top A_t}{\phi(q_t)^\top b_t}. \]

10.2 Runtime and memory

For each token and each head, updating \(A_t\) costs \(O(rd)\), updating \(b_t\) costs \(O(r)\), and producing \(o_t\) costs \(O(rd)\). The leading cost over \(T\) tokens is

\[ O(Trd) \]

per head, or

\[ O(BhTrd) \]

with batch and heads.

The recurrent state per batch element and per head is

\[ A_t\in\mathbb{R}^{r\times d}, \qquad b_t\in\mathbb{R}^{r}. \]

This avoids a \(T\times T\) attention matrix, but it changes the sequence operator. The quality and stability depend on the feature map \(\phi\), feature dimension \(r\), normalization, and training dynamics.

11. Hyena-style long-convolution sequence mixing

Hyena is an attention alternative, not an attention implementation. It uses long convolutions and data-controlled gating to obtain subquadratic sequence mixing.

A causal long convolution can be written as

\[ y_t = \sum_{u\leq t} k_{t-u}x_u. \]

Hyena-style architectures combine such long filters with multiplicative gates:

\[ Y = G(X)\odot \mathrm{LongConv}(X). \]

The convolution supplies long-range mixing. The gate supplies input dependence. Depending on the filter parameterization and implementation, the sequence cost can be subquadratic, for example through FFT-based or structured convolution methods.

12. Mamba-style selective state-space sequence modeling

Mamba is a selective state-space sequence model. It is not an attention mechanism. Its relevance here is that it targets long-context sequence modeling with a recurrent state rather than an explicit KV cache.

A simplified state-space recurrence is

\[ x_{t+1}=A x_t+B u_t, \]

\[ y_t=C x_t+D u_t. \]

Selective state-space models make parts of the recurrence input-dependent. A schematic form is

\[ x_{t+1}=A(u_t)x_t+B(u_t)u_t, \]

\[ y_t=C(u_t)x_t. \]

This expression is schematic. It shows the modeling distinction: the update depends on the current token, allowing the state to selectively propagate or forget information.

13. Hybrid sequence models

Hybrid models combine exact attention with lower-cost sequence mixers. The premise is that full pairwise content-addressing may not be required in every layer, every head, or every token range.

Suppose a model has \(L\) sequence-mixing layers:

\[ L=L_a+L_l+L_s, \]

where \(L_a\) layers use exact attention, \(L_l\) layers use linear or local attention, and \(L_s\) layers use state-space or convolutional mixers.

A representative prefill cost is

\[ O(L_aBhT^2d) + O(L_lBhTrd) + O(L_sBTc_s). \]

This expression makes the allocation problem explicit: exact attention layers preserve pairwise token interaction, while cheaper layers reduce the average sequence-mixing cost.

14. Summary

Self-attention is content-dependent prefix aggregation. Each position computes weights over allowed source positions and uses those weights to aggregate source values. In autoregressive LLMs, the allowed source positions are the current token and the previous tokens.

Exact causal attention has prefill sequence-mixing cost

\[ O(BhT^2d). \]

A naive implementation may also materialize score and probability tensors of size

\[ O(BhT^2). \]

In decode, exact attention has per-token cost

\[ O(BhTd), \]

and a total KV-cache memory footprint

\[ 2LBTH_{kv}s. \]

FlashAttention preserves exact softmax attention and reduces HBM traffic by using SRAM-resident tiles and online softmax. Linear attention replaces softmax attention with an associative feature-map formulation. Hyena and Mamba replace attention with long-convolution or selective state-space sequence mixers. Hybrid models combine these mechanisms to choose how much exact pairwise interaction is used.

References

  1. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017. Available: NeurIPS proceedings; arXiv: 1706.03762.
  2. A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Transformers are RNNs: Fast autoregressive transformers with linear attention,” in Proceedings of the 37th International Conference on Machine Learning, PMLR, vol. 119, pp. 5156–5165, 2020. Available: PMLR; arXiv: 2006.16236.
  3. K. Choromanski et al., “Rethinking attention with performers,” in International Conference on Learning Representations, 2021. Available: OpenReview; arXiv: 2009.14794.
  4. T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, “FlashAttention: Fast and memory-efficient exact attention with IO-awareness,” in Advances in Neural Information Processing Systems, vol. 35, 2022. Available: NeurIPS proceedings; arXiv: 2205.14135.
  5. T. Dao, “FlashAttention-2: Faster attention with better parallelism and work partitioning,” in International Conference on Learning Representations, 2024. Available: OpenReview; arXiv: 2307.08691.
  6. M. Poli et al., “Hyena hierarchy: Towards larger convolutional language models,” in Proceedings of the 40th International Conference on Machine Learning, PMLR, vol. 202, pp. 28043–28078, 2023. arXiv: 2302.10866.
  7. A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” in Conference on Language Modeling, 2024. Available: OpenReview; arXiv: 2312.00752.