A Beginner's Guide to Tensor Parallelism + Expert Parallelism in MoE LLM Inference

1. Motivation

Autoregressive large language models are dominated by dense linear algebra in every decoder block. In a dense transformer, every token passes through every parameter of the attention projections and every parameter of the MLP sublayer. Scaling the dense model therefore increases both model capacity and per-token compute.

This is powerful, but expensive. If a dense model has \(N_{\mathrm{params}}\) parameters, then, ignoring embeddings and normalization, the amount of work per generated token grows roughly with \(N_{\mathrm{params}}\). A larger dense model can improve quality, but every token must pay for the whole model.

1.1 Why Dense Models Are Not Enough

Dense models are simple and hardware-friendly, but they have a basic scaling bottleneck:

\[ \text{capacity per token} = \text{parameters evaluated per token}. \]

If we make the MLPs larger, the model becomes more expressive, but every token becomes more expensive. If we add more layers, the model becomes deeper, but every token must traverse those layers. For inference, this hurts latency, throughput, memory footprint, and serving cost.

Mixture-of-Experts, or MoE, changes this tradeoff. Instead of using one dense MLP for every token, an MoE layer contains \(E\) expert MLPs and a router. For each token, the router selects only \(k\) experts, where typically

\[ k \ll E. \]

The model can therefore increase total parameter capacity by increasing \(E\), while keeping per-token expert compute controlled by \(k\).

For fixed \(k\), increasing the number of experts \(E\) increases total model capacity, but does not increase the number of experts evaluated by each token. This is the main reason MoE is attractive for very large LLMs.

1.2 Why Tensor Parallelism and Expert Parallelism Are Used Together

Tensor parallelism, or TP, partitions large dense matrix multiplications across GPUs. It is useful for attention projections, output projections, and dense MLP projections.

Expert parallelism, or EP, partitions experts across GPUs. It is useful when the MoE layer has many experts and no single GPU should store all of them.

In this article, we study a single group of \(p\) GPUs. Dense attention is tensor-parallel across these \(p\) GPUs. The MoE experts are expert-parallel across the same \(p\) GPUs:

\[ E \bmod p = 0, \qquad E_p = \frac{E}{p}. \]

Each GPU stores exactly \(E_p\) experts.

This article uses TP for the dense attention part and EP for the MoE expert part over the same \(p\) GPUs. The experts themselves are not additionally tensor-parallel. If each expert is also tensor-parallel, then one should introduce two separate degrees, for example \(t\) for TP and \(e\) for EP, with total GPUs \(p=te\).

2. Notation

SymbolDefinition
\(B\)Global batch size: number of independent autoregressive sequences.
\(T\)Context length or sequence length.
\(T_q\)Query length. In decode, \(T_q=1\).
\(N_t\)Number of tokens entering an MoE layer. In prefill, \(N_t=BT\). In decode, \(N_t=B\).
\(H\)Residual-stream hidden dimension.
\(I\)Intermediate dimension of one dense MLP or one expert MLP.
\(L\)Number of decoder blocks.
\(\ell\)Decoder-block index.
\(p\)Number of GPUs in the TP+EP group.
\(r\)GPU/rank index, \(r\in\{1,\dots,p\}\).
\(E\)Total number of experts in one MoE layer.
\(E_p\)Experts stored per GPU, \(E_p=E/p\).
\(e\)Expert index, \(e\in\{1,\dots,E\}\).
\(k\)Top-\(k\) routing degree: number of experts selected per token.
\(\gamma\)Capacity or padding factor for expert buffers. If there is no padding, \(\gamma=1\).
\(G\)Router logits.
\(R(x)\)Set of selected experts for token \(x\).
\(\alpha_{i,e}\)Router combine weight for token \(i\) and expert \(e\).
\(f_e\)Expert MLP function for expert \(e\).
\(X_\ell\)Residual-stream activation entering decoder block \(\ell\).
\(\widetilde{X}_\ell\)Residual stream after self-attention.
\(Y\)Self-attention output after the output projection.
\(Z\)MoE output.
\(Q,K,V\)Query, key, and value tensors.
\(S\)Attention score tensor before softmax.
\(P\)Attention probability tensor after softmax.
\(O\)Self-attention output before the output projection.
\(h\)Number of query heads.
\(h_{kv}\)Number of key/value heads.
\(d\)Per-head dimension.
\(H_q\)Total query projection dimension, \(H_q=hd\).
\(H_{kv}\)Total key/value projection dimension, \(H_{kv}=h_{kv}d\).
\(s\)Bytes per scalar element. For BF16 or FP16, \(s=2\).
\(M_H\)Number of scalar elements in one full hidden activation tensor, \(M_H=BTH=N_tH\).
\(M_{KV}\)Number of scalar elements in one full key or value tensor, \(M_{KV}=BTH_{kv}\).

3. Communication Model

3.1 Assumptions

The communication formulas in this article count bytes communicated per rank. They ignore latency, kernel launch overhead, network contention, topology effects, routing metadata, token indices, and router combine weights.

We assume bandwidth-optimal collective implementations:

For small messages, implementations may use recursive-doubling, recursive-halving, tree-based, or topology-aware algorithms to reduce latency. Those choices change the latency term and step count, but the volume formulas below are the bandwidth model used for the rest of the article.

The model is a payload-volume model, not an end-to-end performance model. In practice, AllToAll can be slower than its byte count suggests because it creates many peer-to-peer exchanges and is sensitive to load imbalance.

3.2 AllReduce, ReduceScatter, and AllGather

Let \(N\) be the number of scalar elements in the full tensor involved in the collective, and let \(s\) be the number of bytes per scalar.

For AllGather:

\[ C_{AG}(N,p)=\frac{p-1}{p}Ns. \]

For ReduceScatter:

\[ C_{RS}(N,p)=\frac{p-1}{p}Ns. \]

For ring AllReduce:

\[ C_{AR}(N,p)=2\frac{p-1}{p}Ns. \]

The factor of \(2\) appears because ring AllReduce is composed of a ReduceScatter phase and an AllGather phase.

3.3 AllToAll

Expert parallelism uses AllToAll. The most readable way to write AllToAll communication is in terms of the local payload per rank.

Let \(N_{\mathrm{local}}\) be the number of scalar elements initially owned by one rank before the AllToAll. Under balanced traffic, the fraction \((p-1)/p\) goes to remote ranks. Therefore,

\[ C_{A2A}(N_{\mathrm{local}},p) = \frac{p-1}{p}N_{\mathrm{local}}s. \]

If \(N_{\mathrm{global}}\) is the total payload across all ranks, then \(N_{\mathrm{local}}=N_{\mathrm{global}}/p\), so

\[ C_{A2A} = \frac{p-1}{p}\left(\frac{N_{\mathrm{global}}}{p}\right)s. \]

In the rest of the article, we use the local-payload form because it is easier to compare with what each GPU actually sends.

4. Autoregressive Inference Phases

4.1 Prefill

In prefill, the full prompt is processed:

\[ X_\ell\in\mathbb{R}^{B\times T\times H}. \]

The number of MoE input tokens is

\[ N_t=BT. \]

The hidden activation size is

\[ M_H=BTH. \]

4.2 Decode

In decode, the model generates one new token per sequence. The query length is

\[ T_q=1. \]

The current-step residual activation is

\[ X_{\ell,\mathrm{decode}}\in\mathbb{R}^{B\times 1\times H}. \]

The number of MoE input tokens is

\[ N_t=B. \]

The current-step hidden activation size is

\[ M_H=BH. \]

The KV cache for one decoder block is

\[ K,V\in\mathbb{R}^{B\times T\times H_{kv}}, \]

with memory

\[ \mathrm{Mem}_{\mathrm{KV,block}}=2BTH_{kv}s. \]

Prefill is sensitive to long-sequence activation and attention compute. Decode has small current-token activations, but KV-cache memory grows with \(B\), \(T\), \(H_{kv}\), and \(L\).

5. Decoder Block with an MoE Sublayer

A pre-norm decoder block with MoE updates the residual stream as

\[ \widetilde{X}_\ell = X_\ell+ \mathrm{SelfAttention}_\ell(\mathrm{RMSNorm}(X_\ell)), \]

\[ X_{\ell+1} = \widetilde{X}_\ell+ \mathrm{MoE}_\ell(\mathrm{RMSNorm}(\widetilde{X}_\ell)). \]

5.1 Self-Attention Sublayer

For normalized input \(X\),

\[ Q=XW_Q,\qquad K=XW_K,\qquad V=XW_V. \]

The masked attention scores are

\[ S=\frac{QK^\top}{\sqrt d}+M_{\mathrm{causal}}. \]

The attention probabilities and attention output are

\[ P=\mathrm{softmax}(S), \qquad O=PV. \]

The self-attention output projection is

\[ Y=OW_O. \]

5.2 MoE Sublayer

Flatten the token dimensions:

\[ X_{\mathrm{tok}}\in\mathbb{R}^{N_t\times H}. \]

The router computes

\[ G=X_{\mathrm{tok}}W_{\mathrm{router}}, \qquad W_{\mathrm{router}}\in\mathbb{R}^{H\times E}. \]

For token \(x_i\), the router selects

\[ R(x_i)=\operatorname{TopK}(G_i,k). \]

The MoE output for token \(i\) is

\[ z_i = \sum_{e\in R(x_i)} \alpha_{i,e} f_e(x_i), \qquad \sum_{e\in R(x_i)}\alpha_{i,e}=1. \]

Each expert is a gated MLP:

\[ f_e(x) = W_{d,e} \left( \phi(xW_{g,e})\odot xW_{u,e} \right), \]

where

\[ W_{g,e},W_{u,e}\in\mathbb{R}^{H\times I}, \qquad W_{d,e}\in\mathbb{R}^{I\times H}. \]

6. Tensor-Parallel Attention on \(p\) GPUs

The dense self-attention part uses classic tensor parallelism. The residual activation is replicated across the \(p\) tensor-parallel ranks:

\[ X_r=X. \]

The QKV projections are column-parallel:

\[ W_Q=[W_{Q,1},\dots,W_{Q,p}], \quad W_K=[W_{K,1},\dots,W_{K,p}], \quad W_V=[W_{V,1},\dots,W_{V,p}]. \]

Rank \(r\) computes

\[ Q_r=XW_{Q,r}, \qquad K_r=XW_{K,r}, \qquad V_r=XW_{V,r}. \]

Each rank owns a subset of attention heads, so the attention core is local:

\[ O_r = \mathrm{softmax} \left( \frac{Q_rK_r^\top}{\sqrt d} + M_{\mathrm{causal}} \right)V_r. \]

The output projection is row-parallel:

\[ W_O= \begin{bmatrix} W_{O,1}\\ \vdots\\ W_{O,p} \end{bmatrix}. \]

Rank \(r\) computes a partial projected output:

\[ Y_r=O_rW_{O,r}. \]

The full self-attention output is

\[ Y=\sum_{r=1}^{p}Y_r = \mathrm{AllReduce}_r(Y_r). \]

Thus the attention-output communication per rank is

\[ C_{\mathrm{attn}} = C_{AR}(M_H,p) = 2\frac{p-1}{p}M_Hs. \]

7. Expert-Parallel MoE on \(p\) GPUs

7.1 Expert Placement

Assume

\[ E \bmod p=0. \]

Each rank stores

\[ E_p=\frac{E}{p} \]

experts. One simple contiguous placement is

\[ \mathcal{E}_r = \{(r-1)E_p+1,\dots,rE_p\}. \]

The rank that owns expert \(e\) is

\[ \operatorname{owner}(e) = 1+ \left\lfloor \frac{e-1}{E_p} \right\rfloor. \]

7.2 Token Ownership Before Dispatch

After tensor-parallel attention, the activation is replicated. Before expert execution, each token should have a unique source rank; otherwise the expert work would be duplicated on every TP rank.

We assign a deterministic token shard to rank \(r\):

\[ X_{\mathrm{tok}}^{(r)} \in \mathbb{R}^{(N_t/p)\times H}. \]

This slicing is local because each rank already has the replicated activation.

7.3 Dispatch

For every local token \(x_i\), the router selects \(k\) experts. Each selected expert creates one token-expert assignment. Globally, the number of token-expert assignments is

\[ kN_t. \]

The local dispatch payload on each rank is

\[ N_{\mathrm{dispatch,local}} = \gamma k\frac{N_t}{p}H = \gamma k\frac{M_H}{p}. \]

Here \(\gamma\geq 1\) models capacity padding or load-balancing slack. With no padding, \(\gamma=1\).

Dispatch is an AllToAll:

\[ X_{\mathrm{expert}} = \mathrm{AllToAll}(X_{\mathrm{tok}}). \]

The per-rank dispatch communication is

\[ C_{\mathrm{dispatch}} = \frac{p-1}{p} \left( \gamma k\frac{M_H}{p} \right)s. \]

7.4 Local Expert Computation

Under balanced routing, each expert receives approximately

\[ \frac{kN_t}{E} \]

tokens, and each GPU receives approximately

\[ E_p\cdot \frac{kN_t}{E} = \frac{kN_t}{p} \]

token-expert assignments.

Rank \(r\) applies only its local experts:

\[ u_{i,e}=f_e(x_i), \qquad e\in\mathcal{E}_r. \]

The expert compute grows with \(k\), not directly with \(E\). For fixed \(k\), increasing \(E\) increases total expert capacity without increasing the number of expert MLPs evaluated per token.

7.5 Combine

After local expert execution, expert outputs are sent back to the original token owners:

\[ U_{\mathrm{return}} = \mathrm{AllToAll}(U_{\mathrm{expert}}). \]

The local combine payload has the same size as the local dispatch payload:

\[ N_{\mathrm{combine,local}} = \gamma k\frac{M_H}{p}. \]

Thus

\[ C_{\mathrm{combine}} = \frac{p-1}{p} \left( \gamma k\frac{M_H}{p} \right)s. \]

Each source rank combines the returned expert outputs:

\[ z_i = \sum_{e\in R(x_i)} \alpha_{i,e}u_{i,e}. \]

At this point, the MoE output is token-sharded:

\[ Z^{(r)} \in \mathbb{R}^{(N_t/p)\times H}. \]

7.6 Restoring Replicated TP State

Classic TP expects the next block input to be replicated across all TP ranks. Therefore, after the MoE combine, we AllGather the token-sharded MoE output:

\[ Z=\mathrm{AllGather}(Z^{(r)}). \]

The communication cost is

\[ C_{\mathrm{restore}} = C_{AG}(M_H,p) = \frac{p-1}{p}M_Hs. \]

Then each rank can form the replicated residual output:

\[ X_{\ell+1} = \widetilde{X}_\ell+Z. \]

If the implementation keeps token-sharded activations across blocks, this restore AllGather can be moved or avoided. For the classic TP convention used here, it is included explicitly.

8. Communication Analysis

8.1 Per-Layer TP+EP Communication

For one decoder block with tensor-parallel attention and expert-parallel MoE:

Component Collective Per-rank communication volume
QKV projections None \(0\)
Attention core None, assuming heads are partitioned cleanly \(0\)
Attention output projection Ring AllReduce \(2\frac{p-1}{p}M_Hs\)
MoE dispatch Balanced AllToAll \(\frac{p-1}{p}\left(\gamma k\frac{M_H}{p}\right)s\)
Local experts None \(0\)
MoE combine Balanced AllToAll \(\frac{p-1}{p}\left(\gamma k\frac{M_H}{p}\right)s\)
Restore replicated TP state Ring AllGather \(\frac{p-1}{p}M_Hs\)

Therefore, the total per-rank communication volume is

\[ C_{\mathrm{TP+EP}} = 2\frac{p-1}{p}M_Hs + 2\frac{p-1}{p} \left( \gamma k\frac{M_H}{p} \right)s + \frac{p-1}{p}M_Hs. \]

Equivalently,

\[ C_{\mathrm{TP+EP}} = 3\frac{p-1}{p}M_Hs + 2\frac{p-1}{p} \left( \gamma k\frac{M_H}{p} \right)s. \]

8.2 Comparison with Dense TP

A dense transformer block with classic TP has two row-parallel output reductions: one after attention and one after the dense MLP down projection.

Model Attention communication MLP / MoE communication Total per-rank communication
Dense TP block \(2\frac{p-1}{p}M_Hs\) \(2\frac{p-1}{p}M_Hs\) \(4\frac{p-1}{p}M_Hs\)
TP + EP MoE block \(2\frac{p-1}{p}M_Hs\) \(2\frac{p-1}{p}\left(\gamma k\frac{M_H}{p}\right)s+\frac{p-1}{p}M_Hs\) \(3\frac{p-1}{p}M_Hs+2\frac{p-1}{p}\left(\gamma k\frac{M_H}{p}\right)s\)

This comparison is only about communication volume. It does not imply that MoE is always faster. AllToAll can be more latency-sensitive and topology-sensitive than AllReduce, and expert imbalance can create stragglers.

8.3 Prefill vs Decode

In prefill,

\[ N_t=BT, \qquad M_H=BTH. \]

In decode,

\[ N_t=B, \qquad M_H=BH. \]

Thus the MoE dispatch and combine payloads are much smaller during decode than during long-context prefill. However, decode latency may still be sensitive to AllToAll startup costs because the message sizes are small.

9. Memory Analysis

9.1 Expert Parameters

One gated expert has parameter count

\[ P_{\mathrm{expert}} = 3HI, \]

ignoring bias terms.

The full MoE layer has

\[ P_{\mathrm{MoE}} = 3EHI \]

expert parameters. Since experts are partitioned across \(p\) GPUs, the expert parameters per GPU are

\[ P_{\mathrm{MoE,GPU}} = \frac{3EHI}{p}. \]

The number of activated expert parameters per token is

\[ P_{\mathrm{active/token}} = 3kHI. \]

The key MoE scaling property is \[ P_{\mathrm{MoE}}\propto E, \qquad P_{\mathrm{active/token}}\propto k. \] For fixed \(k\), capacity grows with \(E\), but active expert compute does not.

9.2 Activation and Buffer Memory

The replicated residual activation costs

\[ M_Hs=BTHs \]

bytes per GPU in classic TP.

During expert execution, each GPU holds approximately

\[ \frac{\gamma kN_t}{p} \]

token-expert assignments. The expert input buffer therefore costs approximately

\[ \mathrm{Mem}_{\mathrm{expert\ buffer}} = \frac{\gamma kN_tH}{p}s = \frac{\gamma kM_H}{p}s. \]

The output buffer has the same order of memory.

9.3 KV Cache

With TP-sharded attention heads, the per-block KV cache per GPU is approximately

\[ \mathrm{Mem}_{\mathrm{KV,block,GPU}} = \frac{2BTH_{kv}s}{p}, \]

assuming the key/value heads are partitioned evenly across ranks. Across \(L\) decoder blocks:

\[ \mathrm{Mem}_{\mathrm{KV,total,GPU}} = \frac{2LBTH_{kv}s}{p}. \]

10. Practical Notes

Issue Why it matters
Load balance If many tokens route to the same expert, one GPU can become a straggler. Auxiliary losses, expert capacity, token dropping, or expert-choice routing are often used to control this.
AllToAll topology AllToAll stresses the interconnect differently from AllReduce. Good placement across network domains matters.
Small-message decode Decode has small \(N_t\), so latency and launch overhead can dominate even when byte volume is low.
Padding factor \(\gamma\) Capacity padding makes expert kernels regular, but increases communication and buffer memory.
Nested TP inside experts If experts are too large for one GPU, introduce separate TP and EP degrees. The formulas in this article intentionally avoid that case.

11. Summary

Tensor parallelism and expert parallelism solve different scaling problems. Tensor parallelism partitions large dense matrix multiplications. Expert parallelism partitions a large set of experts across GPUs.

For \(p\) GPUs and \(E\) experts with \(E\bmod p=0\), each GPU stores \(E/p\) experts. The router sends each token to \(k\) selected experts. Dispatch and combine are AllToAll operations whose local payload scales with \(\gamma kM_H/p\), while the total expert capacity scales with \(E\).

The central MoE tradeoff is "more parameters without evaluating all parameters per token."

The cost is extra routing complexity, AllToAll communication, load-balancing constraints, and more complicated serving systems. This is why MoE models can scale far beyond dense models in parameter count, but require careful distributed-systems design to run efficiently.

References

  1. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, pp. 5998–6008, 2017. Available: NeurIPS proceedings.