The worked design has 1,055,231,744 parameters, a 65,536-token vocabulary, an 8,192-token context window, and a pretraining budget of 27.54B input-token exposures.
Many of us use language models every day. We ask for an explanation, complain when the answer is too long, ask again, and somehow expect the machine not to take it personally. Running inference is now ordinary. Training the model that makes inference possible is still mostly hidden behind model names, API calls, and a progress bar somebody else watched.
This article opens that box. We will ask how text becomes tokens, how tokens become vectors, how attention lets one position use another, why prediction error becomes a gradient, why several GPU workers must agree on an update, why a pretrained model still needs fine-tuning, and why generation has different bottlenecks from training. MyLLM gives us a concrete model near one billion parameters, large enough to expose real constraints and small enough to calculate.
The article develops a connected account of language-model training from first principles. By the end, we should be able to reason about the following chain:
\[ \text{documents} \longrightarrow \text{tokens} \longrightarrow \text{training sequences} \longrightarrow \text{next-token loss} \longrightarrow \text{parameter updates} \longrightarrow \text{aligned responses} \longrightarrow \text{inference}. \]
Each arrow changes the constraints on the next stage. Vocabulary size affects the size of the embedding matrix and the loss we should expect at random initialization. Context length affects both attention computation and KV-cache memory. Global batch size affects gradient noise and learning-rate selection. Quantization reduces the memory needed for weights at deployment but does not automatically shrink the KV cache. The point of an end-to-end treatment is to make these connections explicit.
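A few of these dependencies can be checked with arithmetic. The sketch below assumes that a freshly initialized model predicts something close to a uniform distribution over the vocabulary, so its cross-entropy starts near \(\ln V\). The vocabulary size and context length come from the worked design; `D_MODEL` is a hypothetical width chosen only to make the embedding count concrete.

```python
import math

VOCAB_SIZE = 65_536    # from the worked design
CONTEXT_LEN = 8_192    # from the worked design
D_MODEL = 2_048        # hypothetical model width, for illustration only


def uniform_init_loss(vocab_size: int) -> float:
    """Cross-entropy in nats of a uniform prediction over the vocabulary."""
    return math.log(vocab_size)


def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of entries in a vocab_size x d_model embedding matrix."""
    return vocab_size * d_model


def attention_score_entries(context_len: int) -> int:
    """Entries in one head's full score matrix for one sequence (before masking)."""
    return context_len * context_len


if __name__ == "__main__":
    print(f"expected loss at init: {uniform_init_loss(VOCAB_SIZE):.3f} nats")
    print(f"embedding parameters:  {embedding_params(VOCAB_SIZE, D_MODEL):,}")
    print(f"score-matrix entries:  {attention_score_entries(CONTEXT_LEN):,}")
```

Under these assumptions, a first logged training loss far from about 11.09 nats would be a reason to check the initialization or the loss code before looking anywhere else.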
| Chapter | Primary question | What we learn to derive |
|---|---|---|
| Data & Parameters | What should the model learn from, and what architecture fits the objective? | Corpus statistics, tokenizer design, context length, parameter budget, Transformer dimensions, and scaling-law token targets. |
| Training | What numerical operation turns a sequence into a learning signal? | The autoregressive factorization, forward pass, cross-entropy, perplexity, gradients, and parameter updates. |
| Pretraining | How can the mathematical objective be optimized efficiently across 8 GPUs? | Data parallelism, global tokens per step, memory constraints, schedules, throughput, and convergence diagnostics. |
| Fine-tuning | How can next-token prediction be transformed into useful assistant behavior? | Assistant-only supervised loss, verifiable rewards, group-relative advantages, KL control, and preference objectives. |
| Inference | What happens after training when the model must generate one token at a time? | Prefill, decode, KV-cache size, memory bandwidth, checkpoint formats, quantization, and local deployment. |
More parameters provide more representational capacity, but they also increase training computation, optimizer state, checkpoint size, and inference bandwidth. We therefore ask not merely whether a larger model is better, but whether its added capacity is supported by enough data and compute.
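The balance between capacity, data, and compute can be estimated roughly from the two headline numbers of the worked design. The sketch below uses the common rule of thumb that training costs about \(6ND\) floating-point operations for \(N\) parameters and \(D\) tokens (roughly \(2N\) per token for the forward pass and \(4N\) for the backward pass), which ignores the attention term that grows with context length. Treating the 27.54B input-token exposures as \(D\) is an assumption made here for illustration.

```python
N_PARAMS = 1_055_231_744   # from the worked design
N_TOKENS = 27.54e9         # input-token exposures, treated here as D (assumption)


def tokens_per_parameter(n_tokens: float, n_params: int) -> float:
    """How many training tokens each parameter 'sees' on average."""
    return n_tokens / n_params


def approx_training_flops(n_params: int, n_tokens: float) -> float:
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


if __name__ == "__main__":
    ratio = tokens_per_parameter(N_TOKENS, N_PARAMS)
    flops = approx_training_flops(N_PARAMS, N_TOKENS)
    print(f"tokens per parameter: {ratio:.1f}")
    print(f"approx training FLOPs: {flops:.3e}")
```

With these inputs the ratio comes out near 26 tokens per parameter and the estimated cost near \(1.74 \times 10^{20}\) FLOPs. Both figures are order-of-magnitude guides rather than measurements.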
A model does not learn an abstract language in isolation; it approximates the distribution represented by its corpus. Dataset composition, cleaning, tokenization, and sequence packing therefore define the learning problem as strongly as the architecture does.
The loss function supplies a local direction, not a guarantee of useful learning. Batch size, learning rate, warmup, weight decay, gradient clipping, and the number of exposed tokens determine whether billions of small updates form a stable trajectory.
Training and inference reward different properties. Training benefits from parallel processing across all sequence positions; autoregressive decoding produces one new token at a time and is often limited by memory bandwidth, the rate at which weights and KV-cache entries can be read from high-bandwidth memory (HBM). A good design anticipates this asymmetry before training begins.
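A back-of-the-envelope estimate shows why decode tends to be bandwidth-bound. The sketch below assumes batch size 1 and that every decode step reads all weights and the whole KV cache once. Only the parameter count and context length come from the worked design. The layer count, KV-head count, head dimension, 2-byte precision, and bandwidth figure are hypothetical placeholders, not the article's architecture or any particular accelerator.

```python
N_PARAMS = 1_055_231_744   # from the worked design
CONTEXT_LEN = 8_192        # from the worked design

# Hypothetical values, for illustration only.
N_LAYERS = 24
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2        # e.g. a 16-bit floating-point format
HBM_BANDWIDTH = 1.0e12     # bytes per second


def weight_bytes(n_params: int, bytes_per_value: int) -> int:
    """Memory needed to hold all weights at the given precision."""
    return n_params * bytes_per_value


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int, batch: int = 1) -> int:
    """KV-cache size; the leading 2 counts keys and values separately."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch


def decode_tokens_per_second_bound(bytes_read_per_token: int, bandwidth: float) -> float:
    """Upper bound on decode rate if each token requires reading these bytes once."""
    return bandwidth / bytes_read_per_token


if __name__ == "__main__":
    w = weight_bytes(N_PARAMS, BYTES_PER_VALUE)
    kv = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT_LEN, BYTES_PER_VALUE)
    bound = decode_tokens_per_second_bound(w + kv, HBM_BANDWIDTH)
    print(f"weights:  {w / 1e9:.2f} GB")
    print(f"KV cache: {kv / 1e9:.2f} GB at full context")
    print(f"decode bound: {bound:.0f} tokens/s")
```

The weight term depends on precision, while the cache term scales with context length and batch size. Shrinking the weights alone therefore leaves the cache term unchanged.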
The chapters are cumulative. Data and architecture establish the symbols and dimensions used by the training mathematics. The training objective then makes the distributed pretraining measurements interpretable. Fine-tuning extends the same probabilistic model with different supervision, and inference shows how the architectural choices reappear as concrete memory and latency costs.
A useful reading habit is to pause at every numerical choice and ask three questions: what quantity constrains it, what would become more expensive if it increased, and what capability might be lost if it decreased? That habit is the transferable skill this article is designed to develop.