How to train your first LLM?

In brief

The worked design, MyLLM, has 1,055,231,744 parameters, a 65,536-token vocabulary, and an 8,192-token context window, and it is exposed to 27.54B input tokens during pretraining.
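
As a quick sanity check on these headline numbers, the sketch below derives the ratio of training tokens to parameters. It treats "27.54B" as exactly 27.54e9, which is a rounding assumption, and the variable names are illustrative rather than taken from any MyLLM code.

```python
# Headline figures quoted in the summary above.
n_params = 1_055_231_744      # trainable parameters
vocab_size = 65_536           # tokenizer vocabulary
context_len = 8_192           # tokens per context window
train_tokens = 27.54e9        # assumption: "27.54B" taken as exactly 27.54e9

# How many training tokens each parameter "sees" on average.
tokens_per_param = train_tokens / n_params

# Number of full-length sequences, assuming every sequence is packed to 8,192 tokens.
sequences_seen = train_tokens / context_len

print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"full-length sequences: {sequences_seen:,.0f}")
print(f"vocabulary = 2**{vocab_size.bit_length() - 1}, context = 2**{context_len.bit_length() - 1}")
```

The ratio comes out near 26 tokens per parameter. That is the kind of number worth recalculating whenever the parameter or token budget changes.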

Table of contents
  1. Why train a language model?
  2. What this article explains
  3. The complete learning path
  4. The central design questions
  5. How to use this article
  6. References and licensing

1. Why train a language model?

Many of us use language models every day. We ask for an explanation, complain when the answer is too long, ask again, and somehow expect the machine not to take it personally. Running inference is now ordinary. Training the model that makes inference possible is still mostly hidden behind model names, API calls, and a progress bar somebody else watched.

This article opens that box. We will ask how text becomes tokens, how tokens become vectors, how attention lets one position use another, why prediction error becomes a gradient, why several GPU workers must agree on an update, why a pretrained model still needs fine-tuning, and why generation has different bottlenecks from training. MyLLM gives us a concrete model with roughly one billion parameters, large enough to expose real constraints and small enough to calculate by hand.

The objective is not to memorize one configuration. It is to learn how to derive a defensible configuration from a task, a dataset, a compute budget, and an intended inference environment.

2. What this article explains

The article develops a connected account of language-model training from first principles. By the end, we should be able to reason about the following chain:

\[ \text{documents} \longrightarrow \text{tokens} \longrightarrow \text{training sequences} \longrightarrow \text{next-token loss} \longrightarrow \text{parameter updates} \longrightarrow \text{aligned responses} \longrightarrow \text{inference}. \]

Each arrow changes the constraints on the next stage. Vocabulary size affects the size of the embedding matrix and the loss at random initialization. Context length affects both attention computation and KV-cache memory. Global batch size affects gradient noise and learning-rate selection. Weight quantization shrinks the memory needed to deploy the model but does not automatically shrink the KV cache. The point of an end-to-end treatment is to make these connections explicit.
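
The link between vocabulary size and the starting loss can be made concrete. The sketch below assumes that a freshly initialized model produces near-uniform predictions over the vocabulary, so its cross-entropy is about ln(V) nats per token. The hidden width `d_model` is a hypothetical value chosen only to show how the embedding matrix scales.

```python
import math

vocab_size = 65_536   # from the worked design
d_model = 2_048       # hypothetical hidden width, for illustration only

# A near-uniform guess over V tokens has cross-entropy ln(V) nats per token.
init_loss_nats = math.log(vocab_size)

# Perplexity is exp(loss), so a uniform guess has perplexity equal to V.
init_perplexity = math.exp(init_loss_nats)

# The embedding matrix holds one d_model-sized row per vocabulary entry.
embedding_params = vocab_size * d_model

print(f"loss at random initialization ~ {init_loss_nats:.2f} nats")
print(f"perplexity at random initialization ~ {init_perplexity:,.0f}")
print(f"embedding parameters at d_model={d_model}: {embedding_params:,}")
```

A first logged loss far above ln(V) usually points to an initialization or data-pipeline problem rather than to a model that has simply not learned yet.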

3. The complete learning path

| Chapter | Primary question | What we learn to derive |
| --- | --- | --- |
| Data & Parameters | What should the model learn from, and what architecture fits the objective? | Corpus statistics, tokenizer design, context length, parameter budget, Transformer dimensions, and scaling-law token targets. |
| Training | What numerical operation turns a sequence into a learning signal? | The autoregressive factorization, forward pass, cross-entropy, perplexity, gradients, and parameter updates. |
| Pretraining | How can the mathematical objective be optimized efficiently across 8 GPUs? | Data parallelism, global tokens per step, memory constraints, schedules, throughput, and convergence diagnostics. |
| Fine-tuning | How can next-token prediction be transformed into useful assistant behavior? | Assistant-only supervised loss, verifiable rewards, group-relative advantages, KL control, and preference objectives. |
| Inference | What happens after training when the model must generate one token at a time? | Prefill, decode, KV-cache size, memory bandwidth, checkpoint formats, quantization, and local deployment. |
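
The pretraining row mentions global tokens per step across 8 GPUs. A minimal sketch of that bookkeeping follows. Only the GPU count, context length, and token budget come from the article; the micro-batch size and gradient-accumulation count are hypothetical placeholders, not MyLLM's actual settings.

```python
num_gpus = 8                 # from the pretraining question above
seq_len = 8_192              # context length of the worked design
train_tokens = 27.54e9       # assumption: "27.54B" taken as exactly 27.54e9

micro_batch_per_gpu = 4      # hypothetical: sequences per GPU per forward/backward pass
grad_accum_steps = 4         # hypothetical: micro-batches accumulated before each update

# Every optimizer step consumes the sequences from all GPUs and all accumulation rounds.
sequences_per_step = num_gpus * micro_batch_per_gpu * grad_accum_steps
global_tokens_per_step = sequences_per_step * seq_len

# Number of parameter updates needed to consume the whole token budget.
optimizer_steps = train_tokens / global_tokens_per_step

print(f"sequences per step: {sequences_per_step}")
print(f"global tokens per step: {global_tokens_per_step:,}")
print(f"optimizer steps for the full budget: {optimizer_steps:,.0f}")
```

Doubling the accumulation count doubles the tokens per step and halves the number of updates, which is why batch size and learning rate have to be chosen together.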

4. The central design questions

4.1 Capacity

More parameters provide more representational capacity, but they also increase training computation, optimizer state, checkpoint size, and inference bandwidth. We therefore ask not merely whether a larger model is better, but whether its added capacity is supported by enough data and compute.
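
To put rough numbers on these costs, the sketch below uses the widely quoted approximation of about 6 floating-point operations per parameter per training token, together with one common mixed-precision AdamW accounting of 16 bytes per parameter. Both are stated assumptions, not measurements of MyLLM, and the approximation ignores attention-specific compute and activation memory.

```python
n_params = 1_055_231_744      # from the worked design
train_tokens = 27.54e9        # assumption: "27.54B" taken as exactly 27.54e9

# Rule of thumb: ~6 FLOPs per parameter per token (forward plus backward pass).
train_flops = 6 * n_params * train_tokens

# Assumed bytes per parameter in a mixed-precision AdamW setup.
bytes_bf16_weights = 2
bytes_bf16_grads = 2
bytes_fp32_master = 4         # full-precision copy of the weights
bytes_adam_moments = 8        # two fp32 moment estimates

checkpoint_gb = n_params * bytes_bf16_weights / 1e9
train_state_gb = n_params * (
    bytes_bf16_weights + bytes_bf16_grads + bytes_fp32_master + bytes_adam_moments
) / 1e9

print(f"approximate training compute: {train_flops:.2e} FLOPs")
print(f"bf16 weights alone: {checkpoint_gb:.2f} GB")
print(f"weights + gradients + optimizer state: {train_state_gb:.2f} GB")
```

Under these assumptions the training state is eight times larger than the bf16 weights, before any activations are stored.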

4.2 Data

A model does not learn an abstract language in isolation; it approximates the distribution represented by its corpus. Dataset composition, cleaning, tokenization, and sequence packing therefore define the learning problem as strongly as the architecture does.
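
Sequence packing and the next-token shift are easier to see on a toy example. The sketch below is one simple packing scheme, assumed here for illustration: documents are joined with an end-of-sequence token and cut into fixed-length chunks. The token IDs, the `eos_id` value, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple

def pack_sequences(docs: Iterable[List[int]], seq_len: int, eos_id: int) -> List[List[int]]:
    """Join tokenized documents with eos_id and cut chunks of seq_len + 1 tokens."""
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)
    chunk = seq_len + 1  # one extra token so inputs and targets both span seq_len positions
    # Stride by seq_len so the last token of one chunk is the first input of the next.
    return [stream[i:i + chunk] for i in range(0, len(stream) - chunk + 1, seq_len)]

def split_inputs_targets(chunk: List[int]) -> Tuple[List[int], List[int]]:
    """Targets are the inputs shifted left by one position."""
    return chunk[:-1], chunk[1:]

toy_docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
chunks = pack_sequences(toy_docs, seq_len=4, eos_id=0)
for chunk in chunks:
    inputs, targets = split_inputs_targets(chunk)
    print(inputs, "->", targets)
```

This version drops the incomplete tail of the stream and lets sequences cross document boundaries. Real pipelines may handle both differently, for example by masking attention across boundaries.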

4.3 Optimization

The loss function supplies a local direction, not a guarantee of useful learning. Batch size, learning rate, warmup, weight decay, gradient clipping, and the number of exposed tokens determine whether billions of small updates form a stable trajectory.
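
One common way to combine warmup with a decaying learning rate is linear warmup followed by cosine decay. The sketch below shows that schedule as an example of the idea, not as MyLLM's actual schedule. The step counts and learning rates are hypothetical.

```python
import math

def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               peak_lr: float, min_lr: float) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so early, noisy gradients produce small updates.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical settings, for illustration only.
total_steps, warmup_steps = 26_000, 1_000
peak_lr, min_lr = 3e-4, 3e-5

for step in (0, 500, 1_000, 13_500, 26_000):
    print(step, f"{lr_at_step(step, total_steps, warmup_steps, peak_lr, min_lr):.2e}")
```

The schedule depends on the total step count, so changing the global batch size or the token budget changes the learning rate at every step.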

4.4 Deployment

Training and inference reward different properties. Training processes all sequence positions in parallel; autoregressive decoding produces one new token at a time and is often limited by memory bandwidth, the rate at which weights and KV-cache entries can be read from high-bandwidth memory (HBM). A good design anticipates this asymmetry before training begins.
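
A back-of-the-envelope estimate shows why decode tends to be bandwidth-bound. The sketch below assumes batch size 1 and a decode step that must stream every weight and the whole KV cache from memory once. The layer count, KV-head count, head dimension, and bandwidth figure are hypothetical values for illustration, not MyLLM's configuration or any specific GPU.

```python
n_params = 1_055_231_744      # from the worked design
context_len = 8_192           # from the worked design

# Hypothetical architecture and hardware values, for illustration only.
n_layers = 24
n_kv_heads = 8
head_dim = 128
cache_bytes_per_value = 2     # bf16 cache entries
hbm_bandwidth = 1.0e12        # bytes per second

def kv_cache_bytes(tokens: int) -> int:
    # Factor 2 covers keys and values, stored per layer and per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * cache_bytes_per_value * tokens

def decode_seconds_per_token(weight_bytes_per_param: float, tokens_in_cache: int) -> float:
    # Bandwidth-only lower bound: each step reads all weights plus the cache.
    bytes_read = n_params * weight_bytes_per_param + kv_cache_bytes(tokens_in_cache)
    return bytes_read / hbm_bandwidth

for label, bytes_per_param in [("bf16 weights", 2.0), ("4-bit weights", 0.5)]:
    t = decode_seconds_per_token(bytes_per_param, context_len)
    cache_gb = kv_cache_bytes(context_len) / 1e9
    print(f"{label}: cache {cache_gb:.2f} GB, at least {t * 1e3:.2f} ms per token")
```

In this estimate, quantizing the weights shortens each decode step, but the cache term stays the same size unless the cache itself is stored at lower precision.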

Throughout the article, fixed MyLLM values provide a worked example. They should be read as quantities to explain and recalculate, not as universal defaults for every language model.

5. How to use this article

The chapters are cumulative. Data and architecture establish the symbols and dimensions used by the training mathematics. The training objective then makes the distributed pretraining measurements interpretable. Fine-tuning extends the same probabilistic model with different supervision, and inference shows how the architectural choices reappear as concrete memory and latency costs.

A useful reading habit is to pause at every numerical choice and ask three questions: what quantity constrains it, what would become more expensive if it increased, and what capability might be lost if it decreased? That habit is the transferable skill this article is designed to develop.

6. References and licensing

Article prose and original figures are © 2026, all rights reserved. External papers, datasets, software, trademarks, and documentation remain under their respective owners and licenses. Citation or discussion does not relicense third-party material.