How to train your first LLM?


In brief

Start with documents, train a tokenizer, and choose an architecture whose parameter count, context length, and memory cost satisfy explicit constraints.
Train the decoder with a next-token cross-entropy loss at every position; the worked run spreads this over 8 GPUs at about 216k input tokens/s.
Use SFT to teach response behavior, verifiable-reward RL to optimize checkable correctness, and preference learning to represent human choices.
At inference, prefill is highly parallel; batch-one decode is commonly limited by HBM bandwidth and KV-cache traffic.

The worked design has 1,055,231,744 parameters, a 65,536-token vocabulary, an 8,192-token context, and 27.54B input-token pretraining exposures.

Table of contents

Why train a language model?
What this article explains
The complete learning path
The central design questions
How to use this article
References and licensing

1. Why train a language model?

Many of us use language models every day. We ask for an explanation, complain when the answer is too long, ask again, and somehow expect the machine not to take it personally. Running inference is now ordinary. Training the model that makes inference possible is still mostly hidden behind model names, API calls, and a progress bar somebody else watched.

This article opens that box. We will ask how text becomes tokens, how tokens become vectors, how attention lets one position use another, why prediction error becomes a gradient, why several GPU workers must agree on an update, why a pretrained model still needs fine-tuning, and why generation has different bottlenecks from training. MyLLM gives us a concrete model near one billion parameters, large enough to expose real constraints and small enough to calculate.

The objective is not to memorize one configuration. It is to learn how to derive a defensible configuration from a task, a dataset, a compute budget, and an intended inference environment.

2. What this article explains

The article develops a connected account of language-model training from first principles. By the end, we should be able to reason about the following chain:

\[ \text{documents} \longrightarrow \text{tokens} \longrightarrow \text{training sequences} \longrightarrow \text{next-token loss} \longrightarrow \text{parameter updates} \longrightarrow \text{aligned responses} \longrightarrow \text{inference}. \]

Each arrow changes the constraints on the next stage. Vocabulary size sets the size of the embedding matrix and the loss expected at random initialization. Context length affects both attention computation and KV-cache memory. Global batch size affects gradient noise and learning-rate selection. Weight quantization reduces deployment memory but does not by itself shrink the KV cache. The point of an end-to-end treatment is to make these connections explicit.
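
Two of these connections can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the vocabulary size and context length come from the worked design, but `D_MODEL`, `N_LAYERS`, `N_KV_HEADS`, and `HEAD_DIM` are hypothetical placeholders, not MyLLM's actual dimensions. It also assumes that a freshly initialized model predicts roughly uniformly over the vocabulary, so its loss starts near ln V.

```python
import math

def initial_loss(vocab_size: int) -> float:
    """Cross-entropy (in nats) of a uniform guess over the vocabulary."""
    return math.log(vocab_size)

def embedding_params(vocab_size: int, d_model: int) -> int:
    """Number of entries in a vocab_size x d_model embedding matrix."""
    return vocab_size * d_model

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """Bytes needed to store keys and values for every layer and position."""
    # The leading 2 counts one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch_size

# Values stated in the article.
VOCAB_SIZE = 65_536
CONTEXT_LEN = 8_192

# Hypothetical dimensions, chosen only to make the arithmetic concrete.
D_MODEL = 2_048
N_LAYERS = 16
N_KV_HEADS = 8
HEAD_DIM = 128

loss0 = initial_loss(VOCAB_SIZE)
emb = embedding_params(VOCAB_SIZE, D_MODEL)
cache = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT_LEN)

print(f"loss at uniform prediction: {loss0:.4f} nats")
print(f"embedding parameters:       {emb:,}")
print(f"KV cache at full context:   {cache / 2**20:.0f} MiB (2-byte values, batch 1)")
```

Note that `bytes_per_value` describes the cache, not the weights: quantizing the weights leaves this number unchanged unless the cache is quantized separately.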

3. The complete learning path

Chapter	Primary question	What we learn to derive
Data & Parameters	What should the model learn from, and what architecture fits the objective?	Corpus statistics, tokenizer design, context length, parameter budget, Transformer dimensions, and scaling-law token targets.
Training	What numerical operation turns a sequence into a learning signal?	The autoregressive factorization, forward pass, cross-entropy, perplexity, gradients, and parameter updates.
Pretraining	How can the mathematical objective be optimized efficiently across 8 GPUs?	Data parallelism, global tokens per step, memory constraints, schedules, throughput, and convergence diagnostics.
Fine-tuning	How can next-token prediction be transformed into useful assistant behavior?	Assistant-only supervised loss, verifiable rewards, group-relative advantages, KL control, and preference objectives.
Inference	What happens after training when the model must generate one token at a time?	Prefill, decode, KV-cache size, memory bandwidth, checkpoint formats, quantization, and local deployment.
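
For orientation, the quantities named in the Training row can be written compactly. The notation is an assumption made for this overview (a token sequence \(x_1, \dots, x_T\) and model parameters \(\theta\)); the Training chapter may use different symbols.

\[ p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}), \qquad \mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}), \qquad \mathrm{PPL} = \exp\bigl(\mathcal{L}(\theta)\bigr). \]

The first expression is the autoregressive factorization, the second is the mean next-token cross-entropy in nats, and the third is the perplexity derived from it.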

4. The central design questions

4.1 Capacity

More parameters provide more representational capacity, but they also increase training computation, optimizer state, checkpoint size, and inference bandwidth. We therefore ask not merely whether a larger model is better, but whether its added capacity is supported by enough data and compute.
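
A rough sketch of how those costs scale with parameter count follows. It rests on two assumptions that are not taken from this article: the common rule of thumb of about 6 floating-point operations per parameter per training token, and one conventional mixed-precision Adam accounting of 16 bytes of training state per parameter. Activations are ignored, and the real MyLLM setup may differ on both counts.

```python
N_PARAMS = 1_055_231_744   # stated in the article
N_TOKENS = 27.54e9         # stated in the article

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute.

    Assumption: about 6 FLOPs per parameter per token (forward plus
    backward), ignoring attention-score terms.
    """
    return 6.0 * n_params * n_tokens

# Assumed per-parameter training state for mixed-precision Adam.
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}

def training_state_bytes(n_params: int) -> int:
    """Weights, gradients, and optimizer state; activations excluded."""
    return n_params * sum(BYTES_PER_PARAM.values())

flops = training_flops(N_PARAMS, N_TOKENS)
state_gib = training_state_bytes(N_PARAMS) / 2**30

print(f"approximate training compute: {flops:.3e} FLOPs")
print(f"approximate training state:   {state_gib:.2f} GiB")
```

Doubling the parameter count doubles both figures at a fixed token budget, which is why capacity has to be justified by data and compute rather than assumed.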

4.2 Data

A model does not learn an abstract language in isolation; it approximates the distribution represented by its corpus. Dataset composition, cleaning, tokenization, and sequence packing therefore define the learning problem as strongly as the architecture does.
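
Sequence packing is the easiest of these steps to show in code. The following is a minimal sketch under stated assumptions: documents are already tokenized into integer lists, a hypothetical `eos_id` marks document boundaries, and the incomplete tail is simply dropped. Real pipelines often make different choices, for example masking attention across document boundaries or carrying the tail into the next shard.

```python
from typing import Iterable, List

def pack_sequences(docs: Iterable[List[int]], seq_len: int, eos_id: int) -> List[List[int]]:
    """Concatenate tokenized documents and cut fixed-length training sequences."""
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # boundary marker between documents
    n_full = len(stream) // seq_len  # the incomplete tail is dropped
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

# Toy example: three short "documents" packed into sequences of length 4.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = pack_sequences(docs, seq_len=4, eos_id=0)
print(packed)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

The second packed sequence mixes the end of one document with the start of the next, which is exactly the kind of detail that changes what the model is asked to predict.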

4.3 Optimization

The loss function supplies a local direction, not a guarantee of useful learning. Batch size, learning rate, warmup, weight decay, gradient clipping, and the number of exposed tokens determine whether billions of small updates form a stable trajectory.
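
As one illustration of how warmup and decay shape that trajectory, here is a sketch of a linear-warmup, cosine-decay learning-rate schedule. This is a common choice, not necessarily the schedule MyLLM uses, and every hyperparameter value below is hypothetical.

```python
import math

def lr_at(step: int, peak_lr: float, warmup_steps: int,
          total_steps: int, min_lr: float = 0.0) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < warmup_steps:
        # Linear warmup: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical values for illustration only.
PEAK_LR = 3e-4
WARMUP_STEPS = 1_000
TOTAL_STEPS = 10_000
MIN_LR = 3e-5

for step in (0, 999, 1_000, 5_500, 10_000):
    print(step, lr_at(step, PEAK_LR, WARMUP_STEPS, TOTAL_STEPS, MIN_LR))
```

Warmup keeps early updates small while gradient statistics are still unreliable; the decay reduces step size as the run approaches its token budget.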

4.4 Deployment

Training and inference reward different properties. Training benefits from parallel processing across all sequence positions; autoregressive decode produces one new token at a time and is often limited by memory bandwidth, the rate at which weights and KV-cache entries can be read from high-bandwidth memory (HBM). A good design anticipates this asymmetry before training begins.
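
The bandwidth limit can be estimated with a simple roofline-style bound: if every decode step must read all weights and the current KV cache once, tokens per second cannot exceed bandwidth divided by bytes read per token. The sketch below assumes 2-byte weights, a hypothetical round-number bandwidth of 1 TB/s, and a hypothetical 512 MiB cache; none of these three values comes from the article, and the bound ignores compute, kernel overheads, and batching.

```python
def decode_tokens_per_second_bound(weight_bytes: float, kv_cache_bytes: float,
                                   bandwidth_bytes_per_s: float) -> float:
    """Upper bound on batch-one decode speed when memory reads dominate."""
    bytes_per_token = weight_bytes + kv_cache_bytes
    return bandwidth_bytes_per_s / bytes_per_token

N_PARAMS = 1_055_231_744       # stated in the article
BYTES_PER_WEIGHT = 2           # assumption: 16-bit weights
HBM_BANDWIDTH = 1.0e12         # hypothetical: 1 TB/s
KV_CACHE_BYTES = 512 * 2**20   # hypothetical: 512 MiB cache at long context

weight_bytes = N_PARAMS * BYTES_PER_WEIGHT
bound_empty_cache = decode_tokens_per_second_bound(weight_bytes, 0, HBM_BANDWIDTH)
bound_full_cache = decode_tokens_per_second_bound(weight_bytes, KV_CACHE_BYTES, HBM_BANDWIDTH)

print(f"bound with empty cache: {bound_empty_cache:.0f} tokens/s")
print(f"bound with full cache:  {bound_full_cache:.0f} tokens/s")
```

Under these assumptions the bound falls as the cache grows, which is why context length chosen before training shows up later as decode latency.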

Throughout the article, fixed MyLLM values provide a worked example. They should be read as quantities to explain and recalculate, not as universal defaults for every language model.
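
In that spirit, the headline numbers from the summary can be recombined directly. The inputs below are the values stated in this article; the derived quantities assume constant throughput with no restarts or evaluation pauses and, for the sequence count, that every training sequence fills the full context. Both are simplifications.

```python
N_PARAMS = 1_055_231_744        # parameters
PRETRAIN_TOKENS = 27.54e9       # input-token exposures during pretraining
THROUGHPUT_TOK_PER_S = 216_000  # about 216k input tokens/s across all GPUs
N_GPUS = 8
CONTEXT_LEN = 8_192

tokens_per_param = PRETRAIN_TOKENS / N_PARAMS
per_gpu_throughput = THROUGHPUT_TOK_PER_S / N_GPUS
wall_clock_hours = PRETRAIN_TOKENS / THROUGHPUT_TOK_PER_S / 3600
full_context_sequences = PRETRAIN_TOKENS / CONTEXT_LEN

print(f"tokens per parameter:   {tokens_per_param:.2f}")
print(f"throughput per GPU:     {per_gpu_throughput:,.0f} tokens/s")
print(f"ideal wall-clock time:  {wall_clock_hours:.1f} hours")
print(f"full-context sequences: {full_context_sequences:,.0f}")
```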

5. How to use this article

The chapters are cumulative. Data and architecture establish the symbols and dimensions used by the training mathematics. The training objective then makes the distributed pretraining measurements interpretable. Fine-tuning extends the same probabilistic model with different supervision, and inference shows how the architectural choices reappear as concrete memory and latency costs.

A useful reading habit is to pause at every numerical choice and ask three questions: what quantity constrains it, what would become more expensive if it increased, and what capability might be lost if it decreased? That habit is the transferable skill this article is designed to develop.

6. References and licensing

Vaswani et al. (2017), Attention Is All You Need, introduces the Transformer architecture used as the conceptual foundation of this article.
Hoffmann et al. (2022), Training Compute-Optimal Large Language Models, provides the scaling-law framework discussed in the data chapter.

Article prose and original figures are © 2026, all rights reserved. External papers, datasets, software, trademarks, and documentation remain under their respective owners and licenses. Citation or discussion does not relicense third-party material.