Pretraining MyLLM

In brief
Table of contents
  1. Why pretraining is the expensive foundation
  2. Pretraining setup
  3. Input-token budget
  4. Convergence
  5. References and licensing

1. Why pretraining is the expensive foundation

The equations in the Training chapter describe one optimizer update. Pretraining repeats that update about 105,000 times over 27.54B input tokens. This is where a model acquires broad statistical representations of language, mathematical notation, explanations, and common reasoning patterns.

The central problem is efficient distributed execution: maintain high utilization, prevent input-pipeline stalls, partition examples across workers, synchronize gradients, and preserve numerical stability. Throughput, memory allocation, gradient norm, learning rate, and loss are operational diagnostics. Together they reveal whether input staging, communication, memory, or optimization behavior is limiting progress.
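The partition-and-synchronize step can be illustrated without any GPU. The following is a minimal sketch, not MyLLM's training code: it uses a hypothetical one-parameter model, splits a batch of 32 examples across 8 simulated workers, and checks that averaging the per-worker gradients (the role an all-reduce plays in data-parallel training) reproduces the full-batch gradient. All names are illustrative.

```python
import math
import random


def shard_gradient(w, xs, ys):
    """Mean gradient of 0.5 * (w*x - y)**2 with respect to w over one shard."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


random.seed(0)
NUM_WORKERS = 8
GLOBAL_BATCH = 32
w = 0.5  # toy scalar parameter (hypothetical model)

xs = [random.uniform(-1.0, 1.0) for _ in range(GLOBAL_BATCH)]
ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]

# Partition the global batch into equal, disjoint shards, one per worker.
per_worker = GLOBAL_BATCH // NUM_WORKERS
shards = [
    (xs[i * per_worker:(i + 1) * per_worker],
     ys[i * per_worker:(i + 1) * per_worker])
    for i in range(NUM_WORKERS)
]

# Each worker computes a local gradient; averaging stands in for all-reduce.
local_grads = [shard_gradient(w, sx, sy) for sx, sy in shards]
synced_grad = sum(local_grads) / NUM_WORKERS

# With equal shard sizes, the averaged gradient equals the full-batch gradient.
full_grad = shard_gradient(w, xs, ys)
print(math.isclose(synced_grad, full_grad, rel_tol=1e-9, abs_tol=1e-12))
```

The equality relies on every shard having the same size. With uneven shards the average would need to be weighted by shard size.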

2. Pretraining setup

MyLLM uses 8 GPUs with an 8,192-token context, a global batch of 32 packed sequences, no gradient accumulation, and data-parallel training. Each optimizer update therefore processes

\[32\times8{,}192=262{,}144\text{ input tokens per step}.\]
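The batch geometry can be checked in a few lines. This is a minimal sketch with illustrative variable names. The per-GPU figure assumes the 32 sequences are split evenly across the 8 data-parallel workers, which the setup implies but does not state outright.

```python
# Batch geometry stated in the text; variable names are illustrative.
NUM_GPUS = 8
CONTEXT_LENGTH = 8_192        # tokens per packed sequence
GLOBAL_BATCH_SEQUENCES = 32   # packed sequences per optimizer step
GRAD_ACCUM_STEPS = 1          # "no gradient accumulation"

# Assumption: sequences are divided evenly across data-parallel workers.
sequences_per_gpu = GLOBAL_BATCH_SEQUENCES // (NUM_GPUS * GRAD_ACCUM_STEPS)
tokens_per_gpu = sequences_per_gpu * CONTEXT_LENGTH
tokens_per_step = GLOBAL_BATCH_SEQUENCES * CONTEXT_LENGTH

print(sequences_per_gpu)  # 4
print(tokens_per_gpu)     # 32768
print(tokens_per_step)    # 262144
```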

3. Input-token budget

This article treats Pretraining Data 1 and Pretraining Data 2 as two datasets within a single training plan. Their combined input-token exposure is

\[12.54\text{B}+15.00\text{B}=27.54\text{B input tokens}.\]

| Pretraining dataset | Input-token exposure | Steps | Wall time | Throughput | Purpose |
|---|---|---|---|---|---|
| Pretraining Data 1: OpenWebMath | 12,540,182,528 | 47,837 | about 15.9 hours | about 219k tokens/s | Mathematical web text, notation, derivations, definitions, and proof-like prose. |
| Pretraining Data 2: FineWeb-Edu + Cosmopedia auto_math_text | 15,000,141,824 | 57,221 | about 19.5 hours | about 214k tokens/s | General English, educational prose, textbook-like explanations, and student-facing language. |
| Total | 27,540,324,352 | 105,058 | about 35.4 hours | about 216k tokens/s | Math plus English tutor foundation. |

Pretraining Data 2 contains 13.25B unique tokens once the text is packed into complete 8,192-token sequences: 12.00B from FineWeb-Edu and 1.25B from Cosmopedia auto_math_text. The optimizer schedule draws 15.00B input tokens from this pool, so some sequences are seen more than once. For training compute and exposure, the relevant figure is the 15.00B-token schedule, not the unique-token count.
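The gap between unique tokens and scheduled draws can be expressed as an average number of passes. The sketch below uses only the rounded figures quoted above. It reports an average and says nothing about how the sampler actually distributes repeats across the two sources, which the text does not describe.

```python
# Rounded figures quoted in the text for Pretraining Data 2 (in tokens).
UNIQUE_FINEWEB_EDU = 12.00e9
UNIQUE_COSMOPEDIA = 1.25e9
SCHEDULED_DRAWS = 15.00e9

unique_tokens = UNIQUE_FINEWEB_EDU + UNIQUE_COSMOPEDIA
average_passes = SCHEDULED_DRAWS / unique_tokens
repeated_draws = SCHEDULED_DRAWS - unique_tokens

print(f"unique tokens:  {unique_tokens / 1e9:.2f}B")   # 13.25B
print(f"average passes: {average_passes:.2f}")         # 1.13
print(f"repeat draws:   {repeated_draws / 1e9:.2f}B")  # 1.75B
```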

Figure 1. Learning-rate schedule and token throughput across both pretraining datasets.

Figure 2. Cumulative input-token exposure: the full pretraining plan exposes MyLLM to 27.54B input tokens across Pretraining Data 1 and Pretraining Data 2.

3.1 Budget consistency

Every optimizer step uses a full batch, so the nominal number of input-token draws is the step count multiplied by 262,144. For the full pretraining plan,

\[(47{,}837+57{,}221)\times262{,}144=105{,}058\times262{,}144=27{,}540{,}324{,}352.\]

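Dividing the budget by the total wall time, 57,232 s for Pretraining Data 1 plus 70,205 s for Pretraining Data 2, gives the average throughput across the full pretraining plan:

\[\frac{27.54\times10^9\text{ input tokens}}{(57{,}232+70{,}205)\text{ s}}\approx216{,}100\text{ input tokens/s}.\]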
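The same consistency check can be scripted so that step counts, token totals, and throughput are derived from one set of inputs. This is a minimal sketch: the step counts and wall times in seconds are the ones quoted in this section, and the dictionary layout is purely illustrative.

```python
TOKENS_PER_STEP = 32 * 8_192  # 262,144 input tokens per optimizer step

# Step counts and wall times (seconds) as quoted in the text.
runs = {
    "Pretraining Data 1": {"steps": 47_837, "seconds": 57_232},
    "Pretraining Data 2": {"steps": 57_221, "seconds": 70_205},
}

total_tokens = 0
total_seconds = 0
for name, run in runs.items():
    # Every step uses a full batch, so tokens = steps * tokens per step.
    tokens = run["steps"] * TOKENS_PER_STEP
    total_tokens += tokens
    total_seconds += run["seconds"]
    print(f"{name}: {tokens:,} tokens, {tokens / run['seconds']:,.0f} tokens/s")

print(f"Total: {total_tokens:,} tokens in {total_seconds / 3600:.1f} h, "
      f"{total_tokens / total_seconds:,.0f} tokens/s")
```

Deriving the totals from step counts, instead of typing them in separately, keeps the reported token budget an exact multiple of the batch size.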

4. Convergence

Pretraining Data 1 starts near the uniform-prediction reference \(\ln(65{,}536)\approx11.09\), the cross-entropy of guessing uniformly over 65,536 tokens, and falls to roughly 2.0–2.1. Pretraining Data 2 has a different data distribution, so its loss is reported as a separate curve: it peaks around 3.1 and ends near 2.6. The two curves should not be read as one continuous loss line, because they measure loss on different text distributions.
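The reference value and the reported losses are easier to compare as perplexities. The sketch below assumes the loss is a mean cross-entropy in nats and that the vocabulary has 65,536 entries, as the \(\ln(65{,}536)\) reference implies. The end-of-run losses are approximate readings from the text, with 2.05 taken as the midpoint of the quoted 2.0–2.1 range.

```python
import math

VOCAB_SIZE = 65_536  # implied by the ln(65,536) reference in the text

# Cross-entropy (in nats) of a uniform guess over the vocabulary.
uniform_loss = math.log(VOCAB_SIZE)
print(f"uniform reference loss: {uniform_loss:.2f}")  # 11.09

# Perplexity is exp(loss). Losses below are approximate values from the text;
# 2.05 is an assumed midpoint of the quoted 2.0-2.1 range.
final_losses = [("Pretraining Data 1", 2.05), ("Pretraining Data 2", 2.6)]
for label, loss in final_losses:
    print(f"{label}: loss {loss:.2f} -> perplexity {math.exp(loss):.1f}")
```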

Figure 3. MyLLM pretraining loss across 27.54B input tokens. The loss curve reports Pretraining Data 1 and Pretraining Data 2 separately because their text distributions differ.

5. References and licensing

A mixed dataset inherits the license obligations of each of its sources. Token counts are tokenizer-dependent and should be reported together with the tokenizer that produced them.