The Scaling Laws Explorer
An interactive visualization of the paper "Scaling Laws for Neural Language Models" by Kaplan et al.
Understanding Scaling Laws
The 2020 paper “Scaling Laws for Neural Language Models” by Kaplan et al. revealed fundamental relationships between model performance and three key factors: model size (N), dataset size (D), and compute budget (C).
Key Insight #1
Loss follows a power law: Performance improves predictably as you scale up any of the three factors.
Key Insight #2
There's an optimal ratio: For any compute budget, there's an ideal balance between model size and data.
Key Insight #3
Chinchilla proved it: Training on ~20 tokens per parameter is optimal (not 1-2 as was common).
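As a back-of-the-envelope illustration of the ~20 tokens-per-parameter rule of thumb, here is a minimal sketch that converts a parameter count into a Chinchilla-style token budget; the model sizes are arbitrary examples, not values from either paper.

```typescript
// Chinchilla rule of thumb: compute-optimal training uses ~20 tokens per parameter.
// Model sizes below are illustrative examples only.
const TOKENS_PER_PARAM = 20;

function chinchillaOptimalTokens(nParams: number): number {
  return TOKENS_PER_PARAM * nParams;
}

console.log(chinchillaOptimalTokens(7e9));  // 7B params  -> 140e9 (~140B tokens)
console.log(chinchillaOptimalTokens(70e9)); // 70B params -> 1.4e12 (~1.4T tokens, Chinchilla's own budget)
```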
Interactive Scaling Laws Explorer
Main Interactive Explorer
Adjust the sliders below to explore how Model Size (N), Dataset Size (D), and Compute (C) influence the predicted test loss of a language model, based on the scaling laws.
📊 Note: The paper tested models up to 1.5B parameters and 23B tokens (Page 7). Current values (D=100B) extend beyond these experiments. The authors observed “no signs of deviation from these trends on the upper end” (Page 3).
Predicted Test Loss:
2.3888
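For reference, a minimal sketch of how this readout can be reproduced directly from Equation 4.1 with the constants listed in the next section. The function name is illustrative, and N = 1B is an assumption; together with the D = 100B mentioned in the note above, it reproduces the displayed value.

```typescript
// Fitted constants from Kaplan et al. (Equations 1.1 and 1.2)
const ALPHA_N = 0.076; // model size scaling exponent
const ALPHA_D = 0.095; // dataset size scaling exponent
const N_C = 8.8e13;    // critical model size constant (non-embedding parameters)
const D_C = 5.4e13;    // critical dataset size constant (tokens)

// L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^(alpha_D)   (Eq. 4.1)
function predictedLoss(nParams: number, nTokens: number): number {
  const modelTerm = Math.pow(N_C / nParams, ALPHA_N / ALPHA_D);
  const dataTerm = D_C / nTokens;
  return Math.pow(modelTerm + dataTerm, ALPHA_D);
}

// N = 1B (assumed slider value), D = 100B (from the note above)
console.log(predictedLoss(1e9, 100e9).toFixed(4)); // ≈ 2.3888, matching the readout
```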
Scaling Law Constants [Table 1 & 2]
These constants define the power laws for neural language model performance. The combined formula: L(N, D) = [(N_c/N)^(α_N/α_D) + D_c/D]^(α_D) [Eq. 4.1]
- N_c = 8.8 × 10¹³ (88T): critical model size constant (non-embedding parameters). Source: Page 4, Equation 1.1
- D_c = 5.4 × 10¹³ (54T): critical dataset size constant (tokens). Source: Page 5, Equation 1.2
- α_N = 0.076: model size scaling exponent. Source: Page 4, Equation 1.1
- α_D = 0.095: dataset size scaling exponent. Source: Page 5, Equation 1.2
Interpretation:
- These constants were fitted to experimental data from models ranging from 10⁶ to 10⁹ parameters [Fig. 1]
- The exponents (α_N ≈ 0.076, α_D ≈ 0.095) mean loss falls slightly faster per 10x increase in data than per 10x increase in model size (see the sketch after this list)
- The formula approaches zero loss only asymptotically; in practice the loss must eventually level off near the entropy of natural language, a floor the pure power law does not capture
- These laws hold across multiple domains but may vary for specialized datasets
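To make the second bullet concrete, here is a small sketch of the per-factor improvement implied by the exponents: under the single-variable laws L(N) = (N_c/N)^α_N and L(D) = (D_c/D)^α_D, a 10x increase in one factor multiplies the loss by 10^(-α).

```typescript
// Loss multiplier from a 10x increase in a single factor, holding everything else fixed
const ALPHA_N = 0.076;
const ALPHA_D = 0.095;

const per10xModel = Math.pow(10, -ALPHA_N); // ≈ 0.839 -> ~16% lower loss per 10x parameters
const per10xData  = Math.pow(10, -ALPHA_D); // ≈ 0.804 -> ~20% lower loss per 10x tokens

console.log(per10xModel.toFixed(3), per10xData.toFixed(3));
```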
Quick Facts
- GPT-3 was trained on only ~1.7 tokens per parameter (300B tokens for 175B parameters), far from optimal! (See the arithmetic in the sketch after this list.)
- Chinchilla achieved better performance than GPT-3 with 2.5x fewer parameters by using ~20 tokens/parameter.
- Every 10x increase in compute should be split roughly equally between model size and data (in log space).
- The scaling laws hold across 7+ orders of magnitude, from tiny models to GPT-4 scale!
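A quick numerical check of the facts above, using the published training figures for GPT-3 (175B parameters, 300B training tokens) and Chinchilla (70B parameters, 1.4T tokens); the C^0.5 split is the Chinchilla-style allocation referred to in the third bullet.

```typescript
// Tokens-per-parameter ratios for the models cited above
const tokensPerParam = (tokens: number, params: number) => tokens / params;

console.log(tokensPerParam(300e9, 175e9).toFixed(1)); // GPT-3: ~1.7 tokens/parameter
console.log(tokensPerParam(1.4e12, 70e9).toFixed(1)); // Chinchilla: 20.0 tokens/parameter

// "Split a 10x compute increase equally in log space": N and D each scale roughly as C^0.5,
// so 10x more compute buys about 3.2x more parameters and 3.2x more data.
console.log(Math.sqrt(10).toFixed(1)); // ≈ 3.2
```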