The Scaling Laws Explorer
An interactive visualization of the paper "Scaling Laws for Neural Language Models" by Kaplan et al.
Understanding Scaling Laws
The 2020 paper “Scaling Laws for Neural Language Models” by Kaplan et al. revealed fundamental relationships between model performance and three key factors: model size (N), dataset size (D), and compute budget (C).
Key Insight #1
Loss follows a power law: Performance improves predictably as you scale up any of the three factors.
Key Insight #2
There's an optimal ratio: For any compute budget, there's an ideal balance between model size and data.
Key Insight #3
Chinchilla proved it: Training on ~20 tokens per parameter is optimal (not 1-2 as was common).
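As a back-of-the-envelope illustration of the ~20 tokens-per-parameter rule of thumb, here is a minimal sketch that converts a parameter count into a Chinchilla-style token budget; the model sizes are arbitrary examples, not values from either paper.

```typescript
// Chinchilla rule of thumb: compute-optimal training uses ~20 tokens per parameter.
// Model sizes below are illustrative examples only.
const TOKENS_PER_PARAM = 20;

function chinchillaOptimalTokens(nParams: number): number {
  return TOKENS_PER_PARAM * nParams;
}

console.log(chinchillaOptimalTokens(7e9));  // 7B params  -> 140e9 (~140B tokens)
console.log(chinchillaOptimalTokens(70e9)); // 70B params -> 1.4e12 (~1.4T tokens, Chinchilla's own budget)
```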
Interactive Scaling Laws Explorer
Main Interactive Explorer
Adjust the sliders below to explore how Model Size (N), Dataset Size (D), and Compute (C) influence the predicted test loss of a language model, based on the scaling laws.
📊 Note: The paper tested models up to 1.5B parameters and 23B tokens (Page 7). Current values (D=100B) extend beyond these experiments. The authors observed “no signs of deviation from these trends on the upper end” (Page 3).
Predicted Test Loss:
2.3888
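For reference, a minimal sketch of how this readout can be reproduced directly from Equation 4.1 with the constants listed in the next section. The function name is illustrative, and N = 1B is an assumption; together with the D = 100B mentioned in the note above, it reproduces the displayed value.

```typescript
// Fitted constants from Kaplan et al. (Equations 1.1 and 1.2)
const ALPHA_N = 0.076; // model size scaling exponent
const ALPHA_D = 0.095; // dataset size scaling exponent
const N_C = 8.8e13;    // critical model size constant (non-embedding parameters)
const D_C = 5.4e13;    // critical dataset size constant (tokens)

// L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^(alpha_D)   (Eq. 4.1)
function predictedLoss(nParams: number, nTokens: number): number {
  const modelTerm = Math.pow(N_C / nParams, ALPHA_N / ALPHA_D);
  const dataTerm = D_C / nTokens;
  return Math.pow(modelTerm + dataTerm, ALPHA_D);
}

// N = 1B (assumed slider value), D = 100B (from the note above)
console.log(predictedLoss(1e9, 100e9).toFixed(4)); // ≈ 2.3888, matching the readout
```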
Scaling Law Constants [Table 1 & 2]
These constants define the power laws for neural language model performance. The combined formula: L(N, D) = [(N_c/N)^(α_N/α_D) + D_c/D]^(α_D) [Eq. 4.1]
- N_c = 8.8 × 10¹³ (88T): critical model size constant (non-embedding parameters). Source: Page 4, Equation 1.1
- D_c = 5.4 × 10¹³ (54T): critical dataset size constant (tokens). Source: Page 5, Equation 1.2
- α_N = 0.076: model size scaling exponent. Source: Page 4, Equation 1.1
- α_D = 0.095: dataset size scaling exponent. Source: Page 5, Equation 1.2
Interpretation:
- These constants were fitted to experimental data from models ranging from 10⁶ to 10⁹ parameters [Fig. 1]
- The exponents (α_N ≈ 0.076, α_D ≈ 0.095) mean loss falls slightly faster per 10x increase in data than per 10x increase in model size (see the sketch after this list)
- The formula approaches zero loss only asymptotically; in practice the loss must eventually level off near the entropy of natural language, a floor the pure power law does not capture
- These laws hold across multiple domains but may vary for specialized datasets
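To make the second bullet concrete, here is a small sketch of the per-factor improvement implied by the exponents: under the single-variable laws L(N) = (N_c/N)^α_N and L(D) = (D_c/D)^α_D, a 10x increase in one factor multiplies the loss by 10^(-α).

```typescript
// Loss multiplier from a 10x increase in a single factor, holding everything else fixed
const ALPHA_N = 0.076;
const ALPHA_D = 0.095;

const per10xModel = Math.pow(10, -ALPHA_N); // ≈ 0.839 -> ~16% lower loss per 10x parameters
const per10xData  = Math.pow(10, -ALPHA_D); // ≈ 0.804 -> ~20% lower loss per 10x tokens

console.log(per10xModel.toFixed(3), per10xData.toFixed(3));
```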
Quick Facts
- GPT-3 was trained on only ~1.7 tokens per parameter (300B tokens for 175B parameters), far from optimal! (See the arithmetic in the sketch after this list.)
- Chinchilla achieved better performance than GPT-3 with 2.5x fewer parameters by using ~20 tokens/parameter.
- Every 10x increase in compute should be split roughly equally between model size and data (in log space).
- The scaling laws hold across 7+ orders of magnitude, from tiny models to GPT-4 scale!
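A quick numerical check of the facts above, using the published training figures for GPT-3 (175B parameters, 300B training tokens) and Chinchilla (70B parameters, 1.4T tokens); the C^0.5 split is the Chinchilla-style allocation referred to in the third bullet.

```typescript
// Tokens-per-parameter ratios for the models cited above
const tokensPerParam = (tokens: number, params: number) => tokens / params;

console.log(tokensPerParam(300e9, 175e9).toFixed(1)); // GPT-3: ~1.7 tokens/parameter
console.log(tokensPerParam(1.4e12, 70e9).toFixed(1)); // Chinchilla: 20.0 tokens/parameter

// "Split a 10x compute increase equally in log space": N and D each scale roughly as C^0.5,
// so 10x more compute buys about 3.2x more parameters and 3.2x more data.
console.log(Math.sqrt(10).toFixed(1)); // ≈ 3.2
```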