Training Compute-Optimal Large Language Models
arXiv: 2203.15556
TLDR (English)
Proposes the Chinchilla scaling laws: given a fixed compute budget, model parameters and training tokens should scale equally (challenging the prior belief that parameters matter more). Chinchilla 70B outperformed Gopher 280B, redefining optimal LLM training strategy.
TLDR(中文)
提出了 Chinchilla 法则:在固定算力预算下,模型参数量和训练数据量应该同比例增长 (而非此前主流认为的参数增长更重要)。这重新定义了 LLM 训练的最优策略, Chinchilla 70B 在多个基准上超越了 Gopher 280B。