Deduplicating Training Data Makes Language Models Better
arXiv: 2107.06499
TLDR (English)
The paper systematically demonstrates that deduplicating training data improves language model performance and reduces memorization. After near-duplicate and exact-duplicate examples are removed from C4 and RealNews, models perform better on downstream tasks and are far less likely to emit training data verbatim.
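To make the two kinds of duplicate concrete, here is a minimal sketch in Python. It does not reproduce the paper's pipeline: it substitutes a hash of whitespace- and case-normalized text for exact matching, and brute-force Jaccard similarity over word trigrams for near matching. The function names, the trigram size, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of lowercased word n-grams in `text`."""
    words = text.lower().split()
    if len(words) < n:
        # Short texts become a single shingle so they can still be compared.
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold=0.8, n=3):
    """Keep the first occurrence of each document.

    A document is dropped when it is an exact duplicate (same normalized
    text) or a near duplicate (Jaccard similarity >= threshold) of a
    document that was already kept. The threshold is an illustrative
    assumption, not a value taken from the paper.
    """
    seen_hashes = set()
    kept = []
    kept_shingles = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate
        doc_shingles = shingles(normalized, n)
        # Brute-force comparison: quadratic in the number of kept documents,
        # so this only suits small collections.
        if any(jaccard(doc_shingles, other) >= threshold
               for other in kept_shingles):
            continue  # near duplicate
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(doc_shingles)
    return kept
```

Called on a list of strings, `deduplicate` returns the surviving documents in their original order. The pairwise comparison is quadratic, so a corpus the size of C4 would need an approximate or indexed method in place of this loop.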