
Deduplicating Training Data Makes Language Models Better

Authors: Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini (2022)

arXiv: 2107.06499

Field

Pre-training

TLDR

Shows systematically that deduplicating training data improves language models and reduces memorization. With exact-duplicate and near-duplicate content removed from datasets such as C4 and RealNews, models reach the same or better accuracy in fewer training steps and emit memorized training text about ten times less often.
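To make the two kinds of duplicate concrete, here is a minimal sketch of document-level deduplication. It is an illustration under stated assumptions, not the paper's tooling: the paper uses suffix arrays for exact substrings and MinHash for near-duplicates, whereas this toy version hashes whole documents and compares word-trigram sets pairwise. The function names, the 0.8 threshold, and the trigram size are all hypothetical choices made for the example.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold=0.8, n=3):
    """Keep the first copy of each document; drop exact and near duplicates.

    threshold and n are illustrative values, not taken from the paper.
    The pairwise comparison is O(N^2), so this only suits small collections.
    """
    seen_hashes = set()   # digests of documents already kept
    kept = []             # documents retained, in input order
    kept_shingles = []    # shingle sets of the retained documents
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of a kept document
        sh = shingles(doc, n)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near duplicate of a kept document
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(sh)
    return kept
```

At the scale of C4, comparing every pair of documents is infeasible, which is why approximate schemes such as MinHash are used in practice instead of the exact Jaccard computation above.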

Related papers

Other papers in the same field