GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

Authors: Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, Claire Cui (2021)

arXiv: 2112.06905

Field

Mixture-of-Experts pretraining

TLDR

GLaM's largest model, a 1.2T-parameter Mixture-of-Experts (MoE), reaches GPT-3-level quality at roughly one third of the training cost (the paper reports this as training energy). It is an early example of the MoE approach winning on cost-effectiveness; Mixtral and DeepSeek-V2/V3 are its spiritual descendants.

