GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

Authors: Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, Claire Cui (2021)

arXiv: 2112.06905

Domains

Mixture of ExpertsPretraining

TLDR (English)

1.2T parameter MoE achieves GPT-3 quality with 1/3 training compute, early representative of MoE "cost-effectiveness wins". Mixtral/DeepSeek-V2/V3 are its spiritual descendants.

TLDR（中文）

1.2T 参数 MoE 在 1/3 训练算力下达到 GPT-3 同等质量，是 MoE 路线"性价比胜出"的早期代表。Mixtral / DeepSeek-V2/V3 都是它的精神后裔。

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

Domains

TLDR (English)

TLDR（中文）

Appears in These Articles

Co-cited Papers

Related Papers