liang2022-helm

arXiv: 2211.09110

TLDR

Stanford CRFM systematically evaluates 30+ LLMs across multiple metric dimensions (accuracy, robustness, fairness, efficiency, ...), putting LLM evaluation on a scientific footing. It is a representative work pushing back against judging models by a single average score.