liang2022-helm
arXiv: 2211.09110
TLDR (English)
Stanford CRFM systematically evaluates 30+ LLMs across multiple metric dimensions (accuracy, robustness, fairness, efficiency, ...), helping establish evaluation as a science in its own right. A representative work arguing against looking only at average scores.
TLDR(中文)
Stanford CRFM 系统化评测 30+ LLM × 多维度指标(准确性、鲁棒性、公平性、效率…),把"评测科学"立起来。是反"只看平均分"的代表性工作。
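Illustration (not from the paper): a minimal Python sketch of why a single average can mislead. The model names, metric values, and the helpers `average_score` and `metric_gaps` are all hypothetical and chosen only to make the point. No number here is a HELM result.

```python
from statistics import mean

# Hypothetical scores on a 0-100 scale; NOT taken from HELM.
scores = {
    "model_a": {"accuracy": 80, "robustness": 40, "fairness": 60, "efficiency": 60},
    "model_b": {"accuracy": 60, "robustness": 60, "fairness": 60, "efficiency": 60},
}


def average_score(metrics):
    """Collapse all metric dimensions into one number (the view HELM argues against)."""
    return mean(metrics.values())


def metric_gaps(a, b):
    """Per-metric difference a - b, keeping each dimension visible."""
    return {name: a[name] - b[name] for name in a}


if __name__ == "__main__":
    for name, metrics in scores.items():
        print(name, "average:", average_score(metrics))
    print("gaps (model_a - model_b):", metric_gaps(scores["model_a"], scores["model_b"]))
```

Both hypothetical models have the same average, yet they trade accuracy against robustness. That trade-off is visible only in the per-metric view.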