zheng2023-mtbench
arXiv: 2306.05685
TLDR
Proposes using GPT-4 as a judge, alongside crowdsourced human preferences (Chatbot Arena), to evaluate the conversational ability of chat models. MT-Bench and Arena Elo remain, to this day, the community's two de facto standards for comparing models' "dialogue capability".