Large Language Models are not Fair Evaluators
arXiv: 2306.05685
TLDR (English)
Systematically evaluates three biases in LLM-as-a-Judge methods: position bias (favoring the response presented first), length bias (favoring longer responses), and self-enhancement bias (favoring content the judge model generated itself). Proposes mitigations such as judging each pair in both position orders and supplying a reference answer to the judge.
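A minimal sketch of position-swapped evaluation, to make the idea concrete. It is one plausible reading of the mitigation, not the paper's implementation: the `judge` callable, its return labels ("first", "second", "tie"), and the rule that inconsistent verdicts count as a tie are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical judge interface (an assumption, not from the paper):
# takes (question, first_answer, second_answer) and returns
# "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def swapped_verdict(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Judge the pair in both presentation orders; return "A", "B", or "tie"."""
    forward = judge(question, answer_a, answer_b)   # A is shown first
    backward = judge(question, answer_b, answer_a)  # B is shown first

    # A answer wins only if it is preferred regardless of where it appeared.
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"  # tied or order-dependent verdicts are treated as a tie


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: always picks the first slot."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Toy judge with pure length bias: picks the longer answer."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


if __name__ == "__main__":
    print(swapped_verdict(always_first, "q", "x", "y"))
    print(swapped_verdict(prefers_longer, "q", "short", "a longer answer"))
```

With the position-biased toy judge the two orders disagree, so the result is a tie. With the length-biased toy judge the longer answer wins in both orders, which suggests that swapping addresses position bias only and that length bias needs a separate mitigation.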