Large Language Models are not Fair Evaluators
arXiv: 2306.05685
TLDR (English)
Systematically evaluates biases in LLM-as-a-Judge methods: position bias (favoring the response shown first), length bias (favoring longer responses), and self-enhancement bias (a judge favoring content it generated itself). Proposes mitigations such as position-swapped evaluation, where the pair is judged again with the response order reversed, and reference-based scoring, where the judge is given a reference answer.
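Position-swapped evaluation, mentioned in the summary above, can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: the `Judge` signature, the `"first"`/`"second"`/`"tie"` labels, and the rule that a win counts only when both orderings agree are all assumptions made for the example. The two toy judges stand in for a real LLM call.

```python
from typing import Callable

# Hypothetical judge signature (an assumption for this sketch):
# (question, first_response, second_response) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def swapped_verdict(judge: Judge, question: str, a: str, b: str) -> str:
    """Judge the pair in both orders and return "A", "B", or "tie"."""
    forward = judge(question, a, b)   # response A shown first
    backward = judge(question, b, a)  # response B shown first

    # Map positional verdicts back to the underlying responses.
    forward_winner = {"first": "A", "second": "B"}.get(forward, "tie")
    backward_winner = {"first": "B", "second": "A"}.get(backward, "tie")

    # Assumed aggregation rule: a win counts only if both orderings agree;
    # any disagreement is treated as a tie.
    return forward_winner if forward_winner == backward_winner else "tie"


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: always picks the first response."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Toy judge that is position-independent but prefers the longer response."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


if __name__ == "__main__":
    print(swapped_verdict(always_first, "q", "x", "y"))
    print(swapped_verdict(prefers_longer, "q", "short", "a much longer answer"))
```

With the purely position-biased judge the two orderings contradict each other, so the pair collapses to a tie. The length-preferring judge gives the same winner in both orders, which shows that swapping positions addresses position bias only and leaves length bias untouched.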