Large Language Models are not Fair Evaluators

Authors: Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, Zhifang Sui (2023)

Domains

Evaluation

TLDR (English)

Systematically examines bias in LLM-as-a-Judge evaluation, focusing on position bias: when a model such as GPT-4 or ChatGPT compares two candidate responses, the verdict can flip simply because the order of the responses is swapped. Proposes a calibration framework to mitigate this: Multiple Evidence Calibration (the judge writes its evaluation evidence before assigning scores), Balanced Position Calibration (scores are aggregated over both orderings of the responses), and Human-in-the-Loop Calibration (examples the judge is unsure about are routed to human annotators).
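
Swapping the positions of the two responses and aggregating the scores, as mentioned in the TLDR, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper's code: the `Judge` signature, the `position_balanced_scores` and `verdict` helpers, and the toy judge, which stands in for a real LLM call and simply adds a fixed bonus to whichever response is shown first.

```python
from typing import Callable, Tuple

# Hypothetical judge signature (an assumption, not an API from the paper):
# (question, first_response, second_response) -> (score_first, score_second)
Judge = Callable[[str, str, str], Tuple[float, float]]


def position_balanced_scores(
    judge: Judge, question: str, resp_a: str, resp_b: str
) -> Tuple[float, float]:
    """Score both orderings and average, so neither response gains from its slot."""
    a_first, b_second = judge(question, resp_a, resp_b)  # A shown first
    b_first, a_second = judge(question, resp_b, resp_a)  # B shown first
    return (a_first + a_second) / 2, (b_first + b_second) / 2


def verdict(score_a: float, score_b: float) -> str:
    """Turn a pair of scores into 'A', 'B', or 'tie'."""
    if score_a == score_b:
        return "tie"
    return "A" if score_a > score_b else "B"


# Toy stand-in for an LLM judge: fixed quality scores plus a
# +2 bonus for whichever response appears in the first slot.
TOY_QUALITY = {"good answer": 7.0, "weak answer": 6.0}


def toy_biased_judge(question: str, first: str, second: str) -> Tuple[float, float]:
    return TOY_QUALITY[first] + 2.0, TOY_QUALITY[second]


# Single-order evaluation: the weak answer wins only because it is shown first.
single = toy_biased_judge("q", "weak answer", "good answer")    # (8.0, 7.0)

# Position-balanced evaluation: the first-slot bonus cancels out.
balanced = position_balanced_scores(
    toy_biased_judge, "q", "weak answer", "good answer"
)                                                                # (7.0, 8.0)
print(verdict(*single), verdict(*balanced))                      # prints "A B"
```

With a real judge the two calls would be separate LLM requests, and a disagreement between the two orderings can itself be treated as a signal that the example is hard to judge.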

Appears in These Articles

Evaluation and Benchmarks: Judging Model Quality