Large Language Models are not Fair Evaluators

Authors: Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, Zhifang Sui (2023)

Field

Evaluation

TLDR

Systematically evaluates bias issues in LLM-as-a-Judge methods: position bias (preferring the first response), length bias (preferring longer responses), and self-enhancement bias (preferring self-generated content). Proposes mitigation methods such as position-swapped evaluation and reference-based scoring.
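As a rough illustration of position-swapped evaluation, the sketch below scores a pair of answers in both orders and averages each answer's two scores. It is a minimal sketch under stated assumptions, not the authors' implementation: the `judge` callable, its signature, the score scale, and the toy `first_slot_biased_judge` are all hypothetical stand-ins for a real LLM call.

```python
from typing import Callable, Tuple

# Hypothetical judge signature (an assumption, not from the paper):
# (question, first_answer, second_answer) -> (score_for_first, score_for_second)
Judge = Callable[[str, str, str], Tuple[float, float]]


def position_swapped_scores(
    judge: Judge, question: str, answer_a: str, answer_b: str
) -> Tuple[float, float]:
    """Score the pair in both orders and average each answer's two scores."""
    a_first, b_second = judge(question, answer_a, answer_b)
    b_first, a_second = judge(question, answer_b, answer_a)
    return (a_first + a_second) / 2, (b_first + b_second) / 2


def verdict(score_a: float, score_b: float) -> str:
    """Turn a pair of scores into a winner label."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


def first_slot_biased_judge(question: str, first: str, second: str) -> Tuple[float, float]:
    # Toy stand-in for an LLM judge: both answers get the same base score,
    # and whichever answer is shown first receives a one-point bonus.
    return 8.0 + 1.0, 8.0


single_order = verdict(*first_slot_biased_judge("q", "answer A", "answer B"))
swapped = verdict(*position_swapped_scores(first_slot_biased_judge, "q", "answer A", "answer B"))
```

With this toy judge, a single-order comparison favors whichever answer is shown first, while the averaged scores come out equal. Averaging only cancels a bias that is symmetric across the two slots; it does not address length or self-enhancement bias.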

Appears in these articles

Evaluation and Benchmarks: Judging Model Quality

Co-cited

These papers appear in the same article as this one
