TY - GEN
T1 - Poor-Supervised Evaluation for SuperLLM via Mutual Consistency
AU - Yuan, Peiwen
AU - Feng, Shaoxiong
AU - Li, Yiwei
AU - Wang, Xinglin
AU - Pan, Boyuan
AU - Wang, Heda
AU - Hu, Yao
AU - Li, Kan
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - The guidance from capability evaluations has greatly propelled the progress of human society and the development of Artificial Intelligence. However, as LLMs evolve, it becomes challenging to construct evaluation benchmarks with accurate labels for LLMs whose capabilities approach or even surpass those of humans (denoted as SuperLLMs). To credibly conduct evaluation without accurate labels (denoted as poor-supervised evaluation), we first prove that the consistency between the model under evaluation and the reference model, when their prediction distributions are independent and the sample size is infinite, can equivalently assess the true capabilities of the model under evaluation. However, using either humans or LLMs as the reference model cannot sufficiently meet these conditions, for which we propose the PEEM algorithm. By treating all models under evaluation as reference models, PEEM alternately optimizes model weights and filters reference models based on the EM algorithm to maximally alleviate the insufficiency of the conditions. Comprehensive experiments across 3 types of tasks with 16 mainstream LLMs validate the efficiency, universality, and effectiveness of PEEM. More generally, PEEM has advanced the evolution of the evaluation paradigm from human-centric to human&model-centric, alleviating the limitations of human capabilities for evaluating SuperLLMs.
AB - The guidance from capability evaluations has greatly propelled the progress of human society and the development of Artificial Intelligence. However, as LLMs evolve, it becomes challenging to construct evaluation benchmarks with accurate labels for LLMs whose capabilities approach or even surpass those of humans (denoted as SuperLLMs). To credibly conduct evaluation without accurate labels (denoted as poor-supervised evaluation), we first prove that the consistency between the model under evaluation and the reference model, when their prediction distributions are independent and the sample size is infinite, can equivalently assess the true capabilities of the model under evaluation. However, using either humans or LLMs as the reference model cannot sufficiently meet these conditions, for which we propose the PEEM algorithm. By treating all models under evaluation as reference models, PEEM alternately optimizes model weights and filters reference models based on the EM algorithm to maximally alleviate the insufficiency of the conditions. Comprehensive experiments across 3 types of tasks with 16 mainstream LLMs validate the efficiency, universality, and effectiveness of PEEM. More generally, PEEM has advanced the evolution of the evaluation paradigm from human-centric to human&model-centric, alleviating the limitations of human capabilities for evaluating SuperLLMs.
UR - http://www.scopus.com/inward/record.url?scp=85203802009&partnerID=8YFLogxK
U2 - 10.18653/v1/2024.findings-acl.690
DO - 10.18653/v1/2024.findings-acl.690
M3 - Conference contribution
AN - SCOPUS:85203802009
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 11614
EP - 11627
BT - The 62nd Annual Meeting of the Association for Computational Linguistics
A2 - Ku, Lun-Wei
A2 - Martins, Andre
A2 - Srikumar, Vivek
PB - Association for Computational Linguistics (ACL)
T2 - Findings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
Y2 - 11 August 2024 through 16 August 2024
ER -