MODE+: A benchmark and a probe into multimodal open-domain dialogue evaluation

  • Hang Yin
  • Xinglin Wang
  • Yueqi Zhang
  • Pinren Lu
  • Bin Sun
  • Peiwen Yuan
  • Kan Li*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Multimodal Open-domain Dialogue (MOD) plays a crucial role in AI-human interaction and has garnered substantial interest. Although existing studies have explored various aspects of MOD, its evaluation remains underexplored. In this work, we propose MODE+, an evaluation benchmark for MOD and a probe into multimodal open-domain dialogue evaluation. Specifically, we construct MODE with a balanced difficulty distribution and divide it into three parts. MODE-Base and MODE-Hard both consist of single-turn dialogues, with MODE-Base containing 889 test cases and MODE-Hard comprising 215 more challenging cases designed to probe model robustness against multimodal inconsistencies. Additionally, we include MODE-Multi, which contains over 10,000 multi-turn dialogue cases for more extensive testing. Each case contains an image, a context, and turn-level response scores provided by at least three human annotators following standardized criteria. The human annotations have an average inter-annotator Spearman correlation above 0.9, indicating that MODE's annotations are highly reliable. We test the MOD evaluation capabilities of various evaluators on MODE, including LLaMA, Claude 3, GPT-4, LLaVA, Gemini and Qwen3-VL. Results show that even the best-performing model-based evaluators have surprisingly low agreement with human evaluations, with consistency scores below 0.7 on MODE-Base and below 0.4 on MODE-Hard. To improve model-based MOD evaluation, we propose the MM-Eval framework, a systematic methodology designed to standardize automatic evaluation. MM-Eval introduces Image Transformation as a modality-bridging mechanism, Inference Enhancement for transparent reasoning, and Inference Calibration for statistical reliability. Compared to the baselines, MM-Eval achieves a 67.41% improvement on MODE-Base and a 297% improvement on MODE-Hard. Furthermore, MM-Eval also yields significant improvements on MODE-Multi, demonstrating that the framework can handle larger and more complex datasets. These results establish MM-Eval as a transferable and robust standard for future MOD evaluation.
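For intuition, the inter-annotator reliability check described in the abstract can be illustrated with a minimal average pairwise Spearman-correlation sketch. This is not the authors' code; the annotator names and toy scores below are hypothetical and not drawn from MODE.

# Minimal sketch (illustrative only): average pairwise Spearman
# correlation between annotators, the kind of consistency statistic
# the abstract reports as exceeding 0.9 for MODE's human annotations.
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr

# Turn-level response scores from three hypothetical annotators
# over the same set of dialogue responses (toy values).
annotator_scores = {
    "annotator_1": [4, 3, 5, 2, 4, 1, 3],
    "annotator_2": [5, 3, 4, 2, 4, 1, 2],
    "annotator_3": [4, 2, 5, 1, 4, 2, 3],
}

# Spearman correlation for every pair of annotators, then the mean.
pairwise = [
    spearmanr(annotator_scores[a], annotator_scores[b]).correlation
    for a, b in combinations(annotator_scores, 2)
]
print(f"average inter-annotator Spearman: {np.mean(pairwise):.3f}")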

Original language: English
Article number: 132787
Journal: Neurocomputing
Volume: 675
DOIs
Publication status: Published - 28 Apr 2026
Externally published: Yes

Keywords

  • AIGC
  • Evaluation benchmark
  • Evaluation method
  • Multimodal open-domain dialogue
