
II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models

  • Ziqiang Liu
  • Feiteng Fang
  • Xi Feng
  • Xinrun Du
  • Chenhao Zhang
  • Zekun Wang
  • Yuelin Bai
  • Qixuan Zhao
  • Liyang Fan
  • Chengguang Gan
  • Hongquan Lin
  • Jiaming Li
  • Yuansheng Ni
  • Haihong Wu
  • Yaswanth Narsupalli
  • Zhigang Zheng
  • Chengming Li
  • Xiping Hu
  • Ruifeng Xu
  • Xiaojun Chen
  • Min Yang
  • Jiaheng Liu
  • Ruibo Liu
  • Wenhao Huang
  • Ge Zhang*
  • Shiwen Ni*
*Corresponding authors of this work
  • Shenzhen Institute of Advanced Technology
  • University of Chinese Academy of Sciences
  • University of Science and Technology of China
  • M-A-P
  • 01.AI
  • Huazhong University of Science and Technology
  • Beihang University
  • Yokohama National University
  • Zhejiang University
  • Indian Institute of Technology Kharagpur
  • Shenzhen MSU-BIT University
  • Harbin Institute of Technology Shenzhen
  • Shenzhen University
  • Dartmouth College
  • University of Waterloo

Research output: Contribution to journal › Conference article › Peer-reviewed

Abstract

The rapid advancement of multimodal large language models (MLLMs) has consistently led to new breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to assess the capabilities of MLLMs more accurately. However, the higher-order perceptual capabilities of MLLMs remain largely unexplored. To fill this gap, we propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate a model's higher-order perception of images. Extensive experiments on II-Bench across multiple MLLMs yield several significant findings. First, there is a substantial gap between MLLM and human performance on II-Bench: the best MLLM reaches 74.8% accuracy, whereas human accuracy averages 90% and peaks at an impressive 98%. Second, MLLMs perform worse on abstract and complex images, suggesting limitations in their ability to understand high-level semantics and capture image details. Finally, most models show improved accuracy when image sentiment polarity hints are incorporated into the prompts, underscoring a notable deficiency in their inherent understanding of image sentiment. We believe that II-Bench will inspire the community to develop the next generation of MLLMs, advancing the journey towards expert artificial general intelligence (AGI). II-Bench is publicly available at https://huggingface.co/datasets/m-a-p/II-Bench.
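The sentiment-polarity finding suggests a simple prompting tweak at evaluation time: prepend an explicit sentiment hint to each question. The sketch below illustrates one way this could look; the helper name, hint wording, and polarity labels are assumptions for illustration, not the paper's exact prompt format.

```python
def add_sentiment_hint(question: str, polarity: str) -> str:
    """Prepend an image-sentiment hint to an II-Bench-style question.

    Note: the polarity labels and hint phrasing here are hypothetical;
    the paper only reports that such hints improve most models' accuracy.
    """
    allowed = {"positive", "negative", "neutral"}
    if polarity not in allowed:
        raise ValueError(f"unknown polarity: {polarity!r}")
    hint = f"Hint: the overall sentiment of this image is {polarity}."
    return f"{hint}\n{question}"


# Example: augment a question with a negative-sentiment hint.
print(add_sentiment_hint("What does this image imply?", "negative"))
```

Comparing a model's accuracy with and without this augmentation is one way to probe how much of its implication understanding depends on being told the sentiment rather than inferring it.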

Original language: English
Journal: Advances in Neural Information Processing Systems 37
Publication status: Published - 2024
Published externally: Yes
Event: 38th Conference on Neural Information Processing Systems, NeurIPS 2024 - Vancouver, Canada
Duration: 9 Dec 2024 - 15 Dec 2024
