Manipulation Intention Understanding for Zero-Shot Composed Image Retrieval

  • Yuanmin Tang
  • Jing Yu*
  • Keke Gai*
  • Gang Xiong
  • Gaopeng Gou
  • Meikang Qiu
  • Qi Wu
  • *Corresponding author for this work
  • CAS - Institute of Information Engineering
  • University of Chinese Academy of Sciences
  • Minzu University of China
  • Zhongguancun Academy
  • Augusta University
  • Adelaide University

Research output: Contribution to journal, conference article, peer-reviewed

Abstract

Zero-shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with varied visual manipulation intents across domains, scenes, objects, and attributes. A key challenge is that existing datasets contain limited intent-relevant annotations, making it hard for models to infer human intent from textual modifications. We introduce an intent-centric image–text dataset generated via reasoning by a Multimodal Large Language Model (MLLM) to better train ZS-CIR models for human manipulation intent understanding. Building on this dataset, we propose De-MINDS, a framework that distills the MLLM’s reasoning ability to capture manipulation intent and enhance models’ comprehension of modified text. A simple mapping network translates image information into language space and combines it with the manipulation text to form a query. De-MINDS then extracts intention-relevant information from this query and encodes it as pseudo-word tokens for accurate ZS-CIR. Across four ZS-CIR tasks, De-MINDS shows strong generalization and improves over existing methods by 2.15% to 4.05%, establishing new state-of-the-art results with comparable inference time.
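
The abstract does not give architectural details, so the following is only a minimal sketch of the general pattern it describes: a small mapping network projects an image embedding into the token-embedding space, and the resulting pseudo-word token is placed into the embedded manipulation text to form a query. The two-layer ReLU network, the dimensions, the placeholder position, and every function name here are assumptions for illustration; the weights are random rather than trained, and the intent-extraction step of De-MINDS is not shown.

```python
import random


def make_mapping_network(in_dim, hidden_dim, out_dim, seed=0):
    """Build a toy two-layer network as (w1, w2). Weights are random, not trained."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden_dim)] for _ in range(out_dim)]
    return w1, w2


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def map_image_to_token(network, image_embedding):
    """Project an image embedding into the (assumed) token-embedding space."""
    w1, w2 = network
    hidden = [max(0.0, h) for h in matvec(w1, image_embedding)]  # ReLU
    return matvec(w2, hidden)


def build_query(text_token_embeddings, pseudo_token, placeholder_index):
    """Return a copy of the text tokens with the placeholder replaced by the pseudo-word token."""
    query = list(text_token_embeddings)
    query[placeholder_index] = pseudo_token
    return query


# Hypothetical sizes chosen only to keep the example small.
IMAGE_DIM, HIDDEN_DIM, TOKEN_DIM = 8, 16, 4

network = make_mapping_network(IMAGE_DIM, HIDDEN_DIM, TOKEN_DIM)
image_embedding = [0.1 * i for i in range(IMAGE_DIM)]
pseudo_token = map_image_to_token(network, image_embedding)

# Stand-in embeddings for a six-token manipulation text; index 3 is the placeholder.
text_tokens = [[0.0] * TOKEN_DIM for _ in range(6)]
query = build_query(text_tokens, pseudo_token, placeholder_index=3)
```

In a real system the query would then be passed through a text encoder and compared against candidate image embeddings; that retrieval step is omitted here.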

Original language: English
Pages (from-to): 9466-9474
Number of pages: 9
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Volume: 40
Issue number: 11
DOI
Publication status: Published - 2026
Event: 40th AAAI Conference on Artificial Intelligence, AAAI 2026 - Singapore, Singapore
Duration: 20 Jan 2026 to 27 Jan 2026
