TY - JOUR
T1 - Manipulation Intention Understanding for Zero-Shot Composed Image Retrieval
AU - Tang, Yuanmin
AU - Yu, Jing
AU - Gai, Keke
AU - Xiong, Gang
AU - Gou, Gaopeng
AU - Qiu, Meikang
AU - Wu, Qi
N1 - Publisher Copyright: © 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2026
Y1 - 2026
N2 - Zero-shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with varied visual manipulation intents across domains, scenes, objects, and attributes. A key challenge is that existing datasets contain limited intent-relevant annotations, making it hard for models to infer human intent from textual modifications. We introduce an intent-centric image–text dataset generated via reasoning by a Multimodal Large Language Model (MLLM) to better train ZS-CIR models for human manipulation intent understanding. Building on this dataset, we propose De-MINDS, a framework that distills the MLLM’s reasoning ability to capture manipulation intent and enhance models’ comprehension of modified text. A simple mapping network translates image information into language space and combines it with the manipulation text to form a query. De-MINDS then extracts intention-relevant information from this query and encodes it as pseudo-word tokens for accurate ZS-CIR. Across four ZS-CIR tasks, De-MINDS shows strong generalization and improves over existing methods by 2.15% to 4.05%, establishing new state-of-the-art results with comparable inference time.
AB - Zero-shot Composed Image Retrieval (ZS-CIR) involves diverse tasks with varied visual manipulation intents across domains, scenes, objects, and attributes. A key challenge is that existing datasets contain limited intent-relevant annotations, making it hard for models to infer human intent from textual modifications. We introduce an intent-centric image–text dataset generated via reasoning by a Multimodal Large Language Model (MLLM) to better train ZS-CIR models for human manipulation intent understanding. Building on this dataset, we propose De-MINDS, a framework that distills the MLLM’s reasoning ability to capture manipulation intent and enhance models’ comprehension of modified text. A simple mapping network translates image information into language space and combines it with the manipulation text to form a query. De-MINDS then extracts intention-relevant information from this query and encodes it as pseudo-word tokens for accurate ZS-CIR. Across four ZS-CIR tasks, De-MINDS shows strong generalization and improves over existing methods by 2.15% to 4.05%, establishing new state-of-the-art results with comparable inference time.
UR - https://www.scopus.com/pages/publications/105034579527
U2 - 10.1609/aaai.v40i11.37907
DO - 10.1609/aaai.v40i11.37907
M3 - Conference article
AN - SCOPUS:105034579527
SN - 2159-5399
VL - 40
SP - 9466
EP - 9474
JO - Proceedings of the AAAI Conference on Artificial Intelligence
JF - Proceedings of the AAAI Conference on Artificial Intelligence
IS - 11
T2 - 40th AAAI Conference on Artificial Intelligence, AAAI 2026
Y2 - 20 January 2026 through 27 January 2026
ER -