TY - GEN
T1 - Orthogonal Vector-Decomposed Disentanglement Network of Interactive Image Retrieval for Fashion Outfit Recommendation
AU - Chen, Chen
AU - Guo, Jie
AU - Song, Bin
AU - Zhang, Tong
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/10/14
Y1 - 2022/10/14
N2 - Interactive image retrieval for fashion outfit recommendation is a challenging task that aims to search for a desired target image according to a multi-modal query (a reference image and a modification text). Previous studies focus on exploring effective feature-composition methods to achieve similarity matching between different modalities. However, feature redundancy and semantic inconsistency between modalities introduce much task-irrelevant information. This makes it intractable to correctly identify the particular information to be modified and inevitably introduces noise disturbances that lead to suboptimal performance. To this end, we present a novel Orthogonal Vector-Decomposed Disentanglement Network (OVDDN) for image retrieval, which leverages the disentangled parts to learn a controllable denoising embedding space. First, we design an orthogonal disentanglement module. It is applied to both image and text features to decouple them into two independent components (invariant and specific) through orthogonal constraints, while a similarity metric loss ensures the semantic consistency of paired images. Then, an attention network composes the invariant part of the reference image with the task-related part of the text to match the target image. Finally, a differential feature alignment module maintains cross-modal semantic consistency. Extensive experiments on three benchmark datasets show that OVDDN achieves consistently superior performance. Ablation analyses further verify the effectiveness of our proposed model.
AB - Interactive image retrieval for fashion outfit recommendation is a challenging task that aims to search for a desired target image according to a multi-modal query (a reference image and a modification text). Previous studies focus on exploring effective feature-composition methods to achieve similarity matching between different modalities. However, feature redundancy and semantic inconsistency between modalities introduce much task-irrelevant information. This makes it intractable to correctly identify the particular information to be modified and inevitably introduces noise disturbances that lead to suboptimal performance. To this end, we present a novel Orthogonal Vector-Decomposed Disentanglement Network (OVDDN) for image retrieval, which leverages the disentangled parts to learn a controllable denoising embedding space. First, we design an orthogonal disentanglement module. It is applied to both image and text features to decouple them into two independent components (invariant and specific) through orthogonal constraints, while a similarity metric loss ensures the semantic consistency of paired images. Then, an attention network composes the invariant part of the reference image with the task-related part of the text to match the target image. Finally, a differential feature alignment module maintains cross-modal semantic consistency. Extensive experiments on three benchmark datasets show that OVDDN achieves consistently superior performance. Ablation analyses further verify the effectiveness of our proposed model.
KW - disentanglement learning
KW - feature fusion
KW - image retrieval
UR - http://www.scopus.com/inward/record.url?scp=85141087359&partnerID=8YFLogxK
U2 - 10.1145/3552468.3555362
DO - 10.1145/3552468.3555362
M3 - Conference contribution
AN - SCOPUS:85141087359
T3 - MCFR 2022 - Proceedings of the 1st Workshop on Multimedia Computing towards Fashion Recommendation
SP - 21
EP - 29
BT - MCFR 2022 - Proceedings of the 1st Workshop on Multimedia Computing towards Fashion Recommendation
PB - Association for Computing Machinery, Inc
T2 - 1st Workshop on Multimedia Computing towards Fashion Recommendation, MCFR 2022
Y2 - 14 October 2022
ER -