TY - GEN
T1 - Visual-Semantic Decomposition and Partial Alignment for Document-based Zero-Shot Learning
AU - Qu, Xiangyan
AU - Yu, Jing
AU - Gai, Keke
AU - Zhuang, Jiamin
AU - Tang, Yuanmin
AU - Xiong, Gang
AU - Gou, Gaopeng
AU - Wu, Qi
N1 - Publisher Copyright:
© 2024 Owner/Author.
PY - 2024/10/28
Y1 - 2024/10/28
AB - Recent work shows that documents from encyclopedias serve as helpful auxiliary information for zero-shot learning. Existing methods align the entire semantics of a document with corresponding images to transfer knowledge. However, they disregard that the semantic information of the two modalities is not equivalent, resulting in suboptimal alignment. In this work, we propose a novel network that extracts multi-view semantic concepts from documents and images and aligns only the matching concepts rather than all of them. Specifically, we propose a semantic decomposition module to generate multi-view semantic embeddings from the visual and textual sides, providing the basic concepts for partial alignment. To alleviate information redundancy among embeddings, we propose a local-to-semantic variance loss to capture distinct local details and a multiple semantic diversity loss to enforce orthogonality among embeddings. Subsequently, two losses are introduced to partially align visual-semantic embedding pairs according to their semantic relevance at the view and word-to-patch levels. Consequently, our method consistently outperforms state-of-the-art methods with two document sources on three standard benchmarks for document-based zero-shot learning. Qualitatively, we show that our model learns interpretable partial associations. Code is available at https://github.com/MorningStarOvO/EmDepart.
KW - document-based zero-shot learning
KW - partial semantic alignment
KW - visual-semantic decomposition
UR - https://www.scopus.com/pages/publications/85209788009
U2 - 10.1145/3664647.3680829
DO - 10.1145/3664647.3680829
M3 - Conference contribution
AN - SCOPUS:85209788009
T3 - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
SP - 4581
EP - 4590
BT - MM 2024 - Proceedings of the 32nd ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 32nd ACM International Conference on Multimedia, MM 2024
Y2 - 28 October 2024 through 1 November 2024
ER -