TY - JOUR
T1 - Multimodal Entity Linking With Dynamic Modality Selection and Interactive Prompt Learning
AU - Ma, Yingyao
AU - Xue, Yifan
AU - Wu, Jiasong
AU - Senhadji, Lotfi
AU - Shu, Huazhong
AU - Yang, Jian
N1 - Publisher Copyright:
© 1989-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - Recent advances in Multimodal Entity Linking leverage multimodal information to link target mentions to corresponding entities. However, existing methods uniformly adopt a “one-size-fits-all” approach, which overlooks the unique requirements of individual samples and fails to balance modality-assisted disambiguation against modality-induced noise. Moreover, the common practice of using separate large-scale visual and textual pre-trained models for feature extraction neither addresses inter-modal heterogeneity nor avoids the high computational cost of fine-tuning. To resolve these two issues, we introduce a novel approach named Multimodal Entity Linking with Dynamic Modality Selection and Interactive Prompt Learning (DSMIP). First, we design three expert networks that utilize different subsets of modalities tailored to the task and train them individually. Specifically, for the multimodal expert network, we enhance entity and mention feature extraction by updating multimodal prompts and introducing a coupling function that enables prompt interaction across modalities. Subsequently, to select the best-suited expert network for each sample, we devise a Modality Selection Gating Network that obtains the optimal one-hot selection vector via a specialized reparameterization technique and a two-stage training process. Experimental results on three public benchmark datasets demonstrate that the proposed DSMIP outperforms all state-of-the-art baselines.
AB - Recent advances in Multimodal Entity Linking leverage multimodal information to link target mentions to corresponding entities. However, existing methods uniformly adopt a “one-size-fits-all” approach, which overlooks the unique requirements of individual samples and fails to balance modality-assisted disambiguation against modality-induced noise. Moreover, the common practice of using separate large-scale visual and textual pre-trained models for feature extraction neither addresses inter-modal heterogeneity nor avoids the high computational cost of fine-tuning. To resolve these two issues, we introduce a novel approach named Multimodal Entity Linking with Dynamic Modality Selection and Interactive Prompt Learning (DSMIP). First, we design three expert networks that utilize different subsets of modalities tailored to the task and train them individually. Specifically, for the multimodal expert network, we enhance entity and mention feature extraction by updating multimodal prompts and introducing a coupling function that enables prompt interaction across modalities. Subsequently, to select the best-suited expert network for each sample, we devise a Modality Selection Gating Network that obtains the optimal one-hot selection vector via a specialized reparameterization technique and a two-stage training process. Experimental results on three public benchmark datasets demonstrate that the proposed DSMIP outperforms all state-of-the-art baselines.
KW - Multimodal entity linking
KW - knowledge graph
KW - large pre-trained model
UR - https://www.scopus.com/pages/publications/105009424568
U2 - 10.1109/TKDE.2025.3580754
DO - 10.1109/TKDE.2025.3580754
M3 - Article
AN - SCOPUS:105009424568
SN - 1041-4347
VL - 37
SP - 5467
EP - 5480
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 9
ER -