TY - JOUR
T1 - Enhancing discriminative ability in multimodal LLMs
T2 - A contrastive learning approach for CT report generation
AU - Su, Qingyong
AU - Feng, Chong
AU - Shi, Ge
AU - Wang, Bo
AU - Zhuang, Yan
N1 - Publisher Copyright:
© 2025 Elsevier B.V.
PY - 2025/11
Y1 - 2025/11
N2 - Automated CT report generation (CTRG) systems hold significant promise for enhancing clinical workflows. However, current approaches, including those leveraging advanced multimodal large language models (MLLMs), continue to face persistent challenges in ensuring the quality and reliability of generated reports. In this study, a comprehensive analysis of representation dynamics within MLLM-based CTRG models reveals two primary limitations: the entanglement of reports of varying quality in the representation space, and clinical detail blindness, which stems from traditional training paradigms that focus primarily on ground-truth reports. To address these limitations, we propose a novel contrastive learning framework with three main contributions: (1) a systematic method for generating clinically relevant hard negative reports using GPT-4, which introduces realistic but subtle clinical errors while maintaining report structure and plausibility; (2) a contrastive learning approach that leverages reports of varying quality to effectively disentangle quality representations and enhance the model's sensitivity to clinical details; and (3) a hard negative mining strategy designed to tackle false negatives and optimize the sampling weights of negatives with varying degrees of semantic effectiveness. Extensive experiments on the CTRG-Chest-548K and CTRG-Brain-263K datasets demonstrate significant improvements in natural language generation (NLG) performance, including a 14% increase in BLEU-1 and 17% improvements in both BLEU-4 and ROUGE-L scores on the CTRG-Chest-548K dataset, compared to current state-of-the-art methods.
AB - Automated CT report generation (CTRG) systems hold significant promise for enhancing clinical workflows. However, current approaches, including those leveraging advanced multimodal large language models (MLLMs), continue to face persistent challenges in ensuring the quality and reliability of generated reports. In this study, a comprehensive analysis of representation dynamics within MLLM-based CTRG models reveals two primary limitations: the entanglement of reports of varying quality in the representation space, and clinical detail blindness, which stems from traditional training paradigms that focus primarily on ground-truth reports. To address these limitations, we propose a novel contrastive learning framework with three main contributions: (1) a systematic method for generating clinically relevant hard negative reports using GPT-4, which introduces realistic but subtle clinical errors while maintaining report structure and plausibility; (2) a contrastive learning approach that leverages reports of varying quality to effectively disentangle quality representations and enhance the model's sensitivity to clinical details; and (3) a hard negative mining strategy designed to tackle false negatives and optimize the sampling weights of negatives with varying degrees of semantic effectiveness. Extensive experiments on the CTRG-Chest-548K and CTRG-Brain-263K datasets demonstrate significant improvements in natural language generation (NLG) performance, including a 14% increase in BLEU-1 and 17% improvements in both BLEU-4 and ROUGE-L scores on the CTRG-Chest-548K dataset, compared to current state-of-the-art methods.
KW - Contrastive learning
KW - CT report generation
KW - Multimodal LLMs
KW - Representation learning
UR - http://www.scopus.com/inward/record.url?scp=105004742111&partnerID=8YFLogxK
U2 - 10.1016/j.inffus.2025.103240
DO - 10.1016/j.inffus.2025.103240
M3 - Article
AN - SCOPUS:105004742111
SN - 1566-2535
VL - 123
JO - Information Fusion
JF - Information Fusion
M1 - 103240
ER -