TY - GEN
T1 - ESTJ
T2 - 33rd ACM International Conference on Multimedia, MM 2025
AU - Yang, Shu Xun
AU - Mao, Xian Ling
AU - Huang, Heyan
N1 - Publisher Copyright: © 2025 ACM.
PY - 2025/10/27
Y1 - 2025/10/27
N2 - Hybrid-modal table understanding (HMTU), which targets leveraging multi-modal table evidence for multi-hop reasoning, has garnered widespread attention. Existing models primarily focus on effectively integrating multi-modal table evidence to enhance the table understanding capabilities of multi-modal large language models (MLLMs). However, these models ignore the fact that different types of table understanding questions lean toward different modalities of table evidence. Consequently, these models suffer from low utilization efficiency and poor interpretability. To address these issues, in this paper, we propose a modality preference alignment model, called ESTJ, which Enhances Structured Tendency Judgment in HMTU. Specifically, ESTJ first samples modality preference data from the responses generated by MLLMs. Then, it alleviates modality preference imbalance by adhering to the principle of least modality priority. Finally, ESTJ performs direct preference optimization (DPO) training based on structured tendency judgment to align modality preference effectively. Experimental results on TableQA and TableFV tasks demonstrate that our proposed model outperforms state-of-the-art baselines. Additionally, these results present fascinating phenomena and unveil profound insights into modality preference for table understanding.
KW - direct preference optimization
KW - hybrid-modal table understanding
KW - least modality priority
KW - modality preference alignment
KW - multi-modal large language models
KW - structured tendency judgment
UR - https://www.scopus.com/pages/publications/105024065497
U2 - 10.1145/3746027.3754796
DO - 10.1145/3746027.3754796
M3 - Conference contribution
AN - SCOPUS:105024065497
T3 - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
SP - 2399
EP - 2408
BT - MM 2025 - Proceedings of the 33rd ACM International Conference on Multimedia, Co-Located with MM 2025
PB - Association for Computing Machinery, Inc
Y2 - 27 October 2025 through 31 October 2025
ER -