TY - GEN
T1 - Textual Grounding for Open-Vocabulary Visual Information Extraction in Layout-Diversified Documents
AU - Cheng, Mengjun
AU - Zhang, Chengquan
AU - Liu, Chang
AU - Li, Yuke
AU - Li, Bohan
AU - Yao, Kun
AU - Zheng, Xiawu
AU - Ji, Rongrong
AU - Chen, Jie
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
PY - 2025
Y1 - 2025
N2 - Current methodologies have achieved notable success in the closed-set visual information extraction (VIE) task, while exploration of open-vocabulary settings remains comparatively underdeveloped, despite being practical for individual users who need to infer information across documents of diverse types. Existing proposed solutions, including named entity recognition methods and large language model-based methods, fall short in processing the unlimited range of open-vocabulary keys and lack explicit layout modeling. This paper introduces a novel method for tackling this challenge by transforming the process of categorizing text tokens into a task of locating regions based on given queries, also called textual grounding. In particular, we take this a step further by pairing open-vocabulary key language embeddings with the corresponding grounded text visual embeddings. We design a document-tailored grounding framework that incorporates layout-aware context learning and document-tailored two-stage pre-training, which significantly improves the model’s understanding of documents. Our method outperforms current proposed solutions on the SVRD benchmark for the open-vocabulary VIE task, offering lower cost and faster inference. Specifically, our method infers 20× faster than the QwenVL model and achieves an improvement of 24.3% in the F-score metric.
AB - Current methodologies have achieved notable success in the closed-set visual information extraction (VIE) task, while exploration of open-vocabulary settings remains comparatively underdeveloped, despite being practical for individual users who need to infer information across documents of diverse types. Existing proposed solutions, including named entity recognition methods and large language model-based methods, fall short in processing the unlimited range of open-vocabulary keys and lack explicit layout modeling. This paper introduces a novel method for tackling this challenge by transforming the process of categorizing text tokens into a task of locating regions based on given queries, also called textual grounding. In particular, we take this a step further by pairing open-vocabulary key language embeddings with the corresponding grounded text visual embeddings. We design a document-tailored grounding framework that incorporates layout-aware context learning and document-tailored two-stage pre-training, which significantly improves the model’s understanding of documents. Our method outperforms current proposed solutions on the SVRD benchmark for the open-vocabulary VIE task, offering lower cost and faster inference. Specifically, our method infers 20× faster than the QwenVL model and achieves an improvement of 24.3% in the F-score metric.
KW - Open-vocabulary
KW - Textual Grounding
KW - Visual Information Extraction
UR - http://www.scopus.com/inward/record.url?scp=85210895047&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-72995-9_27
DO - 10.1007/978-3-031-72995-9_27
M3 - Conference contribution
AN - SCOPUS:85210895047
SN - 9783031729942
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 474
EP - 491
BT - Computer Vision – ECCV 2024 - 18th European Conference, Proceedings
A2 - Leonardis, Aleš
A2 - Ricci, Elisa
A2 - Roth, Stefan
A2 - Russakovsky, Olga
A2 - Sattler, Torsten
A2 - Varol, Gül
PB - Springer Science and Business Media Deutschland GmbH
T2 - 18th European Conference on Computer Vision, ECCV 2024
Y2 - 29 September 2024 through 4 October 2024
ER -