
Doctopus: Budget-aware Structural Table Extraction from Unstructured Documents

  • Beijing Institute of Technology
  • University of Arizona

Research output: Contribution to journal, Conference article, peer-reviewed

Abstract

To realize the great potential value of unstructured documents, it is critical to extract structural data (e.g., attributes) from them, which can benefit applications such as analytical SQL queries and decision-making. Multiple strategies, such as pre-trained language models (PLMs), can be employed for this task. However, these methods often struggle to achieve high-quality results, particularly when attribute extraction requires intricate reasoning or semantic comprehension. Recently, large language models (LLMs) have proven effective at extracting attributes but incur substantial costs from token consumption, making them impractical for large-scale document sets. To best trade off quality and cost, we present Doctopus, a system designed for accurate attribute extraction from unstructured documents under a user-specified cost constraint. Overall, Doctopus combines LLMs with non-LLM strategies to achieve a good tradeoff. First, the system employs an index-based approach to efficiently identify and process only relevant text chunks, thereby reducing the LLM cost. Next, it estimates the quality of multiple strategies for each attribute. Finally, based on the cost and estimated quality, Doctopus dynamically selects the optimal strategies through budget-aware optimization. We have built a comprehensive benchmark comprising 4 document sets with various characteristics and ground truth manually labeled over 1000 human hours. Extensive experiments on the benchmark show that, compared with state-of-the-art baselines, Doctopus can improve the quality by 11% given the same cost constraint.
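The final step, budget-aware selection, can be illustrated with a minimal sketch. The abstract does not specify the optimization itself, so the formulation below is an assumption made for illustration and not Doctopus's actual algorithm: exactly one strategy is chosen per attribute, costs are integer units, estimated quality is additive across attributes, and the choice is solved exactly as a multiple-choice knapsack by dynamic programming. The function name, strategy names, costs, and quality scores are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical input shape: (strategy name, integer cost, estimated quality).
Strategy = Tuple[str, int, float]


def select_strategies(
    candidates: Dict[str, List[Strategy]], budget: int
) -> Tuple[Dict[str, str], float]:
    """Pick one strategy per attribute, maximizing total estimated quality
    while keeping total cost within `budget` (multiple-choice knapsack DP)."""
    # best[c] = (total quality, plan) of the best partial plan costing exactly c.
    best: Dict[int, Tuple[float, Dict[str, str]]] = {0: (0.0, {})}
    for attr, options in candidates.items():
        nxt: Dict[int, Tuple[float, Dict[str, str]]] = {}
        for spent, (quality, plan) in best.items():
            for name, cost, est_quality in options:
                total = spent + cost
                if total > budget:
                    continue  # this choice would exceed the cost constraint
                new_quality = quality + est_quality
                if total not in nxt or new_quality > nxt[total][0]:
                    nxt[total] = (new_quality, {**plan, attr: name})
        best = nxt
    if not best:
        raise ValueError("budget too small to cover every attribute")
    quality, plan = max(best.values(), key=lambda entry: entry[0])
    return plan, quality


# Made-up numbers: an "easy" attribute and a "hard" one that needs reasoning.
example = {
    "author": [("regex", 1, 0.6), ("plm", 3, 0.8), ("llm", 10, 0.95)],
    "diagnosis": [("regex", 1, 0.2), ("plm", 3, 0.5), ("llm", 10, 0.9)],
}
plan, quality = select_strategies(example, budget=12)
print(plan, quality)
```

With these made-up numbers the budget cannot cover an LLM call for both attributes, so the sketch spends it on the attribute where the LLM adds the most estimated quality and falls back to a cheap strategy for the other.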

Original language: English
Pages (from-to): 3695-3707
Number of pages: 13
Journal: Proceedings of the VLDB Endowment
Volume: 18
Issue number: 11
Publication status: Published - 2025
Event: 51st International Conference on Very Large Data Bases, VLDB 2025 - London, United Kingdom
Duration: 1 Sep 2025 to 5 Sep 2025
