A Hierarchical Context Augmentation Method to Improve Retrieval-Augmented LLMs on Scientific Papers

Tian Yi Che, Xian Ling Mao*, Tian Lan, Heyan Huang

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

Large-scale collections of scientific papers on the Internet encompass a wealth of data and knowledge, attracting the attention of numerous researchers. To fully utilize this knowledge, Retrieval-Augmented Large Language Models (LLMs) are usually trained on large-scale scientific corpora and then retrieve relevant passages from external memory to improve generation, which has demonstrated outstanding performance. However, existing methods can only capture one-dimensional, fragmented textual information without incorporating hierarchical structural knowledge, e.g., the deductive relationship between an abstract and the main body, which makes it difficult to grasp the central ideas of papers. To tackle this problem, we propose a hierarchical context augmentation method, which helps Retrieval-Augmented LLMs autoregressively learn the structural knowledge of scientific papers. Specifically, we utilize a document tree to represent the hierarchical structure of a paper and enhance the structural information of scientific context from three aspects: scale, format, and global information. First, we treat each top-to-bottom path of the document tree as a logically independent context, which greatly increases the scale of the extracted structural corpus. Second, we propose a novel label-based format to represent the structure of context in textual sequences, unified between training and inference. Third, we introduce the global information of retrieved passages to further enhance the structure of context. Extensive experiments on three scientific tasks show that the proposed method significantly improves the performance of Retrieval-Augmented LLMs on all tasks. Moreover, our method achieves state-of-the-art performance on the Question Answering task and outperforms ChatGPT. Furthermore, it also brings considerable gains even with irrelevant retrieved passages, illustrating its effectiveness in practical application scenarios.
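
As a rough illustration only, and not the authors' implementation, the Python sketch below shows one way the abstract's ideas could be realized: a document tree of labeled nodes, enumeration of every top-to-bottom path as a logically independent context, and a label-based serialization of each path shared between training and inference. All class, label, and function names here are hypothetical.

from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class DocNode:
    # A node of a paper's document tree: a structural role plus its text.
    label: str                      # e.g. "TITLE", "ABSTRACT", "SECTION", "PARAGRAPH"
    text: str
    children: List["DocNode"] = field(default_factory=list)


def root_to_leaf_paths(node: DocNode,
                       prefix: Optional[List[DocNode]] = None) -> Iterator[List[DocNode]]:
    # Each top-to-bottom path of the tree is treated as one logically
    # independent context, which multiplies the amount of structural data.
    prefix = (prefix or []) + [node]
    if not node.children:
        yield prefix
    for child in node.children:
        yield from root_to_leaf_paths(child, prefix)


def serialize_with_labels(path: List[DocNode]) -> str:
    # Label-based textual format: the same representation can be used at
    # training time and at inference time.
    return " ".join(f"[{n.label}] {n.text}" for n in path)


# Toy paper with an abstract and one section holding two paragraphs.
paper = DocNode("TITLE", "A Hierarchical Context Augmentation Method", [
    DocNode("ABSTRACT", "Scientific papers encompass a wealth of knowledge ..."),
    DocNode("SECTION", "Method", [
        DocNode("PARAGRAPH", "We represent each paper as a document tree ..."),
        DocNode("PARAGRAPH", "Each top-to-bottom path forms an independent context ..."),
    ]),
])

for path in root_to_leaf_paths(paper):
    print(serialize_with_labels(path))

Running this toy example prints one labeled sequence per root-to-leaf path, e.g. "[TITLE] ... [SECTION] Method [PARAGRAPH] ...", which is the kind of structured context the abstract describes feeding to a Retrieval-Augmented LLM.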

Original language: English
Title of host publication: KDD 2024 - Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Publisher: Association for Computing Machinery
Pages: 243-254
Number of pages: 12
ISBN (Electronic): 9798400704901
DOIs
Publication status: Published - 24 Aug 2024
Event: 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2024 - Barcelona, Spain
Duration: 25 Aug 2024 - 29 Aug 2024

Publication series

Name: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
ISSN (Print): 2154-817X

Conference

Conference: 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2024
Country/Territory: Spain
City: Barcelona
Period: 25/08/24 - 29/08/24

Keywords

  • context augmentation
  • retrieval-augmented llms
  • scientific papers
  • structure information
