TY - JOUR
T1 - Tapas: enabling faithful data-to-text generation through task-adaptive pre-training with data alignment strategy
AU - Sun, Xin
AU - Zhang, Haoran
AU - Zhao, Shuo
N1 - Publisher Copyright:
© 2025 Elsevier B.V.
PY - 2025/10/25
Y1 - 2025/10/25
AB - Data-to-text generation is the task of converting structured data into human-readable and coherent text, with applications in fields such as automated reporting and real-time information dissemination. Despite recent progress with pre-trained language models, which have significantly improved readability and coherence, a major challenge remains: hallucination, where the generated text fails to faithfully align with the input data. These hallucinations stem primarily from two factors: limitations in the model's ability to understand the structural information of the data, and inconsistencies between the structured data and the reference texts in the training data. To address these challenges, we propose Tapas, a task-adaptive pre-training model that mitigates hallucination from both the model and the data perspectives. First, we employ task-adaptive pre-training with three effective learning objectives, which enhance the ability of pre-trained language models to learn data structure and to align structured data with reference texts. Then, during the fine-tuning phase, we incorporate a heuristic data alignment strategy to further mitigate hallucination. Experimental results show that Tapas achieves state-of-the-art BLEU-4 scores on the E2E and WebNLG datasets in fully supervised scenarios. In few-shot scenarios, notable improvements of 2.1% and 1.8% are observed on E2E and WebNLG, respectively. These results confirm Tapas' effectiveness in addressing the core causes of hallucination and improving fidelity in data-to-text generation compared to baseline models.
KW - Data alignment strategy
KW - Data-to-text generation
KW - Faithful text generation
KW - Task-adaptive pre-training
UR - https://www.scopus.com/pages/publications/105013494932
DO - 10.1016/j.knosys.2025.114240
M3 - Article
AN - SCOPUS:105013494932
SN - 0950-7051
VL - 328
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 114240
ER -