TOSWT: A Dataset for Tracing the Origins of Students' Writing Texts Polished by Large Language Models

  • Pinren Lu
  • Zhifeng Lin
  • Lin Zhang
  • Jiawen Liu
  • Shaojie Qu*
  • Kan Li*

* Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In recent years, generative large language models (LLMs) have developed rapidly, producing content that is nearly indistinguishable from human-written text. While this advancement has found widespread application across various fields, it has also raised significant concerns among educators regarding the authenticity of student submissions. Addressing the misuse of AI-generated text (AIGT) in the educational sector has therefore become an urgent priority. Current detection strategies primarily focus on whole documents, which does not fully satisfy practical requirements. Because students are likely to modify AI-generated content to some extent before incorporating it into their essays, fine-grained detection, particularly at the sentence level, is of paramount importance. As a result, the task of tracing text provenance has attracted increasing attention. This study proposes the task of text provenance tracing within the educational domain and constructs a corresponding dataset named TOSWT (Tracing the Origins of Students' Writing Texts). The dataset is based on argumentative essays written by students, comprises texts generated by five large language models, and contains a total of 53,328 document-level and 147,976 sentence-level data samples. The study evaluates multiple deep learning detection models on both the document-level and sentence-level data. The results indicate that the task of text provenance tracing is highly challenging, with the sentence-level task proving particularly difficult.

Original language: English
Title of host publication: 2025 13th International Conference on Information and Education Technology, ICIET 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 456-461
Number of pages: 6
ISBN (Electronic): 9798331537845
Publication status: Published - 2025
Event: 13th International Conference on Information and Education Technology, ICIET 2025 - Fukuyama, Japan
Duration: 18 Apr 2025 → 20 Apr 2025

Publication series

Name: 2025 13th International Conference on Information and Education Technology, ICIET 2025

Conference

Conference: 13th International Conference on Information and Education Technology, ICIET 2025
Country/Territory: Japan
City: Fukuyama
Period: 18/04/25 → 20/04/25

Keywords

  • detect
  • education
  • large language model
  • text provenance
  • tracing origin
