
Japanese Author Attribution Using BERT Finetuning with Stylometric Features

  • Risa Konuma
  • Zhang Huaping*
  • Chunxiao Gao
  • Juan Wang
  • *Corresponding author for this work
  • Beijing Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

This study investigates author attribution (AA) in Japanese texts through fine-tuning the pre-trained BERT model “cl-tohoku/bert-large-japanese-v2” with Japanese-specific stylometric features. Experiments explored combinations of these features with classifiers such as LR, SVM, and RF, across varying author counts from 5 to 75. Focusing solely on native Japanese compositions, the study utilized the “Composition Bilingual Database” from the National Institute for Japanese Language and Linguistics to maintain linguistic consistency. The BERT model combined with LR achieved the highest accuracy of 96.3% for 5 authors, demonstrating deep learning’s potential in Japanese AA. However, high-dimensional stylistic features introduced noise when integrated, highlighting challenges in feature alignment. Future work will explore advanced non-linear models like XGBoost, LightGBM, and CatBoost for improved feature integration, and low-resource classification methods such as prototypical networks to enhance performance without extensive dataset expansion. Additionally, further testing of alternative Japanese pre-trained language models will be conducted to capture linguistic nuances more effectively.
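As a rough illustration of what "Japanese-specific stylometric features" might look like, the sketch below computes script-type ratios (hiragana, katakana, kanji, punctuation) for a text and appends them to an embedding vector. The abstract does not list the features used or say how they were combined with BERT, so the feature set, the concatenation step, and the names `char_class`, `stylometric_vector`, and `combine` are all assumptions made for illustration, not the authors' method.

```python
from __future__ import annotations

from collections import Counter

# Hypothetical feature set: share of each script type among non-space characters.
FEATURES = ("hiragana", "katakana", "kanji", "punct", "other")


def char_class(ch: str) -> str:
    """Classify one character by Unicode block (simplified ranges)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:   # Hiragana block
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:   # Katakana block (includes the long-vowel mark)
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:   # CJK Unified Ideographs (common kanji)
        return "kanji"
    if ch in "、。":                # Japanese comma and full stop only
        return "punct"
    return "other"


def stylometric_vector(text: str) -> list[float]:
    """Return the ratio of each character class, in FEATURES order."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return [0.0] * len(FEATURES)
    counts = Counter(char_class(c) for c in chars)
    return [counts[name] / len(chars) for name in FEATURES]


def combine(embedding: list[float], style: list[float]) -> list[float]:
    """Assumed integration step: plain concatenation of the two vectors."""
    return list(embedding) + list(style)


if __name__ == "__main__":
    vec = stylometric_vector("私はペンを持つ。")
    print(dict(zip(FEATURES, vec)))
```

In a full pipeline, the combined vector would be passed to a classifier such as LR, SVM, or RF. The abstract's observation that high-dimensional stylistic features introduced noise suggests that scaling or feature selection before this step would matter.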

Original language: English
Title of host publication: Intelligent Multilingual Information Processing - 1st International Conference, IMLIP 2024, Proceedings
Editors: Huaping Zhang, Jianyun Shang, Jinsong Su
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 293-307
Number of pages: 15
ISBN (Print): 9789819651221
Publication status: Published - 2025
Externally published: Yes
Event: 1st International Conference on Intelligent Multilingual Information Processing, IMLIP 2024 - Beijing, China
Duration: 16 Nov 2024 to 17 Nov 2024

Publication series

Name: Communications in Computer and Information Science
Volume: 2395 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 1st International Conference on Intelligent Multilingual Information Processing, IMLIP 2024
Country/Territory: China
City: Beijing
Period: 16/11/24 to 17/11/24

Keywords

  • BERT
  • Japanese Author Attribution
  • Stylometric features
