TY - GEN
T1 - Japanese Author Attribution Using BERT Finetuning with Stylometric Features
AU - Konuma, Risa
AU - Zhang, Huaping
AU - Gao, Chunxiao
AU - Wang, Juan
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
AB - This study investigates author attribution (AA) in Japanese texts by fine-tuning the pre-trained BERT model “cl-tohoku/bert-large-japanese-v2” with Japanese-specific stylometric features. Experiments combined these features with classifiers including logistic regression (LR), support vector machines (SVM), and random forests (RF), across author counts ranging from 5 to 75. To maintain linguistic consistency, the study used only native Japanese compositions from the “Composition Bilingual Database” of the National Institute for Japanese Language and Linguistics. The BERT model combined with LR achieved the highest accuracy, 96.3% for 5 authors, demonstrating deep learning’s potential for Japanese AA. However, the high-dimensional stylometric features introduced noise when integrated, highlighting challenges in feature alignment. Future work will explore advanced non-linear models such as XGBoost, LightGBM, and CatBoost for improved feature integration, as well as low-resource classification methods such as prototypical networks to enhance performance without extensive dataset expansion. Further testing of alternative Japanese pre-trained language models will also be conducted to capture linguistic nuances more effectively.
KW - BERT
KW - Japanese Author Attribution
KW - Stylometric features
UR - https://www.scopus.com/pages/publications/105008368795
DO - 10.1007/978-981-96-5123-8_20
M3 - Conference contribution
AN - SCOPUS:105008368795
SN - 9789819651221
T3 - Communications in Computer and Information Science
SP - 293
EP - 307
BT - Intelligent Multilingual Information Processing - 1st International Conference, IMLIP 2024, Proceedings
A2 - Zhang, Huaping
A2 - Shang, Jianyun
A2 - Su, Jinsong
PB - Springer Science and Business Media Deutschland GmbH
T2 - 1st International Conference on Intelligent Multilingual Information Processing, IMLIP 2024
Y2 - 16 November 2024 through 17 November 2024
ER -