Authorship identification from unstructured texts

Chunxia Zhang; Xindong Wu; Zhendong Niu; Wei Ding

doi:10.1016/j.knosys.2014.04.025

Chunxia Zhang^*, Xindong Wu, Zhendong Niu, Wei Ding

^*Corresponding author for this work

School of Computer Science and Technology

Research output: Contribution to journal › Article › peer-review

64 Citations (Scopus)

Abstract

Authorship identification is the task of identifying the authors of anonymous texts given examples of candidate authors' writing. The growing volume of anonymous text on the Internet makes authorship identification increasingly urgent. It has been applied in a widening range of practical areas, including literary works, intelligence, criminal law, civil law, and computer forensics. In this paper, we propose a semantic association model of voice, word dependency relations, and non-subject stylistic words to represent the writing style of unstructured texts by various authors, design an unsupervised approach to extract stylistic features, and employ principal components analysis and linear discriminant analysis to identify the authorship of texts. The paper provides a uniform, quantified method for capturing syntactic and semantic stylistic characteristics of and between words and phrases, and the approach can, to some extent, solve the problem of the independence of different dimensions. Experimental results on two English text corpora show that our approach significantly improves overall authorship identification performance.
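
The classification stage named in the abstract (principal components analysis followed by linear discriminant analysis) can be illustrated with a minimal sketch. It does not reproduce the paper's semantic association model or its unsupervised feature extraction: the four-dimensional feature vectors below are made-up toy values standing in for stylistic features, all function names are hypothetical, PCA is done by simple power iteration, and a two-class Fisher discriminant stands in for the general LDA step.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def column_means(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def pca_fit(rows, k, iters=500):
    """Return (mean, components): top-k principal axes via power iteration."""
    mu = column_means(rows)
    centered = [[x - m for x, m in zip(r, mu)] for r in rows]
    d = len(mu)
    cov = [[sum(r[i] * r[j] for r in centered) / (len(rows) - 1)
            for j in range(d)] for i in range(d)]
    components = []
    for _ in range(k):
        v = [float(i + 1) for i in range(d)]  # arbitrary non-zero start vector
        for _ in range(iters):
            w = [dot(row, v) for row in cov]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in cov])
        components.append(v)
        # Deflate so the next pass converges to the next-largest axis.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return mu, components

def pca_transform(row, mu, components):
    centered = [x - m for x, m in zip(row, mu)]
    return [dot(c, centered) for c in components]

def fisher_lda_fit(points, labels):
    """Two-class Fisher discriminant on 2-D points: returns (w, threshold)."""
    groups = {0: [], 1: []}
    for p, y in zip(points, labels):
        groups[y].append(p)
    m0, m1 = column_means(groups[0]), column_means(groups[1])
    # Pooled within-class scatter matrix (2x2).
    s00 = s01 = s11 = 0.0
    for pts, m in ((groups[0], m0), (groups[1], m1)):
        for p in pts:
            dx, dy = p[0] - m[0], p[1] - m[1]
            s00 += dx * dx
            s01 += dx * dy
            s11 += dy * dy
    det = s00 * s11 - s01 * s01
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # w = S_w^{-1} (m1 - m0), using the closed-form 2x2 inverse.
    w = [(s11 * diff[0] - s01 * diff[1]) / det,
         (-s01 * diff[0] + s00 * diff[1]) / det]
    threshold = dot(w, [(a + b) / 2 for a, b in zip(m0, m1)])
    return w, threshold

def predict(row, mu, components, w, threshold):
    z = pca_transform(row, mu, components)
    return 1 if dot(w, z) > threshold else 0

# Toy stylistic feature vectors (assumed values): four texts per author.
TEXT_FEATURES = [
    [0.10, 0.30, 0.20, 0.05], [0.12, 0.28, 0.22, 0.06],
    [0.09, 0.33, 0.19, 0.04], [0.11, 0.31, 0.21, 0.07],  # author 0
    [0.30, 0.10, 0.25, 0.15], [0.28, 0.12, 0.24, 0.17],
    [0.32, 0.09, 0.27, 0.14], [0.31, 0.11, 0.23, 0.16],  # author 1
]
AUTHORS = [0, 0, 0, 0, 1, 1, 1, 1]

mu, components = pca_fit(TEXT_FEATURES, k=2)
projected = [pca_transform(r, mu, components) for r in TEXT_FEATURES]
w, threshold = fisher_lda_fit(projected, AUTHORS)
```

Calling `predict(features, mu, components, w, threshold)` on a new feature vector returns the index of the more likely author. With more than two candidate authors, a multi-class LDA would be needed in place of the single Fisher direction used here.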

Original language	English
Pages (from-to)	99-111
Number of pages	13
Journal	Knowledge-Based Systems
Volume	66
DOIs	https://doi.org/10.1016/j.knosys.2014.04.025
Publication status	Published - Aug 2014

Keywords

Authorship identification
Feature extraction
Linear discriminant analysis
Principal components analysis
Semantic association model

Cite this

Zhang, C., Wu, X., Niu, Z., & Ding, W. (2014). Authorship identification from unstructured texts. Knowledge-Based Systems, 66, 99-111. https://doi.org/10.1016/j.knosys.2014.04.025

@article{31defdfe49724b56850daa3eb1b5f571,

title = "Authorship identification from unstructured texts",

abstract = "Authorship identification is a task of identifying authors of anonymous texts given examples of the writing of authors. The increasingly large volumes of anonymous texts on the Internet enhance the great yet urgent necessity for authorship identification. It has been applied to more and more practical applications including literary works, intelligence, criminal law, civil law, and computer forensics. In this paper, we propose a semantic association model about voice, word dependency relations, and non-subject stylistic words to represent the writing style of unstructured texts of various authors, design an unsupervised approach to extract stylistic features, and employ principal components analysis and linear discriminant analysis to identify authorship of texts. This paper provides a uniform quantified method to capture syntactic and semantic stylistic characteristics of and between words and phrases, and this approach can solve the problem of the independence of different dimensions to some extent. Experimental results on two English text corpora show that our approach significantly improves the overall performance over authorship identification.",

keywords = "Authorship identification, Feature extraction, Linear discriminant analysis, Principal components analysis, Semantic association model",

author = "Chunxia Zhang and Xindong Wu and Zhendong Niu and Wei Ding",

year = "2014",

month = aug,

doi = "10.1016/j.knosys.2014.04.025",

language = "English",

volume = "66",

pages = "99--111",

journal = "Knowledge-Based Systems",

issn = "0950-7051",

publisher = "Elsevier B.V.",

}

TY - JOUR

T1 - Authorship identification from unstructured texts

AU - Zhang, Chunxia

AU - Wu, Xindong

AU - Niu, Zhendong

AU - Ding, Wei

PY - 2014/8

Y1 - 2014/8

N2 - Authorship identification is a task of identifying authors of anonymous texts given examples of the writing of authors. The increasingly large volumes of anonymous texts on the Internet enhance the great yet urgent necessity for authorship identification. It has been applied to more and more practical applications including literary works, intelligence, criminal law, civil law, and computer forensics. In this paper, we propose a semantic association model about voice, word dependency relations, and non-subject stylistic words to represent the writing style of unstructured texts of various authors, design an unsupervised approach to extract stylistic features, and employ principal components analysis and linear discriminant analysis to identify authorship of texts. This paper provides a uniform quantified method to capture syntactic and semantic stylistic characteristics of and between words and phrases, and this approach can solve the problem of the independence of different dimensions to some extent. Experimental results on two English text corpora show that our approach significantly improves the overall performance over authorship identification.

KW - Authorship identification

KW - Feature extraction

KW - Linear discriminant analysis

KW - Principal components analysis

KW - Semantic association model

UR - http://www.scopus.com/inward/record.url?scp=84902373138&partnerID=8YFLogxK

U2 - 10.1016/j.knosys.2014.04.025

DO - 10.1016/j.knosys.2014.04.025

M3 - Article

AN - SCOPUS:84902373138

SN - 0950-7051

VL - 66

SP - 99

EP - 111

JO - Knowledge-Based Systems

JF - Knowledge-Based Systems

ER -
