TY - JOUR
T1 - Authorship identification from unstructured texts
AU - Zhang, Chunxia
AU - Wu, Xindong
AU - Niu, Zhendong
AU - Ding, Wei
PY - 2014/8
Y1 - 2014/8
N2 - Authorship identification is the task of identifying the authors of anonymous texts given examples of the authors' writing. The growing volume of anonymous text on the Internet makes authorship identification increasingly urgent. It has been applied in a growing range of practical settings, including literary works, intelligence, criminal law, civil law, and computer forensics. In this paper, we propose a semantic association model of voice, word dependency relations, and non-subject stylistic words to represent the writing style of unstructured texts by various authors, design an unsupervised approach to extract stylistic features, and employ principal components analysis and linear discriminant analysis to identify the authorship of texts. This paper provides a uniform quantified method to capture syntactic and semantic stylistic characteristics of and between words and phrases, and this approach can, to some extent, solve the problem of the independence of different dimensions. Experimental results on two English text corpora show that our approach significantly improves overall authorship identification performance.
AB - Authorship identification is the task of identifying the authors of anonymous texts given examples of the authors' writing. The growing volume of anonymous text on the Internet makes authorship identification increasingly urgent. It has been applied in a growing range of practical settings, including literary works, intelligence, criminal law, civil law, and computer forensics. In this paper, we propose a semantic association model of voice, word dependency relations, and non-subject stylistic words to represent the writing style of unstructured texts by various authors, design an unsupervised approach to extract stylistic features, and employ principal components analysis and linear discriminant analysis to identify the authorship of texts. This paper provides a uniform quantified method to capture syntactic and semantic stylistic characteristics of and between words and phrases, and this approach can, to some extent, solve the problem of the independence of different dimensions. Experimental results on two English text corpora show that our approach significantly improves overall authorship identification performance.
KW - Authorship identification
KW - Feature extraction
KW - Linear discriminant analysis
KW - Principal components analysis
KW - Semantic association model
UR - http://www.scopus.com/inward/record.url?scp=84902373138&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2014.04.025
DO - 10.1016/j.knosys.2014.04.025
M3 - Article
AN - SCOPUS:84902373138
SN - 0950-7051
VL - 66
SP - 99
EP - 111
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
ER -