TY - GEN
T1 - Authorship identification of source codes
AU - Zhang, Chunxia
AU - Wang, Sen
AU - Wu, Jiayu
AU - Niu, Zhendong
N1 - Publisher Copyright: © Springer International Publishing AG 2017.
PY - 2017
Y1 - 2017
AB - Source code authorship identification is a form of document authorship identification: its goal is to identify the authors of source code or programs from code samples written by programmers. Its main applications include detecting software intellectual property infringement, detecting malicious code, and supporting software maintenance and updates. This paper proposes an approach for constructing author profiles of programmers based on a logic model of continuous and discrete word-level n-grams, together with a multi-level context model of operations, loops, arrays and methods. Further, we employ sequential minimal optimization for support vector machine training to identify the authorship of source code. The advantage of the author profiles in this paper is that they can discover explicit and implicit personal programming preference patterns of, and between, keywords, identifiers, operators, statements, methods and classes. Experimental results on programs from two open source websites demonstrate that our approach achieves high accuracy and outperforms the baseline methods.
KW - Authorship identification
KW - Discrete word-level n-gram
KW - Sequential minimal optimization
KW - Software forensics
KW - Source code
UR - http://www.scopus.com/inward/record.url?scp=85028453731&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-63579-8_22
DO - 10.1007/978-3-319-63579-8_22
M3 - Conference contribution
AN - SCOPUS:85028453731
SN - 9783319635781
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 282
EP - 296
BT - Web and Big Data - 1st International Joint Conference, APWeb-WAIM 2017, Proceedings
A2 - Shahabi, Cyrus
A2 - Lian, Xiang
A2 - Jensen, Christian S.
A2 - Yang, Xiaochun
A2 - Chen, Lei
PB - Springer Verlag
T2 - 1st Asia-Pacific Web and Web-Age Information Management Joint Conference on Web and Big Data, APWeb-WAIM 2017
Y2 - 7 July 2017 through 9 July 2017
ER -