Authorship identification of source codes

Chunxia Zhang; Sen Wang; Jiayu Wu; Zhendong Niu

doi:10.1007/978-3-319-63579-8_22

Chunxia Zhang^*, Sen Wang, Jiayu Wu, Zhendong Niu

^*Corresponding author for this work

School of Computer Science and Technology

Beijing Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

18 Citations (Scopus)

Abstract

Source code authorship identification is a special case of document authorship identification: the task is to identify the authors of source code or programs based on code samples from known programmers. Its main applications include investigating software intellectual property infringement, detecting malicious code, and supporting software maintenance and updates. This paper proposes an approach for constructing author profiles of programmers based on a logic model of continuous and discrete word-level n-grams, together with a multi-level context model of operations, loops, arrays and methods. We then employ sequential minimal optimization for support vector machine training to identify the authorship of source code. The advantage of these author profiles is that they can capture explicit and implicit personal programming preference patterns of and between keywords, identifiers, operators, statements, methods and classes. Experimental results on programs from two open-source websites demonstrate that our approach achieves high accuracy and outperforms the baseline methods.
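
As a rough illustration of the word-level n-gram idea mentioned in the abstract, the following Python sketch builds a simple frequency profile from a code snippet. It is not the authors' implementation. The abstract does not define "discrete word-level n-gram", so the sketch assumes it means an ordered, non-contiguous token sequence taken within a small window. The tokenizer, the function names, and the `window` parameter are all hypothetical, and the multi-level context model and the SVM training step are not shown.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical tokenizer: identifiers/keywords, integer literals,
# and single punctuation or operator characters.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def tokenize(source):
    """Split source code into word-level tokens."""
    return TOKEN_RE.findall(source)


def continuous_ngrams(tokens, n):
    """Return all runs of n adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def discrete_ngrams(tokens, n, window):
    """Return ordered n-token sequences with at least one gap.

    ASSUMPTION: "discrete" is read here as non-contiguous; the sequence
    must start at some position and fit inside `window` tokens.
    """
    grams = []
    for start in range(len(tokens)):
        stop = min(start + window, len(tokens))
        for rest in combinations(range(start + 1, stop), n - 1):
            idx = (start,) + rest
            if idx[-1] - idx[0] > n - 1:  # skip fully contiguous runs
                grams.append(tuple(tokens[i] for i in idx))
    return grams


def build_profile(source, n=2, window=3):
    """Count continuous and discrete n-grams for one code sample."""
    tokens = tokenize(source)
    profile = Counter()
    for gram in continuous_ngrams(tokens, n):
        profile[("continuous", gram)] += 1
    for gram in discrete_ngrams(tokens, n, window):
        profile[("discrete", gram)] += 1
    return profile
```

Profiles like this could be turned into feature vectors, one per code sample, and passed to any SVM trainer; that step is left out here because the abstract gives no details beyond naming sequential minimal optimization.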

Original language	English
Title of host publication	Web and Big Data - 1st International Joint Conference, APWeb-WAIM 2017, Proceedings
Editors	Cyrus Shahabi, Xiang Lian, Christian S. Jensen, Xiaochun Yang, Lei Chen
Publisher	Springer Verlag
Pages	282-296
Number of pages	15
ISBN (Print)	9783319635781
DOIs	https://doi.org/10.1007/978-3-319-63579-8_22
Publication status	Published - 2017
Event	1st Asia-Pacific Web and Web-Age Information Management Joint Conference on Web and Big Data, APWeb-WAIM 2017 - Beijing, China. Duration: 7 Jul 2017 → 9 Jul 2017

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	10366 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	1st Asia-Pacific Web and Web-Age Information Management Joint Conference on Web and Big Data, APWeb-WAIM 2017
Country/Territory	China
City	Beijing
Period	7/07/17 → 9/07/17

Keywords

Authorship identification
Discrete word-level n-gram
Sequential minimal optimization
Software forensics
Source code

Access to Document

10.1007/978-3-319-63579-8_22

Cite this

Zhang, C., Wang, S., Wu, J., & Niu, Z. (2017). Authorship identification of source codes. In C. Shahabi, X. Lian, C. S. Jensen, X. Yang, & L. Chen (Eds.), Web and Big Data - 1st International Joint Conference, APWeb-WAIM 2017, Proceedings (pp. 282-296). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10366 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-63579-8_22

Zhang, Chunxia ; Wang, Sen ; Wu, Jiayu et al. / Authorship identification of source codes. Web and Big Data - 1st International Joint Conference, APWeb-WAIM 2017, Proceedings. editor / Cyrus Shahabi ; Xiang Lian ; Christian S. Jensen ; Xiaochun Yang ; Lei Chen. Springer Verlag, 2017. pp. 282-296 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{20f03d594d7a49479b168f80e1dcb28c,

title = "Authorship identification of source codes",

abstract = "Source code authorship identification is an issue of authorship identification from documents, and it is to identify authors of source codes or programs based on source code examples of programmers. The main applications of authorship identification of source codes include software intellectual property infringement, malicious code detection and software maintenance and update. This paper proposes an approach of constructing author profiles of programmers based on a logic model of continuous word-level n-gram and discrete word-level n-gram, and a multi-level context model about operations, loops, arrays and methods. Further, we employ the technique of sequential minimal optimization for support vector machine training to identify authorship of source codes. The advantage of author profiles in this paper can discover explicit and implicit personal programming preference patterns of and between keywords, identifiers, operators, statements, methods and classes. Experimental results on programs from two open source websites demonstrate that our approach achieves a high accuracy and outperforms the baseline methods.",

keywords = "Authorship identification, Discrete word-level n-gram, Sequential minimal optimization, Software forensics, Source code",

author = "Chunxia Zhang and Sen Wang and Jiayu Wu and Zhendong Niu",

note = "Publisher Copyright: {\textcopyright} Springer International Publishing AG 2017.; 1st Asia-Pacific Web and Web-Age Information Management Joint Conference on Web and Big Data, APWeb-WAIM 2017 ; Conference date: 07-07-2017 Through 09-07-2017",

year = "2017",

doi = "10.1007/978-3-319-63579-8_22",

language = "English",

isbn = "9783319635781",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

publisher = "Springer Verlag",

pages = "282--296",

editor = "Cyrus Shahabi and Xiang Lian and Jensen, {Christian S.} and Xiaochun Yang and Lei Chen",

booktitle = "Web and Big Data - 1st International Joint Conference, APWeb-WAIM 2017, Proceedings",

address = "Germany",

}

Zhang, C, Wang, S, Wu, J & Niu, Z 2017, Authorship identification of source codes. in C Shahabi, X Lian, CS Jensen, X Yang & L Chen (eds), Web and Big Data - 1st International Joint Conference, APWeb-WAIM 2017, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 10366 LNCS, Springer Verlag, pp. 282-296, 1st Asia-Pacific Web and Web-Age Information Management Joint Conference on Web and Big Data, APWeb-WAIM 2017, Beijing, China, 7/07/17. https://doi.org/10.1007/978-3-319-63579-8_22

Authorship identification of source codes. / Zhang, Chunxia; Wang, Sen; Wu, Jiayu et al.
Web and Big Data - 1st International Joint Conference, APWeb-WAIM 2017, Proceedings. ed. / Cyrus Shahabi; Xiang Lian; Christian S. Jensen; Xiaochun Yang; Lei Chen. Springer Verlag, 2017. p. 282-296 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10366 LNCS).

TY - GEN

T1 - Authorship identification of source codes

AU - Zhang, Chunxia

AU - Wang, Sen

AU - Wu, Jiayu

AU - Niu, Zhendong

N1 - Publisher Copyright: © Springer International Publishing AG 2017.

PY - 2017

Y1 - 2017

N2 - Source code authorship identification is an issue of authorship identification from documents, and it is to identify authors of source codes or programs based on source code examples of programmers. The main applications of authorship identification of source codes include software intellectual property infringement, malicious code detection and software maintenance and update. This paper proposes an approach of constructing author profiles of programmers based on a logic model of continuous word-level n-gram and discrete word-level n-gram, and a multi-level context model about operations, loops, arrays and methods. Further, we employ the technique of sequential minimal optimization for support vector machine training to identify authorship of source codes. The advantage of author profiles in this paper can discover explicit and implicit personal programming preference patterns of and between keywords, identifiers, operators, statements, methods and classes. Experimental results on programs from two open source websites demonstrate that our approach achieves a high accuracy and outperforms the baseline methods.

AB - Source code authorship identification is an issue of authorship identification from documents, and it is to identify authors of source codes or programs based on source code examples of programmers. The main applications of authorship identification of source codes include software intellectual property infringement, malicious code detection and software maintenance and update. This paper proposes an approach of constructing author profiles of programmers based on a logic model of continuous word-level n-gram and discrete word-level n-gram, and a multi-level context model about operations, loops, arrays and methods. Further, we employ the technique of sequential minimal optimization for support vector machine training to identify authorship of source codes. The advantage of author profiles in this paper can discover explicit and implicit personal programming preference patterns of and between keywords, identifiers, operators, statements, methods and classes. Experimental results on programs from two open source websites demonstrate that our approach achieves a high accuracy and outperforms the baseline methods.

KW - Authorship identification

KW - Discrete word-level n-gram

KW - Sequential minimal optimization

KW - Software forensics

KW - Source code

UR - http://www.scopus.com/inward/record.url?scp=85028453731&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-63579-8_22

DO - 10.1007/978-3-319-63579-8_22

M3 - Conference contribution

AN - SCOPUS:85028453731

SN - 9783319635781

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 282

EP - 296

BT - Web and Big Data - 1st International Joint Conference, APWeb-WAIM 2017, Proceedings

A2 - Shahabi, Cyrus

A2 - Lian, Xiang

A2 - Jensen, Christian S.

A2 - Yang, Xiaochun

A2 - Chen, Lei

PB - Springer Verlag

T2 - 1st Asia-Pacific Web and Web-Age Information Management Joint Conference on Web and Big Data, APWeb-WAIM 2017

Y2 - 7 July 2017 through 9 July 2017

ER -

Zhang C, Wang S, Wu J, Niu Z. Authorship identification of source codes. In Shahabi C, Lian X, Jensen CS, Yang X, Chen L, editors, Web and Big Data - 1st International Joint Conference, APWeb-WAIM 2017, Proceedings. Springer Verlag. 2017. p. 282-296. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-319-63579-8_22
