Authorship identification of source codes

Chunxia Zhang*, Sen Wang, Jiayu Wu, Zhendong Niu

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

18 Citations (Scopus)

Abstract

Source code authorship identification is a special case of document authorship identification: given sample programs written by known programmers, the task is to identify the author of a piece of source code. Its main applications include detecting software intellectual property infringement, detecting malicious code, and supporting software maintenance and updates. This paper proposes an approach that constructs author profiles of programmers from a logic model of continuous and discrete word-level n-grams, together with a multi-level context model of operations, loops, arrays, and methods. We then use sequential minimal optimization to train a support vector machine that identifies the authorship of source code. The advantage of these author profiles is that they capture explicit and implicit personal programming preference patterns within and between keywords, identifiers, operators, statements, methods, and classes. Experimental results on programs from two open source websites demonstrate that our approach achieves high accuracy and outperforms the baseline methods.
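The n-gram part of the profile construction can be illustrated with a minimal sketch. The abstract does not define its terms, so this rests on assumptions: a "continuous" word-level n-gram is taken to be n adjacent tokens, and a "discrete" one is taken to be n tokens in their original order, drawn from a small window with at least one token skipped. The tokenizer and the names `tokenize`, `continuous_ngrams`, `discrete_ngrams`, and `author_profile` are hypothetical and are not taken from the paper. The multi-level context model and the support vector machine training are not shown.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed tokenizer: identifiers/keywords, integer literals, and any other
# single non-space character (operators, brackets, punctuation).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source):
    """Split source text into word-level tokens."""
    return TOKEN_RE.findall(source)


def continuous_ngrams(tokens, n):
    """Return every run of n adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def discrete_ngrams(tokens, n, window):
    """Return ordered n-token tuples taken from a sliding window of the
    given size, keeping only tuples that skip at least one token.
    This reading of 'discrete' is an assumption, not the paper's definition."""
    grams = []
    for i in range(len(tokens)):
        following = range(i + 1, min(i + window, len(tokens)))
        for rest in combinations(following, n - 1):
            idx = (i,) + rest
            if idx[-1] - idx[0] > n - 1:  # drop fully adjacent tuples
                grams.append(tuple(tokens[j] for j in idx))
    return grams


def author_profile(sources, n=2, window=3):
    """Count both kinds of n-gram over all of one author's sample programs."""
    profile = Counter()
    for src in sources:
        toks = tokenize(src)
        profile.update(("cont", g) for g in continuous_ngrams(toks, n))
        profile.update(("disc", g) for g in discrete_ngrams(toks, n, window))
    return profile


if __name__ == "__main__":
    profile = author_profile(["for (int i = 0;"])
    for feature, count in profile.most_common():
        print(feature, count)
```

Counts of this kind could serve as one group of features in a per-author vector passed to a classifier. Tagging each n-gram as `"cont"` or `"disc"` keeps the two feature families separate, so adjacent and gapped token patterns are not merged into one count.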

Original language: English
Title of host publication: Web and Big Data - 1st International Joint Conference, APWeb-WAIM 2017, Proceedings
Editors: Cyrus Shahabi, Xiang Lian, Christian S. Jensen, Xiaochun Yang, Lei Chen
Publisher: Springer Verlag
Pages: 282-296
Number of pages: 15
ISBN (Print): 9783319635781
Publication status: Published - 2017
Event: 1st Asia-Pacific Web and Web-Age Information Management Joint Conference on Web and Big Data, APWeb-WAIM 2017 - Beijing, China
Duration: 7 Jul 2017 to 9 Jul 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10366 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 1st Asia-Pacific Web and Web-Age Information Management Joint Conference on Web and Big Data, APWeb-WAIM 2017
Country/Territory: China
City: Beijing
Period: 7/07/17 to 9/07/17

Keywords

  • Authorship identification
  • Discrete word-level n-gram
  • Sequential minimal optimization
  • Software forensics
  • Source code
