Deep Learning Based Identification of Suspicious Return Statements

Guangjie Li; Hui Liu; Jiahao Jin; Qasim Umer

doi:10.1109/SANER48275.2020.9054826

Guangjie Li, Hui Liu^*, Jiahao Jin, Qasim Umer

^*Corresponding author for this work

School of Computer Science and Technology

Beijing Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

9 Citations (Scopus)

Abstract

Identifiers in source code are composed of natural-language terms. Such terms, and the phrases built from them, convey rich semantics that can be exploited for program analysis and comprehension. To this end, we propose a deep learning based approach, called MLDetector, for identifying suspicious return statements by leveraging the semantics conveyed by the natural language phrases used as identifiers in source code. We design a deep neural network to tell whether a given return statement matches its corresponding method signature. The rationale is that both the method signature and the return value should explicitly specify the output of the method, so a significant mismatch between them may indicate a suspicious return statement. To address the lack of negative training data, i.e., incorrect return statements, we generate negative training data automatically by transforming real-world correct return statements. To feed code into the neural network, we convert it into vectors with Word2Vec, an unsupervised neural network based learning algorithm. We evaluate the proposed approach in two parts. In the first part, we evaluate it on 500 open-source applications, using automatically generated labeled data. Results suggest that the precision of the proposed approach varies from 83% to 90%. In the second part, we conduct a case study on 100 real-world applications. The results suggest that 42 out of 65 real-world incorrect return statements are detected (a recall of roughly 65%), with a precision of 59%.
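The two data-preparation ideas in the abstract, splitting identifiers into natural-language terms and deriving negative samples from correct return statements, can be illustrated with a minimal Python sketch. The abstract does not say which transformations the authors apply, so the strategy below (replacing the returned expression with another expression assumed to be in scope) is only an assumption, and the names `split_identifier` and `make_negative_sample` are hypothetical, not taken from the paper.

```python
import random
import re
from typing import List, Tuple


def split_identifier(identifier: str) -> List[str]:
    """Split a camelCase or snake_case identifier into lowercase terms."""
    terms: List[str] = []
    # First cut on underscores and any non-word characters.
    for part in re.split(r"[_\W]+", identifier):
        # Then cut on case boundaries, keeping acronyms together,
        # e.g. "parseHTTPResponse" -> "parse", "HTTP", "Response".
        terms.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [term.lower() for term in terms]


def make_negative_sample(
    method_name: str,
    correct_return: str,
    in_scope: List[str],
    rng: random.Random,
) -> Tuple[str, str, int]:
    """Build one negative (label 0) sample from a correct return statement.

    Assumed transformation (not confirmed by the abstract): replace the
    returned expression with a different expression from the same scope.
    """
    alternatives = [expr for expr in in_scope if expr != correct_return]
    if not alternatives:
        raise ValueError("no alternative expression available in scope")
    return method_name, rng.choice(alternatives), 0


if __name__ == "__main__":
    rng = random.Random(0)
    print(split_identifier("getUserName"))
    print(make_negative_sample("getUserName", "userName",
                               ["userName", "userId", "password"], rng))
```

In a full pipeline, the term lists produced by `split_identifier` would be the tokens passed to a Word2Vec model, and each (method name, return expression, label) triple would become one training example for the classifier.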

Original language	English
Title of host publication	SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering
Editors	Kostas Kontogiannis, Foutse Khomh, Alexander Chatzigeorgiou, Marios-Eleftherios Fokaefs, Minghui Zhou
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	480-491
Number of pages	12
ISBN (Electronic)	9781728151434
DOIs	https://doi.org/10.1109/SANER48275.2020.9054826
Publication status	Published - Feb 2020
Event	27th IEEE International Conference on Software Analysis, Evolution, and Reengineering, SANER 2020 - London, Canada
Duration	18 Feb 2020 → 21 Feb 2020

Publication series

Name	SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering

Conference

Conference	27th IEEE International Conference on Software Analysis, Evolution, and Reengineering, SANER 2020
Country/Territory	Canada
City	London
Period	18/02/20 → 21/02/20

Keywords

Bug Detection
Code Quality
Deep Learning
Identification
Program Analysis
Return Value

Cite this

Li, G., Liu, H., Jin, J., & Umer, Q. (2020). Deep Learning Based Identification of Suspicious Return Statements. In K. Kontogiannis, F. Khomh, A. Chatzigeorgiou, M.-E. Fokaefs, & M. Zhou (Eds.), SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering (pp. 480-491). Article 9054826 (SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/SANER48275.2020.9054826

Li, Guangjie ; Liu, Hui ; Jin, Jiahao et al. / Deep Learning Based Identification of Suspicious Return Statements. SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering. editor / Kostas Kontogiannis ; Foutse Khomh ; Alexander Chatzigeorgiou ; Marios-Eleftherios Fokaefs ; Minghui Zhou. Institute of Electrical and Electronics Engineers Inc., 2020. pp. 480-491 (SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering).

@inproceedings{59826712fb1e4145a62e6c21b37d76c5,

title = "Deep Learning Based Identification of Suspicious Return Statements",

abstract = "Identifiers in source code are composed of terms in natural languages. Such terms, as well as phrases composed of such terms, convey rich semantics that could be exploited for program analysis and comprehension. To this end, in this paper we propose a deep learning based approach, called MLDetector, to identifying suspicious return statements by leveraging semantics conveyed by the natural language phrases that are used as identifiers in the source code. We specially design a deep neural network to tell whether a given return statement matches its corresponding method signature. The rationale is that both method signature and return value should explicitly specify the output of the method, and thus a significant mismatch between method signature and return value may suggest a suspicious return statement. To address the challenge of lacking negative training data, i.e., incorrect return statements, we generate negative training data automatically by transforming real-world correct return statements. To feed code into neural network, we convert them into vectors by Word2Vec, an unsupervised neural network based learning algorithm. We evaluate the proposed approach in two parts. In the first part, we evaluate it on 500 open-source applications by automatically generating labeled training data. Results suggest that the precision of the proposed approach varies from 83% to 90%. In the second part, we conduct a case study on 100 real-world applications. Evaluation results suggest that 42 out of 65 real-world incorrect return statements are detected (with precision of 59%).",

keywords = "Bug Detection, Code Quality, Deep Learning, Identification, Program Analysis, Return Value",

author = "Guangjie Li and Hui Liu and Jiahao Jin and Qasim Umer",

note = "Publisher Copyright: {\textcopyright} 2020 IEEE.; 27th IEEE International Conference on Software Analysis, Evolution, and Reengineering, SANER 2020 ; Conference date: 18-02-2020 Through 21-02-2020",

year = "2020",

month = feb,

doi = "10.1109/SANER48275.2020.9054826",

language = "English",

series = "SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "480--491",

editor = "Kostas Kontogiannis and Foutse Khomh and Alexander Chatzigeorgiou and Marios-Eleftherios Fokaefs and Minghui Zhou",

booktitle = "SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering",

address = "United States",

}

Li, G, Liu, H, Jin, J & Umer, Q 2020, Deep Learning Based Identification of Suspicious Return Statements. in K Kontogiannis, F Khomh, A Chatzigeorgiou, M-E Fokaefs & M Zhou (eds), SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering, 9054826, SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering, Institute of Electrical and Electronics Engineers Inc., pp. 480-491, 27th IEEE International Conference on Software Analysis, Evolution, and Reengineering, SANER 2020, London, Canada, 18/02/20. https://doi.org/10.1109/SANER48275.2020.9054826

Deep Learning Based Identification of Suspicious Return Statements. / Li, Guangjie; Liu, Hui; Jin, Jiahao et al.
SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering. ed. / Kostas Kontogiannis; Foutse Khomh; Alexander Chatzigeorgiou; Marios-Eleftherios Fokaefs; Minghui Zhou. Institute of Electrical and Electronics Engineers Inc., 2020. p. 480-491 9054826 (SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering).

TY - GEN

T1 - Deep Learning Based Identification of Suspicious Return Statements

AU - Li, Guangjie

AU - Liu, Hui

AU - Jin, Jiahao

AU - Umer, Qasim

PY - 2020/2

Y1 - 2020/2

N2 - Identifiers in source code are composed of terms in natural languages. Such terms, as well as phrases composed of such terms, convey rich semantics that could be exploited for program analysis and comprehension. To this end, in this paper we propose a deep learning based approach, called MLDetector, to identifying suspicious return statements by leveraging semantics conveyed by the natural language phrases that are used as identifiers in the source code. We specially design a deep neural network to tell whether a given return statement matches its corresponding method signature. The rationale is that both method signature and return value should explicitly specify the output of the method, and thus a significant mismatch between method signature and return value may suggest a suspicious return statement. To address the challenge of lacking negative training data, i.e., incorrect return statements, we generate negative training data automatically by transforming real-world correct return statements. To feed code into neural network, we convert them into vectors by Word2Vec, an unsupervised neural network based learning algorithm. We evaluate the proposed approach in two parts. In the first part, we evaluate it on 500 open-source applications by automatically generating labeled training data. Results suggest that the precision of the proposed approach varies from 83% to 90%. In the second part, we conduct a case study on 100 real-world applications. Evaluation results suggest that 42 out of 65 real-world incorrect return statements are detected (with precision of 59%).

KW - Bug Detection

KW - Code Quality

KW - Deep Learning

KW - Identification

KW - Program Analysis

KW - Return Value

UR - http://www.scopus.com/inward/record.url?scp=85083556267&partnerID=8YFLogxK

U2 - 10.1109/SANER48275.2020.9054826

DO - 10.1109/SANER48275.2020.9054826

M3 - Conference contribution

AN - SCOPUS:85083556267

T3 - SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering

SP - 480

EP - 491

BT - SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering

A2 - Kontogiannis, Kostas

A2 - Khomh, Foutse

A2 - Chatzigeorgiou, Alexander

A2 - Fokaefs, Marios-Eleftherios

A2 - Zhou, Minghui

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 27th IEEE International Conference on Software Analysis, Evolution, and Reengineering, SANER 2020

Y2 - 18 February 2020 through 21 February 2020

ER -

Li G, Liu H, Jin J, Umer Q. Deep Learning Based Identification of Suspicious Return Statements. In Kontogiannis K, Khomh F, Chatzigeorgiou A, Fokaefs ME, Zhou M, editors, SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering. Institute of Electrical and Electronics Engineers Inc. 2020. p. 480-491. 9054826. (SANER 2020 - Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution, and Reengineering). doi: 10.1109/SANER48275.2020.9054826
