research-article

On the naturalness of auto-generated code: can we identify auto-generated code automatically?

Authors:

Shinji Kusumoto

ICPC '18: Proceedings of the 26th Conference on Program Comprehension

Pages 340 - 343

https://doi.org/10.1145/3196321.3196356

Published: 28 May 2018

Abstract

Recently, a variety of studies have been conducted on source code analysis. If the target source code includes auto-generated code, that code is usually removed in a preprocessing phase because its presence may have negative effects on the analysis. A straightforward way to remove auto-generated code is to search for the special comments that are included in auto-generated files. However, this approach cannot identify auto-generated code if such special comments have been removed for some reason, and inspecting source files one by one manually takes too much effort. In this paper, we propose a new technique that identifies auto-generated code by using its naturalness. We used a golden set that includes thousands of hand-made source files and source files generated by four kinds of compiler-compilers. Through an evaluation with this dataset, we confirmed that our technique was able to identify auto-generated code with over 99% precision and recall in all cases.
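The abstract does not say which language model or decision rule the technique uses, so the following is only a minimal sketch of the general idea of scoring files by naturalness. It assumes that naturalness is measured as the cross-entropy of a file under a bigram model with add-one smoothing trained on hand-made code, and that files scoring above a chosen threshold are treated as candidates for auto-generated code. The names `BigramModel`, `flag_outliers`, and the threshold value are hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter

# Assumption: a crude lexer that splits identifiers, numbers, and symbols.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source):
    """Split source text into identifiers, numbers, and single symbols."""
    return TOKEN_RE.findall(source)


class BigramModel:
    """Bigram language model with add-one smoothing (illustrative only)."""

    def __init__(self, corpus):
        self.context_counts = Counter()  # how often a token appears as context
        self.bigram_counts = Counter()   # how often (context, token) appears
        self.vocab = {"<s>", "<unk>"}
        for text in corpus:
            tokens = ["<s>"] + tokenize(text)
            self.vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def prob(self, prev, cur):
        """Smoothed probability of `cur` following `prev`."""
        numerator = self.bigram_counts[(prev, cur)] + 1
        denominator = self.context_counts[prev] + len(self.vocab)
        return numerator / denominator

    def cross_entropy(self, text):
        """Average negative log2 probability per token (bits per token)."""
        known = [t if t in self.vocab else "<unk>" for t in tokenize(text)]
        tokens = ["<s>"] + known
        pairs = list(zip(tokens, tokens[1:]))
        if not pairs:
            return 0.0
        total = sum(math.log2(self.prob(p, c)) for p, c in pairs)
        return -total / len(pairs)


def flag_outliers(model, files, threshold):
    """Return names of files whose cross-entropy exceeds the threshold.

    Assumption: files that look unlike the hand-made training corpus are
    candidates for auto-generated code. The threshold is hypothetical and
    would have to be tuned on labeled data.
    """
    return [name for name, text in files.items()
            if model.cross_entropy(text) > threshold]


if __name__ == "__main__":
    handmade = ["int count = 0 ;", "int total = count + 1 ;"]
    model = BigramModel(handmade)
    candidates = {
        "Counter.java": "int count = 0 ;",
        "Lexer.java": "yy_state = yy_nxt [ yy_state ] ;",
    }
    for name, text in candidates.items():
        print(name, round(model.cross_entropy(text), 3))
    print(flag_outliers(model, candidates, threshold=3.0))
```

In practice a higher-order model with better smoothing, trained on a far larger corpus, would likely be needed, and whether generated code scores higher or lower than hand-made code should be checked empirically rather than assumed.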

Index Terms

1. Software and its engineering
  1. Software creation and management

Published In

ICPC '18: Proceedings of the 26th Conference on Program Comprehension

May 2018

423 pages

ISBN:9781450357142

DOI:10.1145/3196321

General Chair: Foutse Khomh, École Polytechnique de Montréal, Canada

Program Chairs: Chanchal K. Roy, University of Saskatchewan, Canada; Janet Siegmund, University of Passau, Germany

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICSE '18

Sponsor:

SIGSOFT
IEEE-CS

ICSE '18: 40th International Conference on Software Engineering

May 28 - 29, 2018

Gothenburg, Sweden
