DOI: 10.1145/3196321.3196356
research-article

On the naturalness of auto-generated code: can we identify auto-generated code automatically?

Published: 28 May 2018

Abstract

Recently, a variety of studies have been conducted on source code analysis. If the target source code includes auto-generated code, that code is usually removed in a preprocessing phase because its presence may negatively affect the analysis. A straightforward way to remove auto-generated code is to search for the special comments that are included in auto-generated files. However, this approach cannot identify auto-generated code if such comments have been removed for some reason, and inspecting source files manually one by one takes too much effort. In this paper, we propose a new technique that identifies auto-generated code by using the naturalness of auto-generated code. We used a golden set that includes thousands of hand-made source files and source files generated by four kinds of compiler-compilers. Through an evaluation with this dataset, we confirmed that our technique was able to identify auto-generated code with over 99% precision and recall in all cases.
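
The abstract does not describe how naturalness is turned into a classifier, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes two bigram language models with add-one smoothing, one trained on hand-made files and one on generated files, and assigns a file to whichever model gives it the lower cross-entropy. The tokenizer, the smoothing choice, the decision rule, and the toy corpora `HANDMADE` and `GENERATED` are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical lexer: identifiers, integers, or any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source):
    """Split source text into identifier, number, and symbol tokens."""
    return TOKEN_RE.findall(source)


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, documents):
        self.context_counts = Counter()  # how often a token appears as a context
        self.bigram_counts = Counter()   # how often (previous, current) appears
        self.vocab = {"<s>", "<unk>"}
        for doc in documents:
            tokens = ["<s>"] + tokenize(doc)
            self.vocab.update(tokens)
            for prev, cur in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def cross_entropy(self, source):
        """Average negative log2 probability per token; lower means more predictable."""
        known = [t if t in self.vocab else "<unk>" for t in tokenize(source)]
        tokens = ["<s>"] + known
        pairs = list(zip(tokens, tokens[1:]))
        if not pairs:
            return 0.0
        vocab_size = len(self.vocab)
        total = 0.0
        for prev, cur in pairs:
            prob = (self.bigram_counts[(prev, cur)] + 1) / (
                self.context_counts[prev] + vocab_size
            )
            total -= math.log2(prob)
        return total / len(pairs)


def classify(source, handmade_model, generated_model):
    """Assumed decision rule: pick the model under which the file is more natural."""
    if generated_model.cross_entropy(source) < handmade_model.cross_entropy(source):
        return "generated"
    return "handmade"


# Toy corpora, far too small to be meaningful; they only make the sketch runnable.
HANDMADE = [
    "total = total + price * count ;",
    "if ( count > limit ) return total ;",
]
GENERATED = [
    "yy_table = { 1 , 2 , 3 , 4 , 5 , 6 } ;",
    "yy_check = { 7 , 8 , 9 , 1 , 2 , 3 } ;",
]

handmade_model = BigramModel(HANDMADE)
generated_model = BigramModel(GENERATED)

if __name__ == "__main__":
    sample = "yy_next = { 4 , 5 , 6 , 7 } ;"
    print(classify(sample, handmade_model, generated_model))
```

Comparing two models avoids assuming in advance whether generated code is more or less predictable than hand-made code. With corpora this small the `<unk>` token dominates the scores, so a realistic setup would need far more training data and a stronger smoothing method.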

    Published In

    ICPC '18: Proceedings of the 26th Conference on Program Comprehension
    May 2018
    423 pages
    ISBN:9781450357142
    DOI:10.1145/3196321

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. N-gram language model
    2. auto-generated code
    3. source code analysis

    Qualifiers

    • Research-article

    Conference

    ICSE '18