research-article

Leveraging a corpus of natural language descriptions for program similarity

Published: 20 October 2016

Abstract

Program similarity is a central challenge in many programming-related applications, such as code search, clone detection, automatic translation, and programming education.
We present a novel approach for establishing the similarity of code fragments by: (i) obtaining textual descriptions of code fragments captured in millions of posts on question-answering sites, blogs and other sources, and (ii) using natural language processing techniques to establish similarity between textual descriptions, and thus between their corresponding code fragments.
To improve precision, we use a simple static analysis that extracts type signatures, and combine the results of textual similarity with similarity of the signatures. Because our notion of code similarity is based on similarity of textual descriptions, our approach can determine semantic relatedness and similarity of code across different libraries and even across different programming languages, a task considered extremely difficult using traditional approaches. To evaluate our approach, we use data obtained from the popular question-answering site Stack Overflow. To obtain ground truth to compare against, we developed a crowdsourcing system, Like2Drops, that allows users to label the similarity of code fragments. We used the system to collect similarity classifications for a massive corpus of 6,500 program pairs. Our results show that our technique is effective in determining similarity, and achieves more than 85 percent precision, recall and accuracy.
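
The idea of scoring two code fragments by the similarity of their descriptions, then refining that score with type-signature similarity, might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not say which text-similarity measure, signature representation, or combination rule is used, so TF-IDF with cosine similarity, Jaccard overlap over sets of type names, the linear weighting, and all function names and sample data below are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a description and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(descriptions):
    """Build one sparse TF-IDF vector (a dict) per description."""
    docs = [Counter(tokenize(d)) for d in descriptions]
    n = len(docs)
    # Document frequency: in how many descriptions each term occurs.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF, so terms found in every description keep a nonzero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def signature_similarity(sig_a, sig_b):
    """Jaccard overlap of two type signatures given as sets of type names.

    Assumption: a signature is reduced to the set of types it mentions.
    """
    a, b = set(sig_a), set(sig_b)
    if not a or not b:
        return 0.0  # no type evidence on at least one side
    return len(a & b) / len(a | b)


def combined_similarity(text_sim, sig_sim, weight=0.7):
    """Linear blend of the two scores; the weight is a hypothetical choice."""
    return weight * text_sim + (1.0 - weight) * sig_sim


# Hypothetical descriptions of three fragments, e.g. titles of Q&A posts.
descriptions = [
    "Read a text file line by line into a list of strings",
    "Read a file line by line and store the lines in a list",
    "Sort a dictionary by its values in descending order",
]
# Hypothetical type signatures extracted by a simple static analysis.
signatures = [{"string", "list"}, {"string", "list"}, {"dict", "list"}]

vecs = tfidf_vectors(descriptions)
score_related = combined_similarity(
    cosine(vecs[0], vecs[1]),
    signature_similarity(signatures[0], signatures[1]),
)
score_unrelated = combined_similarity(
    cosine(vecs[0], vecs[2]),
    signature_similarity(signatures[0], signatures[2]),
)
```

Because both scores are computed from descriptions and abstracted type names rather than from syntax, the first two fragments could be written in different languages and still receive a higher combined score than the unrelated pair.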

    Published In

    Onward! 2016: Proceedings of the 2016 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software
    October 2016
    268 pages
    ISBN:9781450340762
    DOI:10.1145/2986012
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Code Similarity
    2. Natural Language
    3. Program Analysis
    4. Semantics

    Qualifiers

    • Research-article

    Conference

    SPLASH '16
    Acceptance Rates

    Overall Acceptance Rate 40 of 105 submissions, 38%

    Cited By

    • (2024) A Generative AI-Driven Method-Level Semantic Clone Detection Based on the Structural and Semantical Comparison of Methods. IEEE Access 12, 70773-70791. DOI: 10.1109/ACCESS.2024.3401770. Online publication date: 2024
    • (2022) Mining the Limits of Granularity for Microservice Annotations. Service-Oriented Computing, 273-281. DOI: 10.1007/978-3-031-20984-0_19. Online publication date: 22-Nov-2022
    • (2022) Semantics-Driven Learning for Microservice Annotations. Service-Oriented Computing, 255-263. DOI: 10.1007/978-3-031-20984-0_17. Online publication date: 22-Nov-2022
    • (2021) Validating static warnings via testing code fragments. Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, 540-552. DOI: 10.1145/3460319.3464832. Online publication date: 11-Jul-2021
    • (2021) Automatic API Usage Scenario Documentation from Technical Q&A Sites. ACM Transactions on Software Engineering and Methodology 30:3, 1-45. DOI: 10.1145/3439769. Online publication date: 23-Apr-2021
    • (2021) Artefact Relation Graphs for Unit Test Reuse Recommendation. 2021 14th IEEE Conference on Software Testing, Verification and Validation (ICST), 137-147. DOI: 10.1109/ICST49551.2021.00025. Online publication date: Apr-2021
    • (2019) code2vec: learning distributed representations of code. Proceedings of the ACM on Programming Languages 3:POPL, 1-29. DOI: 10.1145/3290353. Online publication date: 2-Jan-2019
    • (2019) SeSaMe. Proceedings of the 16th International Conference on Mining Software Repositories, 529-533. DOI: 10.1109/MSR.2019.00079. Online publication date: 26-May-2019
    • (2018) A general path-based representation for predicting program properties. ACM SIGPLAN Notices 53:4, 404-419. DOI: 10.1145/3296979.3192412. Online publication date: 11-Jun-2018
    • (2018) A general path-based representation for predicting program properties. Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, 404-419. DOI: 10.1145/3192366.3192412. Online publication date: 11-Jun-2018
