research-article

Leveraging a corpus of natural language descriptions for program similarity

Authors:

Meital Zilberstein,

Eran Yahav

Onward! 2016: Proceedings of the 2016 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software

Pages 197 - 211

https://doi.org/10.1145/2986012.2986013

Published: 20 October 2016

Abstract

Program similarity is a central challenge in many programming-related applications, such as code search, clone detection, automatic translation, and programming education.

We present a novel approach for establishing the similarity of code fragments by: (i) obtaining textual descriptions of code fragments captured in millions of posts on question-answering sites, blogs and other sources, and (ii) using natural language processing techniques to establish similarity between textual descriptions, and thus between their corresponding code fragments.
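As a rough illustration of step (ii), the sketch below scores two textual descriptions with TF-IDF weighting and cosine similarity. This is an assumption made for illustration: the abstract does not say which natural language processing technique the authors use, and the function names (`tokenize`, `tfidf_vectors`, `cosine`, `description_similarity`) and the sample descriptions are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(descriptions):
    """Build one sparse TF-IDF vector (a dict) per description."""
    docs = [Counter(tokenize(d)) for d in descriptions]
    n = len(docs)
    # Document frequency: in how many descriptions each term appears.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed inverse document frequency (an illustrative choice).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def description_similarity(descriptions, i, j):
    """Similarity of descriptions i and j within a small corpus."""
    vectors = tfidf_vectors(descriptions)
    return cosine(vectors[i], vectors[j])


# Hypothetical descriptions, standing in for titles of Q&A posts.
DESCRIPTIONS = [
    "How to read a file line by line in Java",
    "Reading a text file line by line in Python",
    "Sort a list of integers in descending order",
]
```

Under this sketch, the two file-reading descriptions score higher with each other than either does with the sorting description, even though their code fragments would be written in different languages.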

To improve precision, we use a simple static analysis that extracts type signatures, and combine the results of textual similarity with similarity of the signatures. Because our notion of code similarity is based on similarity of textual descriptions, our approach can determine semantic relatedness and similarity of code across different libraries and even across different programming languages, a task considered extremely difficult using traditional approaches. To evaluate our approach, we use data obtained from the popular question-answering site Stack Overflow. To obtain a ground truth to compare against, we developed a crowdsourcing system, Like2Drops, that allows users to label the similarity of code fragments. We used the system to collect similarity classifications for a massive corpus of 6,500 program pairs. Our results show that our technique is effective in determining similarity, and achieves more than 85 percent precision, recall, and accuracy.
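The combination step and the reported metrics can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' method: the abstract does not say how the two similarities are combined, so the Jaccard measure over type names, the linear weight `alpha`, and all function names here are hypothetical.

```python
def signature_similarity(sig_a, sig_b):
    """Jaccard overlap of the type names in two signatures (assumed measure)."""
    a, b = set(sig_a), set(sig_b)
    if not a or not b:
        return 0.0  # No type information: treat as no evidence of similarity.
    return len(a & b) / len(a | b)


def combined_similarity(text_sim, sig_sim, alpha=0.7):
    """Weighted average of text and signature similarity; alpha is hypothetical."""
    return alpha * text_sim + (1.0 - alpha) * sig_sim


def evaluate(predicted, labels):
    """Precision, recall, and accuracy of boolean predictions against labels."""
    pairs = list(zip(predicted, labels))
    tp = sum(1 for p, l in pairs if p and l)
    fp = sum(1 for p, l in pairs if p and not l)
    fn = sum(1 for p, l in pairs if not p and l)
    tn = sum(1 for p, l in pairs if not p and not l)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy
```

In a setting like the one described, `labels` would come from crowdsourced judgments of program pairs and `predicted` from thresholding the combined score; the threshold itself would also have to be chosen, and nothing in the abstract fixes its value.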

Index Terms

Leveraging a corpus of natural language descriptions for program similarity
1. Theory of computation
  1. Semantics and reasoning
    1. Program semantics

Published In

Onward! 2016: Proceedings of the 2016 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software

October 2016

268 pages

ISBN: 9781450340762

DOI: 10.1145/2986012

General Chair: Eelco Visser (Delft University of Technology, Netherlands)

Program Chairs: Emerson Murphy-Hill (North Carolina State University, USA), Crista Lopes (University of California at Irvine, USA)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

In-Cooperation

SIGAda: ACM Special Interest Group on Ada Programming Language

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SPLASH '16

Sponsor:

SIGPLAN

SPLASH '16: Conference on Systems, Programming, Languages, and Applications: Software for Humanity

November 2 - 4, 2016

Amsterdam, Netherlands

Acceptance Rates

Overall Acceptance Rate 40 of 105 submissions, 38%
