DOI: 10.1145/3674805.3686664

Unsupervised and Supervised Co-learning for Comment-based Codebase Refining and its Application in Code Search

Published: 24 October 2024

Abstract

Background: Code pre-training and large language models depend heavily on data quality. These models require a vast, high-quality corpus that pairs text descriptions with code in order to establish semantic correlations between natural and programming languages. Unlike typical NLP tasks, writing code comments relies heavily on specialized programming knowledge, and such comments are often limited in quantity and variety. Consequently, most widely available open-source datasets are built with compromises and carry noise from platforms such as StackOverflow, where code snippets are often incomplete. This can lead to significant errors when the trained models are deployed in real-world applications.

Aims: Comments are used as a substitute for queries when building code search datasets from GitHub. While comments describe code functionality and details, they often contain noise and differ from real user queries. Our research therefore focuses on improving the syntactic and semantic quality of code comments.

Method: We propose CoCoRF, a comment-based data refinement framework built on an unsupervised and supervised co-learning technique. It applies manually defined rules for syntax filtering and constructs a bootstrap query corpus via the WTFF algorithm to train a TVAE model for further semantic filtering.

Results: Our study shows that CoCoRF achieves high efficiency with fewer computational resources and outperforms comparison models on the DeepCS code search task.

Conclusions: Our findings indicate that the CoCoRF framework significantly improves code search performance by enhancing the quality of code datasets.
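The abstract describes a two-stage refinement pipeline: rule-based syntax filtering followed by TVAE-based semantic filtering. As a rough illustration of what the first stage could look like, the sketch below applies a few hypothetical hand-crafted rules (minimum length, no URLs, ASCII-only text, no task markers); these specific rules are assumptions for illustration, not the rules defined in the paper.

```python
import re

# Illustrative syntax-filter rules (hypothetical; the paper's actual
# hand-crafted rules are not reproduced here).
URL_RE = re.compile(r"https?://\S+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7f]")

def passes_syntax_filter(comment: str) -> bool:
    """Return True if a comment survives simple rule-based filtering."""
    text = comment.strip()
    if len(text.split()) < 3:                      # too short to describe functionality
        return False
    if URL_RE.search(text):                        # link-bearing, likely noisy comments
        return False
    if NON_ASCII_RE.search(text):                  # keep English/ASCII comments only
        return False
    if text.startswith(("TODO", "FIXME", "XXX")):  # task markers, not descriptions
        return False
    return True

comments = [
    "Return the index of the first matching element.",
    "TODO: fix this later",
    "see https://example.com",
    "ok",
]
kept = [c for c in comments if passes_syntax_filter(c)]
# kept contains only the first, descriptive comment
```

Comments surviving such rules would then be passed to the semantic-filtering stage, where a model trained on the WTFF-bootstrapped query corpus scores how query-like each comment is.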



      Published In

      ESEM '24: Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement
      October 2024
      633 pages
      ISBN: 9798400710476
      DOI: 10.1145/3674805

      Publisher

      Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. Code Search
      2. CodeSearchNet Cleaning
      3. Comment-Code dataset
      4. Self-attention Mechanism

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • Open Project Program of Yunnan Key Laboratory of Intelligent Systems and Computing
      • General Program of Applied Basic Research Programs of the Science and Technology Department of Yunnan Province
      • Science Research Fund Project of the Education Department of Yunnan Province
      • Key R&D Project of Hubei Province

      Conference

      ESEM '24

      Acceptance Rates

      Overall Acceptance Rate: 130 of 594 submissions, 22%

