DOI: 10.1145/3674805.3686664

Unsupervised and Supervised Co-learning for Comment-based Codebase Refining and its Application in Code Search

Published: 24 October 2024

Abstract

Background: Code pre-training and large language models depend heavily on data quality. These models require a vast, high-quality corpus that pairs text descriptions with code in order to establish semantic correlations between natural and programming languages. Unlike typical NLP tasks, writing code comments relies heavily on specialized programming knowledge, and such comments are often limited in quantity and variety. Consequently, most widely available open-source datasets are built with compromises and carry noise from platforms such as StackOverflow, where code snippets are often incomplete. This can lead to significant errors when the trained models are deployed in real-world applications.

Aims: Comments are used as a substitute for queries when building code search datasets from GitHub. While comments describe code functionality and details, they often contain noise and differ from real user queries. Our research therefore focuses on improving the syntactic and semantic quality of code comments.

Method: We propose CoCoRF, a comment-based data refinement framework built on an unsupervised and supervised co-learning technique. It applies manually defined rules for syntax filtering and constructs a bootstrap query corpus via the WTFF algorithm to train a TVAE model for further semantic filtering.

Results: Our study shows that CoCoRF achieves high efficiency with fewer computational resources and outperforms comparison models on the DeepCS code search task.

Conclusions: Our findings indicate that the CoCoRF framework significantly improves code search performance by enhancing the quality of code datasets.
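The abstract describes a two-stage refinement pipeline: rule-based syntax filtering followed by TVAE-based semantic filtering. As a rough illustration of what the first stage could look like, the sketch below applies a few hypothetical hand-crafted rules (minimum length, no URLs, ASCII-only text, no task markers); these specific rules are assumptions for illustration, not the rules defined in the paper.

```python
import re

# Illustrative syntax-filter rules (hypothetical; the paper's actual
# hand-crafted rules are not reproduced here).
URL_RE = re.compile(r"https?://\S+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7f]")

def passes_syntax_filter(comment: str) -> bool:
    """Return True if a comment survives simple rule-based filtering."""
    text = comment.strip()
    if len(text.split()) < 3:                      # too short to describe functionality
        return False
    if URL_RE.search(text):                        # link-bearing, likely noisy comments
        return False
    if NON_ASCII_RE.search(text):                  # keep English/ASCII comments only
        return False
    if text.startswith(("TODO", "FIXME", "XXX")):  # task markers, not descriptions
        return False
    return True

comments = [
    "Return the index of the first matching element.",
    "TODO: fix this later",
    "see https://example.com",
    "ok",
]
kept = [c for c in comments if passes_syntax_filter(c)]
# kept contains only the first, descriptive comment
```

Comments surviving such rules would then be passed to the semantic-filtering stage, where a model trained on the WTFF-bootstrapped query corpus scores how query-like each comment is.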



      Published In

      ESEM '24: Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement
      October 2024
      633 pages
      ISBN: 9798400710476
      DOI: 10.1145/3674805

      Publisher

      Association for Computing Machinery, New York, NY, United States


      Author Tags

      1. Code Search
      2. CodeSearchNet Cleaning
      3. Comment-Code dataset
      4. Self-attention Mechanism

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • Open Project Program of Yunnan Key Laboratory of Intelligent Systems and Computing
      • General Program of Applied Basic Research Programs of the Science and Technology Department of Yunnan Province
      • Science Research Fund Project of the Education Department of Yunnan Province
      • Key R&D Project of Hubei Province

      Conference

      ESEM '24

      Acceptance Rates

      Overall Acceptance Rate: 130 of 594 submissions, 22%

