The effectiveness of data augmentation in code readability classification
Introduction
Code readability refers to a human judgment of how easy a piece of source code is to understand [1]. Research on code readability classification has drawn increasing attention from the software engineering community. To classify a code snippet as Readable or Unreadable, most prior studies built machine learning models on a set of handcrafted surface-level features (e.g., the number of identifiers) [1], [2]. In our latest research [3], we proposed introducing deep learning techniques to capture complex features automatically from the source code. Although the experimental results showed that our approach outperformed the state of the art, we argue that model performance is still limited by the shortage of training data. In fact, only a few hundred human-annotated code snippets are available in the literature (see Section 2.1 for details), which may not be sufficient to sustain the training process. This shortage can lead to undesirable overfitting and thereby impede model performance, which inspires this research into finding effective ways to artificially enlarge the training set, with the underlying goal of further improving classification accuracy.
The current practice for collecting readability data is to perform a large-scale survey, inviting as many domain experts as possible to rate code snippets for readability on a five-point Likert scale [1], [4]. However, the survey process usually incurs quite a high cost [5]. Considering that it is always expensive (and sometimes impractical) to obtain enough manually labeled code snippets for model training, we propose to augment existing readability data to support code readability classification. Specifically, the major contributions of this paper are:
- •
We propose a group of domain-specific transformation techniques to generate additional code snippets. We also make use of Auxiliary Classifier GANs to produce synthetic data. To the best of our knowledge, this study is the first to adapt data augmentation for code readability classification.
- •
We conduct a series of experiments to validate the effectiveness of the proposed approach using robust statistical tests, i.e., the Brunner-Munzel test and the Cliff’s δ effect size. The empirical results show that the model trained on the augmented corpus performs significantly better on code readability classification, reaching up to 87.38% accuracy.
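As a rough illustration of the statistical machinery named above, the Brunner-Munzel test is available in SciPy, and Cliff's δ can be computed directly from its definition. The accuracy values below are invented for illustration only; they are not the paper's results.

```python
# Illustrative only: the per-run accuracy values below are made up,
# not the paper's results. Compares two sets of accuracies with the
# Brunner-Munzel test (SciPy) and the Cliff's delta effect size.
from scipy.stats import brunnermunzel

def cliffs_delta(xs, ys):
    """Cliff's delta: (#{x > y} - #{x < y}) / (|xs| * |ys|), in [-1, 1]."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

acc_augmented = [0.87, 0.84, 0.88, 0.85, 0.87]   # hypothetical runs
acc_original  = [0.81, 0.85, 0.83, 0.79, 0.82]   # hypothetical runs

stat, p_value = brunnermunzel(acc_augmented, acc_original)
delta = cliffs_delta(acc_augmented, acc_original)
print(f"Brunner-Munzel p = {p_value:.4f}, Cliff's delta = {delta:+.2f}")
```

A Cliff's δ close to +1 indicates that nearly every augmented-corpus run outperforms every original-corpus run, which is why effect size is reported alongside the significance test.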
Section snippets
Proposed approach
We begin by briefly reviewing existing readability data. Based on these data, we design two types of augmentation schemes. The workflow of our research is illustrated in Fig. 1.
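To make the idea of an augmentation scheme concrete, the sketch below shows one plausible label-preserving transformation, systematic identifier renaming; the concrete operators used in the paper may differ, and the helper name here is our own.

```python
# A minimal sketch of one possible label-preserving transformation:
# renaming identifiers changes surface tokens while leaving structural
# readability cues (line length, nesting, operator density) intact.
# The operator choice and helper name are illustrative, not the paper's.
import re

def rename_identifiers(snippet: str, mapping: dict) -> str:
    """Replace each whole-word identifier key in `mapping` with its value."""
    for old, new in mapping.items():
        snippet = re.sub(rf"\b{re.escape(old)}\b", new, snippet)
    return snippet

original = "int total = 0;\nfor (int i = 0; i < n; i++) total += arr[i];"
variant = rename_identifiers(original, {"total": "sum", "arr": "values"})
# `variant` is a new training snippet carrying the same readability label.
```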
Experimental setup
To explore whether the proposed approach can reduce the risk of overfitting and help improve classification accuracy, we plan to conduct a series of experimental evaluations. In particular, we aim to answer the following RQs:
- •
RQ1: To what extent does data augmentation help improve code readability classification?
- •
RQ2: How do different levels of data augmentation influence model performance?
RQ1 is to verify whether data augmentation can enhance model performance when used for code readability classification.
Results and discussion
In this section, we present experimental results with respect to each RQ and discuss the findings. For simplicity, we denote the size of the original corpus as N.
Conclusions and future work
In this pioneering research, we investigated different strategies for augmenting existing readability data to support code readability classification. The experimental results showed that deep neural networks trained on the augmented corpus performed significantly better than those trained on real data alone. The improvement in accuracy ranged from 3% to 7%, confirming the notable effectiveness of data augmentation approaches in code readability classification.
Our work is a first step toward
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This work is supported by the Beijing Municipal Natural Science Foundation (No. 4202004).
References (10)
- Q. Mi et al., Improving code readability classification using convolutional neural networks, Inf. Softw. Technol. (2018)
- R.P.L. Buse et al., Learning a metric for code readability, IEEE Trans. Softw. Eng. (2010)
- S. Scalabrino et al., Improving code readability models with textual features, 2016 IEEE 24th International Conference on Program Comprehension (ICPC) (2016)
- J. Dorn, A General Software Readability Model (2012)
- D. Posnett et al., A simpler model of software readability, Proceedings of the 8th Working Conference on Mining Software Repositories (MSR '11) (2011)