The effectiveness of data augmentation in code readability classification

https://doi.org/10.1016/j.infsof.2020.106378

Abstract

Context: Training deep learning models for code readability classification requires large datasets of high-quality pre-labeled data. However, acquiring readability data with manual labels is almost always time-consuming and expensive.

Objective: We therefore propose data augmentation approaches that artificially increase the size of the training set, aiming to reduce the risk of overfitting caused by the lack of readability data and, ultimately, to improve classification accuracy.

Method: We create transformed versions of code snippets by manipulating the original data along dimensions such as comments, indentation, and the names of classes, methods, and variables, guided by domain-specific knowledge. In addition to these basic transformations, we also explore the use of Auxiliary Classifier GANs to produce synthetic data.

Results: To evaluate the proposed approach, we conduct a set of experiments. The results show that the classification performance of deep neural networks improves significantly when they are trained on the augmented corpus, achieving a state-of-the-art accuracy of 87.38%.

Conclusion: We consider the findings of this study to be primary evidence of the effectiveness of data augmentation in the field of code readability classification.

Introduction

Code readability refers to a human judgment of how easy a piece of source code is to understand [1]. Research on code readability classification has drawn increasing attention from the software engineering community. To classify a piece of source code as Readable or Unreadable, most prior studies built machine learning models based on a set of handcrafted surface-level features (e.g., the number of identifiers) [1], [2]. In our latest research [3], we proposed introducing deep learning techniques to capture complicated features automatically from the source code. Although the experimental results showed that our approach outperformed the state-of-the-art, we argue that the model performance is still limited by the shortage of training data. In fact, only a few hundred human-annotated code snippets are available in the literature (see Section 2.1 for details), which may not be sufficient to sustain the training process. This shortage can lead to undesirable overfitting and thereby impede model performance, which motivates this research into effective ways to artificially enlarge the training set, with the underlying goal of further improving classification accuracy.

Current practice for collecting readability data is to conduct a large-scale survey, inviting as many domain experts as possible to rate code snippets for readability on a five-point Likert scale [1], [4]. However, such surveys usually incur a high cost [5]. Given that it is expensive (and sometimes impractical) to obtain enough code snippets with manual labels for model training, we propose augmenting existing readability data to support code readability classification. Specifically, the major contributions of this paper are:

  • We propose a group of domain-specific transformation techniques to generate additional code snippets. We also make use of Auxiliary Classifier GANs to produce synthetic data. To the best of our knowledge, this study is the first to adapt data augmentation for code readability classification.

  • We conduct a series of experiments to validate the effectiveness of the proposed approach using robust statistical tests, i.e., the Brunner-Munzel test and Cliff’s δ effect size (illustrated in the sketch after this list). The empirical results show that the model trained on the augmented corpus performs significantly better on code readability classification, reaching an accuracy of up to 87.38%.
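
For readers unfamiliar with these statistics, the following is a minimal sketch of one way to compute them in Python. It is an illustrative reimplementation, not the authors' evaluation code, and the accuracy arrays are invented placeholders.

```python
# Illustrative sketch (not the authors' code): comparing two sets of
# accuracy scores with the Brunner-Munzel test and Cliff's delta.
import numpy as np
from scipy.stats import brunnermunzel  # available in SciPy >= 1.2

def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    x, y = np.asarray(x), np.asarray(y)
    # Pairwise comparison matrix: +1 where x_i > y_j, -1 where x_i < y_j.
    diffs = np.sign(x[:, None] - y[None, :])
    return diffs.mean()

# Hypothetical accuracies from repeated runs (placeholders, not paper data).
aug = [0.86, 0.88, 0.87, 0.89, 0.85]   # trained on the augmented corpus
base = [0.81, 0.83, 0.80, 0.82, 0.84]  # trained on the original corpus

stat, p_value = brunnermunzel(aug, base)
print(f"Brunner-Munzel statistic = {stat:.3f}, p = {p_value:.4f}")
print(f"Cliff's delta = {cliffs_delta(aug, base):.3f}")
```

A δ near +1 means runs on the augmented corpus almost always beat the baseline runs; the Brunner-Munzel test is chosen because it does not assume equal variances, which suits small samples of accuracy scores.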

Section snippets

Proposed approach

We begin by briefly reviewing existing readability data. Based on these data, we design two types of augmentation schemes: rule-based transformations of real code snippets and GAN-based generation of synthetic data. The workflow of our research is illustrated in Fig. 1.
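
To make the rule-based scheme concrete, here is a minimal sketch of the kind of label-aware transformations the paper describes (comment removal, indentation changes, identifier renaming). The regexes, function names, and the Java snippet are our own illustrative choices under simplifying assumptions (e.g., no comment markers inside string literals); they are not the authors' implementation.

```python
# Minimal sketch of rule-based augmentation for code snippets.
import re

def strip_comments(code: str) -> str:
    """Remove // line comments and /* */ block comments (naive)."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", code)

def reindent(code: str, width: int = 8) -> str:
    """Replace each leading 4-space indentation level with `width` spaces."""
    def repl(match):
        levels = len(match.group(0)) // 4
        return " " * (width * levels)
    return re.sub(r"^(?: {4})+", repl, code, flags=re.MULTILINE)

def rename_identifier(code: str, old: str, new: str) -> str:
    """Rename a class/method/variable name (whole-word match only)."""
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

snippet = """\
public int sum(int[] values) {
    int total = 0;            // accumulator
    for (int v : values) {
        total += v;
    }
    return total;
}
"""

# Chain several transformations to produce one augmented variant.
augmented = rename_identifier(reindent(strip_comments(snippet)), "total", "t1")
print(augmented)
```

Whether a transformed snippet keeps or flips the original readability label (e.g., stripping comments and shortening names to create a less readable variant) is exactly the kind of domain-specific knowledge the approach encodes; this sketch leaves that mapping to the caller.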
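For the GAN-based scheme, the following PyTorch skeleton sketches the Auxiliary Classifier GAN idea: the generator is conditioned on a class label (readable vs. unreadable), and the discriminator predicts both real-vs-fake and the class. All layer sizes and names are placeholders, and snippets are treated here as fixed-length feature vectors for illustration; the paper's actual input representation and architecture may differ.

```python
# Sketch of an ACGAN for synthetic readability data (placeholder sizes).
import torch
import torch.nn as nn

LATENT, N_CLASSES, FEAT = 100, 2, 256  # assumed dimensions, not from the paper

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_CLASSES, LATENT)  # condition on class label
        self.net = nn.Sequential(
            nn.Linear(LATENT, 512), nn.ReLU(),
            nn.Linear(512, FEAT), nn.Tanh(),
        )

    def forward(self, z, labels):
        return self.net(z * self.embed(labels))  # label-conditioned noise

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(FEAT, 512), nn.LeakyReLU(0.2))
        self.adv = nn.Linear(512, 1)          # real vs. fake head
        self.aux = nn.Linear(512, N_CLASSES)  # auxiliary class head

    def forward(self, x):
        h = self.body(x)
        return torch.sigmoid(self.adv(h)), self.aux(h)

# Standard ACGAN objective: adversarial loss + auxiliary classification loss.
adv_loss, aux_loss = nn.BCELoss(), nn.CrossEntropyLoss()

G, D = Generator(), Discriminator()
z = torch.randn(8, LATENT)
labels = torch.randint(0, N_CLASSES, (8,))
fake = G(z, labels)
validity, cls_logits = D(fake)
g_loss = adv_loss(validity, torch.ones(8, 1)) + aux_loss(cls_logits, labels)
```

Training would alternate discriminator and generator updates as in a standard GAN; the sketch only wires the forward pass and the two ACGAN loss terms.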

Experimental setup

To explore whether the proposed approach can reduce the risk of overfitting and help improve classification accuracy, we plan to conduct a series of experimental evaluations. In particular, we aim to answer the following RQs:

  • RQ1: To what extent does data augmentation help improve code readability classification?

  • RQ2: How do different levels of data augmentation influence model performance?

RQ1 is to verify whether data augmentation can enhance model performance when used for code readability classification.

Results and discussion

In this section, we present experimental results with respect to each RQ and discuss the findings. For simplicity, we denote the size of the original corpus as N.

Conclusions and future work

In this pioneering research, we investigated different strategies to augment existing readability data to support code readability classification. The experimental results showed that deep neural networks trained with the augmented corpus performed significantly better than those trained with only real data. The improvement in accuracy ranges from 3% to 7%, confirming the notable effectiveness of data augmentation approaches in code readability classification.

Our work is a first step toward

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work is supported by the Beijing Municipal Natural Science Foundation (No. 4202004).

References (10)

  • Q. Mi et al., Improving code readability classification using convolutional neural networks, Inf. Softw. Technol. (2018)

  • R.P.L. Buse et al., Learning a metric for code readability, IEEE Trans. Softw. Eng. (2010)

  • S. Scalabrino et al., Improving code readability models with textual features, 2016 IEEE 24th International Conference on Program Comprehension (ICPC) (2016)

  • J. Dorn, A General Software Readability Model (2012)

  • D. Posnett et al., A simpler model of software readability, Proceedings of the 8th Working Conference on Mining Software Repositories (MSR ’11) (2011)
There are more references available in the full text version of this article.

Cited by (19)

  • Towards using visual, semantic and structural features to improve code readability classification

    2022, Journal of Systems and Software
    Citation Excerpt:

    Actually, even though we have used all available data for model training in our experiments (see RQ3 in Section 5 for details), we argue that the model performance is still limited by the shortage of training data. In our latest research (Mi et al., 2021), we preliminarily demonstrated that the classification performance of deep neural networks can be significantly improved when they are trained on an augmented corpus. Besides, a small test set also means that the distribution of the training set is very likely to differ from the test set distribution.

  • An analytical code quality methodology using Latent Dirichlet Allocation and Convolutional Neural Networks

    2022, Journal of King Saud University - Computer and Information Sciences
    Citation Excerpt:

    Those models are ConvNetcR (character-level representation), ConvNetTR (token-level representation) and DeepCRM (DL-based code readability). Mi et al. (2021) examined the 12 data sets using a Generative Adversarial Networks (GANs) model. The average overall accuracy (Acc) was 87.38% using only the readability feature.
