The effectiveness of data augmentation in code readability classification

https://doi.org/10.1016/j.infsof.2020.106378

Abstract

Context: Training deep learning models for code readability classification requires large datasets of high-quality pre-labeled data. However, acquiring readability data with manual labels is almost always time-consuming and expensive.

Objective: We therefore propose data augmentation approaches that artificially increase the size of the training set, aiming to reduce the risk of overfitting caused by the lack of readability data and, ultimately, to improve classification accuracy.

Method: We create transformed versions of code snippets by manipulating the original data along dimensions such as comments, indentation, and the names of classes, methods, and variables, guided by domain-specific knowledge. In addition to these basic transformations, we also explore the use of Auxiliary Classifier GANs to produce synthetic data.

Results: To evaluate the proposed approach, we conduct a set of experiments. The results show that the classification performance of deep neural networks improves significantly when they are trained on the augmented corpus, achieving a state-of-the-art accuracy of 87.38%.

Conclusion: We consider the findings of this study to be primary evidence of the effectiveness of data augmentation in the field of code readability classification.

Introduction

Code readability refers to a human judgment of how easy a piece of source code is to understand [1]. Research on code readability classification has drawn increasing attention from the software engineering community. To classify a piece of source code as Readable or Unreadable, most prior studies built machine learning models based on a set of handcrafted surface-level features (e.g., the number of identifiers) [1], [2]. In our latest research [3], we proposed introducing deep learning techniques to capture complicated features automatically from the source code. Although the experimental results showed that our approach outperformed the state-of-the-art, we argue that the model performance is still limited by the shortage of training data. In fact, only a few hundred human-annotated code snippets are available in the literature (see Section 2.1 for details), which may not be sufficient to sustain the training process. This shortage can lead to undesirable overfitting and thereby impede model performance, which motivates this research into effective ways to artificially enlarge the training set, with the underlying goal of further improving classification accuracy.

Current practice for collecting readability data is to conduct a large-scale survey, inviting as many domain experts as possible to rate code snippets for readability on a five-point Likert scale [1], [4]. However, such surveys usually incur a high cost [5]. Given that it is expensive (and sometimes impractical) to obtain enough code snippets with manual labels for model training, we propose augmenting existing readability data to support code readability classification. Specifically, the major contributions of this paper are:

  • We propose a group of domain-specific transformation techniques to generate additional code snippets. We also make use of Auxiliary Classifier GANs to produce synthetic data. To the best of our knowledge, this study is the first to adapt data augmentation for code readability classification.

  • We conduct a series of experiments to validate the effectiveness of the proposed approach using robust statistical tests, i.e., the Brunner-Munzel test and Cliff’s δ effect size (illustrated in the sketch after this list). The empirical results show that the model trained on the augmented corpus performs significantly better on code readability classification, reaching an accuracy of up to 87.38%.
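
For readers unfamiliar with these statistics, the following is a minimal sketch of one way to compute them in Python. It is an illustrative reimplementation, not the authors' evaluation code, and the accuracy arrays are invented placeholders.

```python
# Illustrative sketch (not the authors' code): comparing two sets of
# accuracy scores with the Brunner-Munzel test and Cliff's delta.
import numpy as np
from scipy.stats import brunnermunzel  # available in SciPy >= 1.2

def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    x, y = np.asarray(x), np.asarray(y)
    # Pairwise comparison matrix: +1 where x_i > y_j, -1 where x_i < y_j.
    diffs = np.sign(x[:, None] - y[None, :])
    return diffs.mean()

# Hypothetical accuracies from repeated runs (placeholders, not paper data).
aug = [0.86, 0.88, 0.87, 0.89, 0.85]   # trained on the augmented corpus
base = [0.81, 0.83, 0.80, 0.82, 0.84]  # trained on the original corpus

stat, p_value = brunnermunzel(aug, base)
print(f"Brunner-Munzel statistic = {stat:.3f}, p = {p_value:.4f}")
print(f"Cliff's delta = {cliffs_delta(aug, base):.3f}")
```

A δ near +1 means runs on the augmented corpus almost always beat the baseline runs; the Brunner-Munzel test is chosen because it does not assume equal variances, which suits small samples of accuracy scores.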

Section snippets

Proposed approach

We begin by briefly reviewing existing readability data. Based on these data, we design two types of augmentation schemes: rule-based transformations of real code snippets and GAN-based generation of synthetic data. The workflow of our research is illustrated in Fig. 1.
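
To make the rule-based scheme concrete, here is a minimal sketch of the kind of label-aware transformations the paper describes (comment removal, indentation changes, identifier renaming). The regexes, function names, and the Java snippet are our own illustrative choices under simplifying assumptions (e.g., no comment markers inside string literals); they are not the authors' implementation.

```python
# Minimal sketch of rule-based augmentation for code snippets.
import re

def strip_comments(code: str) -> str:
    """Remove // line comments and /* */ block comments (naive)."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", code)

def reindent(code: str, width: int = 8) -> str:
    """Replace each leading 4-space indentation level with `width` spaces."""
    def repl(match):
        levels = len(match.group(0)) // 4
        return " " * (width * levels)
    return re.sub(r"^(?: {4})+", repl, code, flags=re.MULTILINE)

def rename_identifier(code: str, old: str, new: str) -> str:
    """Rename a class/method/variable name (whole-word match only)."""
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

snippet = """\
public int sum(int[] values) {
    int total = 0;            // accumulator
    for (int v : values) {
        total += v;
    }
    return total;
}
"""

# Chain several transformations to produce one augmented variant.
augmented = rename_identifier(reindent(strip_comments(snippet)), "total", "t1")
print(augmented)
```

Whether a transformed snippet keeps or flips the original readability label (e.g., stripping comments and shortening names to create a less readable variant) is exactly the kind of domain-specific knowledge the approach encodes; this sketch leaves that mapping to the caller.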
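For the GAN-based scheme, the following PyTorch skeleton sketches the Auxiliary Classifier GAN idea: the generator is conditioned on a class label (readable vs. unreadable), and the discriminator predicts both real-vs-fake and the class. All layer sizes and names are placeholders, and snippets are treated here as fixed-length feature vectors for illustration; the paper's actual input representation and architecture may differ.

```python
# Sketch of an ACGAN for synthetic readability data (placeholder sizes).
import torch
import torch.nn as nn

LATENT, N_CLASSES, FEAT = 100, 2, 256  # assumed dimensions, not from the paper

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_CLASSES, LATENT)  # condition on class label
        self.net = nn.Sequential(
            nn.Linear(LATENT, 512), nn.ReLU(),
            nn.Linear(512, FEAT), nn.Tanh(),
        )

    def forward(self, z, labels):
        return self.net(z * self.embed(labels))  # label-conditioned noise

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(FEAT, 512), nn.LeakyReLU(0.2))
        self.adv = nn.Linear(512, 1)          # real vs. fake head
        self.aux = nn.Linear(512, N_CLASSES)  # auxiliary class head

    def forward(self, x):
        h = self.body(x)
        return torch.sigmoid(self.adv(h)), self.aux(h)

# Standard ACGAN objective: adversarial loss + auxiliary classification loss.
adv_loss, aux_loss = nn.BCELoss(), nn.CrossEntropyLoss()

G, D = Generator(), Discriminator()
z = torch.randn(8, LATENT)
labels = torch.randint(0, N_CLASSES, (8,))
fake = G(z, labels)
validity, cls_logits = D(fake)
g_loss = adv_loss(validity, torch.ones(8, 1)) + aux_loss(cls_logits, labels)
```

Training would alternate discriminator and generator updates as in a standard GAN; the sketch only wires the forward pass and the two ACGAN loss terms.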

Experimental setup

To explore whether the proposed approach can reduce the risk of overfitting and help improve classification accuracy, we plan to conduct a series of experimental evaluations. In particular, we aim to answer the following RQs:

  • RQ1: To what extent does data augmentation help improve code readability classification?

  • RQ2: How do different levels of data augmentation influence model performance?

RQ1 is to verify whether data augmentation can enhance model performance when used for code readability classification.

Results and discussion

In this section, we present experimental results with respect to each RQ and discuss the findings. For simplicity, we denote the size of the original corpus as N.

Conclusions and future work

In this pioneering research, we investigated different strategies to augment existing readability data to support code readability classification. The experimental results showed that deep neural networks trained with the augmented corpus performed significantly better than those trained with only real data. The improvement in accuracy ranges from 3% to 7%, confirming the notable effectiveness of data augmentation approaches in code readability classification.

Our work is a first step toward

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work is supported by the Beijing Municipal Natural Science Foundation (No. 4202004).

References (10)

  • Q. Mi et al., Improving code readability classification using convolutional neural networks, Inf. Softw. Technol. (2018)

  • R.P.L. Buse et al., Learning a metric for code readability, IEEE Trans. Softw. Eng. (2010)

  • S. Scalabrino et al., Improving code readability models with textual features, 2016 IEEE 24th International Conference on Program Comprehension (ICPC) (2016)

  • J. Dorn, A General Software Readability Model (2012)

  • D. Posnett et al., A simpler model of software readability, Proceedings of the 8th Working Conference on Mining Software Repositories (MSR ’11) (2011)
There are more references available in the full text version of this article.

Cited by (19)

  • Towards using visual, semantic and structural features to improve code readability classification

    2022, Journal of Systems and Software
    Citation Excerpt:

    Actually, even though we have used all available data for model training in our experiments (see RQ3 in Section 5 for details), we argue that the model performance is still limited by the shortage of training data. In our latest research (Mi et al., 2021), we preliminarily demonstrated that the classification performance of deep neural networks can be significantly improved when they are trained on an augmented corpus. Besides, a small test set also means that the distribution of the training set is very likely to differ from the test set distribution.

  • An analytical code quality methodology using Latent Dirichlet Allocation and Convolutional Neural Networks

    2022, Journal of King Saud University - Computer and Information Sciences
    Citation Excerpt:

    Those models are ConvNetcR (character-level representation), ConvNetTR (token-level representation) and DeepCRM (DL-based code readability). Mi et al. (2021) examined the 12 data sets using a Generative Adversarial Networks (GANs) model. The average overall accuracy (Acc) was 87.38% using only the readability feature.
