Boosting source code suggestion with self-supervised Transformer Gated Highway

https://doi.org/10.1016/j.jss.2022.111553

Highlights

  • Transformer Gated Highway with two different self-supervised strategies.

  • Training procedures are shown to significantly impact modeling performance.

  • Similar learning procedures with different model architectures yield only marginal differences.

  • Extensive evaluation on two datasets (Java & C#) against five different baselines.

Abstract

Attention-based transformer language models have shown significant performance gains in various natural language tasks. In this work, we explore the impact of transformer language models on the task of source code suggestion. The core intention of this work is to boost modeling performance for the source code suggestion task and to explore how training procedures and model architectures impact modeling performance. Additionally, we propose a transformer-based self-supervised learning technique called Transformer Gated Highway that outperforms recurrent and transformer language models of comparable size. The proposed approach combines the transformer language model with a Gated Highway, introducing a notion of recurrence. We compare the performance of the proposed approach with the transformer-based BERT (CodeTran), RoBERTa (RoBERTaCode), GPT2 (TravTrans), and CodeGen models and the recurrent LSTM-based CodeLSTM model. Moreover, we experiment with various architectural settings for the transformer models to evaluate their impact on modeling performance. An extensive evaluation shows that the presented approach performs better on two programming language datasets: Java and C#. Additionally, we adopt the presented approach for the syntax error correction task, predicting the correct syntax token, to demonstrate its possible implications for other source code modeling tasks.

Introduction

Source code suggestion, also known as code completion, is the most frequently used feature of Integrated Development Environments (IDEs). This feature offers predictions for the next possible token to software developers, allowing them to speed up the software development process. For automatic code suggestion to be effective, the next possible code token must be predicted within the top ranks. Traditional code suggestion tools rely on the code context already written in the IDE and order predictions by frequency counts or alphabetically, which significantly limits their capabilities.

This motivated the use of machine learning for the task of source code suggestion, as machine learning methods not only learn the source code context effectively but can also serve as a code sighting method. Such approaches can be trained on real-world codebases and are therefore capable of providing predictions that the IDE has not yet observed. Early approaches adapted n-gram language models to sequences of source code tokens (Hindle et al., 2012, Franks et al., 2015). Later, Recurrent Neural Networks (RNNs) (White et al., 2015) and their variants (Hussain et al., 2020, Hussain et al., 2021) were adopted to improve accuracy. Although predictive models cannot be expected to be perfect, the accuracy of the current state-of-the-art methods still leaves substantial room for improvement.

Recently, attention-based transformer models (Vaswani et al., 2017) and their variants (Devlin et al., 2018, Liu et al., 2019a) have revolutionized the NLP domain by removing recurrent layers entirely. Transformer-based models allow parallelization and have achieved a new state of the art in translation and other NLP tasks. Architectures based on the transformer, such as BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018), have shown superior performance compared to RNNs. These models are trained on large amounts of data using high-performance machines with several GPUs.

To this end, we present an empirical study exploring the capabilities of neural language models for the task of source code suggestion. Here, we treat source code suggestion as a classification task. Among the various neural language models proposed in the literature, we focus on traditional recurrent neural language models and transformer-based language models. Specifically, we experimented with the transformer-based CodeTran (BERT), RoBERTaCode (RoBERTa), TravTrans (GPT2), and CodeGen models and the LSTM-based CodeLSTM model. We propose a methodology called Transformer Gated Highway, which employs a self-supervised learning approach along with an enhanced encoder–decoder model and outperforms the recurrent and transformer baselines. The Gated Highway introduces a notion of recurrence, resulting in better modeling performance. We experimented with two different self-supervised learning approaches to study their impact on modeling performance. Additionally, we show the generalization of our approach by employing it for a different source code modeling task, syntax error correction. For the syntax error correction task, our target is to predict the correct syntax token (abstract token) rather than to locate the position of the syntax error. The presented approach is evaluated extensively with two programming language datasets: Java and C#. The results suggest that, on average, the proposed approach achieves an accuracy of around 95% for correct syntax token prediction within the top three ranks and 90% for the source code suggestion task within the top ten ranks for Java. The core intention of this work is to explore how learning strategies and model architecture design impact source code modeling performance.

The main contributions of this work are as follows:

  1. We propose a methodology called Transformer Gated Highway (TGH) with self-supervised learning techniques that outperforms five baselines, one recurrent-based and four transformer-based. Using a customized decoder along with a specifically tailored learning procedure, the presented Transformer Gated Highway exhibits enhanced performance compared to the other baselines.

  2. We evaluate the proposed approach on two different programming language datasets, Java and C#, for the task of source code suggestion. Additionally, we employ the proposed methodology for the prediction of correct syntax tokens to demonstrate its generalization and discuss other possible implications.

  3. We experimented with different model architecture settings to evaluate their impact on modeling performance. Specifically, we experimented with data size and model architecture designs for performance evaluation. We make the material used in this work publicly available.1

As mentioned earlier, the intent here is to boost the modeling performance for the source code suggestion task. Additionally, we explore how training procedures (i.e., encoding, self-supervised learning) and model architectures impact modeling performance. Specifically, we aim to explore the following research questions during this study.

  1. How well does the proposed approach perform compared to transformer models on the task of source code suggestion? To answer this question, we first trained four transformer-based models as baselines for the task of source code suggestion. Specifically, we experimented with CodeTran, which employs BERT; RoBERTaCode, which employs RoBERTa; TravTrans, which employs GPT2; and the CodeGen language model. Additionally, we experiment with different self-supervised learning techniques and explore their impact on modeling performance (a small illustrative sketch of one such technique follows this list). The intent here is to establish baselines and push them toward enhanced performance. Finally, we compare the enhanced baselines with the proposed approach.

  2. What is the modeling performance of traditional Recurrent Neural Networks (RNNs) compared to the proposed approach on the source code suggestion task? In this research question, we explore the capabilities of traditional RNN-based models compared to the proposed approach. Here, we aim to study whether recurrent language models are completely dominated by transformer language models or whether their performance can still be improved. Specifically, we evaluate how different learning procedures with a similar model architecture impact modeling performance for the task of source code suggestion.

  3. What is the impact of model design, data size, and the generalization of the proposed approach? Here, we study the impact of model design by altering the encoder and decoder architectures, and we study how data size affects modeling performance. We also examine whether similar learning procedures with different models have any significant impact on modeling performance for the task of source code suggestion. Finally, to validate its generalization, we employ the proposed approach for the prediction of correct syntax tokens and discuss its applicability to other related tasks.
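
To make the notion of a self-supervised learning strategy concrete, the following is a minimal sketch of one such strategy, BERT-style random masking over a sequence of code tokens. The mask rate, special tokens, and function names are illustrative assumptions, not necessarily the exact strategies evaluated in this work.

import random

MASK, IGNORE = "<mask>", "<pad>"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    # Randomly hide a fraction of tokens; the model is trained to reconstruct them.
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)    # position the model must reconstruct
            labels.append(tok)     # supervision comes from the original token
        else:
            inputs.append(tok)
            labels.append(IGNORE)  # position that does not contribute to the loss
    return inputs, labels

code_tokens = ["public", "static", "void", "main", "(", "String", "[", "]", "args", ")"]
print(mask_tokens(code_tokens, seed=0))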

The rest of the paper is organized as follows. We discuss the related background in Section 2. The proposed methodology and the model architecture are discussed in detail in Sections 3 (Proposed methodology) and 4 (Transformer Gated Highway). We describe the training and evaluation procedures in Section 5, and the results are discussed in Section 6. We discuss the threats to validity in Section 7, followed by the related work in Section 8. Finally, we conclude the work in Section 9.

Section snippets

Background

In this section, we briefly describe the vanilla RNN, LSTM, and transformer models.

Proposed methodology

In this section, we define the source code suggestion task as a classification task and comprehensively describe the proposed methodology. The closely related task of syntax error correction aims to predict the correct syntax token given the source code context. The overall workflow of the proposed approach is illustrated in Fig. 1. In the first step, we collect open-source datasets for two different programming languages: Java and C#. Next, we
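
Since the task is framed as next-token classification, the following is a minimal sketch of how (context, next-token) training pairs could be derived from a tokenized code fragment. The naive tokenizer, context window size, and names below are illustrative assumptions rather than the actual preprocessing pipeline used in this work.

import re

def tokenize(code):
    # Very naive lexer for illustration: identifiers, numbers, and single punctuation marks.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def make_training_pairs(code, context_len=5):
    tokens = tokenize(code)
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_len):i]  # preceding tokens as the model input
        target = tokens[i]                           # token to be suggested (the class label)
        pairs.append((context, target))
    return pairs

java_snippet = "int total = items.size();"
for context, target in make_training_pairs(java_snippet):
    print(context, "->", target)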

Transformer Gated Highway

In this section, we describe the architecture of the Transformer Gated Highway, shown in Fig. 2. The model maps an input sequence of context tokens X = (x1, x2, …, xn) into a sequence of continuous representations Z = (z1, z2, …, zn). Given Z, the model classifies the next code token Y. The Transformer Gated Highway uses an attention-based transformer model to map the input tokens into continuous representations and uses the Gated Highway Decoder to classify the next token.
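
The following is a minimal PyTorch sketch of the described architecture: a transformer encoder produces the continuous representations Z from the embedded context X, and a highway-style gated decoder mixes transformed and carried representations before classifying the next token Y. The layer sizes, the exact gating formulation, and the omission of positional encodings are simplifications made for illustration; this is not the authors' implementation.

import torch
import torch.nn as nn

class GatedHighwayDecoder(nn.Module):
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.transform = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, d_model)
        self.classifier = nn.Linear(d_model, vocab_size)

    def forward(self, z):
        h = torch.relu(self.transform(z))        # candidate transformed representation
        t = torch.sigmoid(self.gate(z))          # gate deciding how much to transform vs. carry
        mixed = t * h + (1.0 - t) * z            # highway-style mix of transform and carry paths
        return self.classifier(mixed[:, -1, :])  # logits over the vocabulary for the next token Y

class TransformerGatedHighway(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)  # positional encodings omitted for brevity
        self.decoder = GatedHighwayDecoder(d_model, vocab_size)

    def forward(self, x):
        z = self.encoder(self.embed(x))  # Z: continuous representations of the context tokens X
        return self.decoder(z)

model = TransformerGatedHighway(vocab_size=1000)
logits = model(torch.randint(0, 1000, (8, 32)))  # batch of 8 contexts, 32 tokens each
print(logits.shape)  # torch.Size([8, 1000])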

Evaluation

In this section, we describe the training procedure for our proposed approach and baseline models.

Exploring the impact of learning procedures and transformers on the source code suggestion task

In this section, we evaluate the performance of CodeTran, RoBERTaCode, TravTrans, and CodeGen on the source code suggestion task. Additionally, we explore how different learning procedures can impact modeling performance. The intent here is to build baselines and push them toward enhanced performance. During our experiments, we observed that CodeTran under-performs and that early stopping was triggered at an early epoch (35) during model training. This may be because the model is unable to learn most of

Threats to validity

Although the proposed approach improves the modeling performance, several limitations still need to be addressed. The proposed approach is evaluated on each programming language and each fold independently; it may exhibit different modeling performance in cross-fold or cross-language evaluation. This limitation could be overcome by utilizing transfer learning, where the knowledge of these models could be reused and evolved for cross-fold or cross-language evaluation. Moreover, neural modeling

Related work

In this section, we detail the literature related to the task of source code suggestion. Specifically, we are interested in approaches aimed at completing the next code token. Due to lack of space, we omit approaches related to other source code modeling tasks such as code readability classification (Mi et al., 2018a, Mi et al., 2018b), API recommendation (Nguyen et al., 2016, Zhou et al., 2021, Svyatkovskiy et al., 2021, Chen et al., 2021), code and comment

Conclusion

In this work, we have explored transformer language models for the task of source code suggestion to enhance modeling performance. We have proposed a self-supervised Transformer Gated Highway approach that outperforms traditional recurrent and transformer-based language models of comparable size. Further, we have shown the generalization of our approach by employing it for a different source code modeling task, syntax error correction. The core intentions of this work were to explore

CRediT authorship contribution statement

Yasir Hussain: Conceptualization, Methodology, Investigation, Writing – original draft, Visualization, Funding acquisition. Zhiqiu Huang: Supervision, Administration, Resources, Funding acquisition. Yu Zhou: Supervision, Validation, Reviewing. Senzhang Wang: Validation, Reviewing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 62150410441), the Natural Science Foundation of Jiangsu Province (No. BK20201292), National Natural Science Foundation of China (No. 61972197, No. 61802179), and Joint Funds of the National Natural Science Foundation of China (No. U2241216).


References (60)

  • Alon, U., et al. Code2Vec: Learning distributed representations of code.
  • Bader, J., et al. Getafix: Learning to fix bugs automatically.
  • Bajracharya, S., Ngo, T., Linstead, E., Dou, Y., Rigor, P., Baldi, P., Lopes, C., 2006. Sourcerer: a search engine for...
  • Chen, C., et al., 2021. Holistic combination of structural and textual code information for context based API recommendation. IEEE Trans. Softw. Eng.
  • Ciniselli, M., et al. An empirical study on the usage of BERT models for code completion.
  • Ciniselli, M., Cooper, N., Pascarella, L., Poshyvanyk, D., Di Penta, M., Bavota, G., 2021. An empirical study on the...
  • Devlin, J., et al., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding.
  • Ding, Y., et al., 2021. Contrastive learning for source code with structural and functional properties.
  • Feng, Z., et al., 2020. CodeBERT: A pre-trained model for programming and natural languages.
  • Franks, C., et al. CACHECA: A cache language model based code suggestion tool.
  • Freitag, M., et al., 2017. Beam search strategies for neural machine translation.
  • Guo, D., et al., 2020. GraphCodeBERT: Pre-training code representations with data flow.
  • Harer, J.A., et al., 2018. Automated software vulnerability detection with machine learning.
  • He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE...
  • Hindle, A., et al. On the naturalness of software.
  • Hochreiter, S., et al., 1997. Long short-term memory. Neural Comput.
  • Hu, X., et al., 2020. Deep code comment generation with hybrid lexical and syntactical information. Empir. Softw. Eng.
  • Huo, X., Li, M., 2017. Enhancing the unified features to locate buggy files by exploiting the sequential nature of...
  • Hussain, Y., et al., 2021. Improving source code suggestion with code embedding and enhanced convolutional long short-term memory. IET Softw.
  • Kanade, A., et al. Learning and evaluating contextual embedding of source code.


    Yasir Hussain received his B.Sc. degree from Bahauddin Zakariya University (BZU), Pakistan, in 2013 and the Master’s degree in computer science from the Virtual University of Pakistan in 2015. He received his Ph.D. degree in Computer Science from Nanjing University of Aeronautics and Astronautics (China) in 2020. Currently, he is conducting his PostDoc research in software engineering at Nanjing University of Aeronautics and Astronautics (China). He is particularly interested in software engineering, source code modeling, machine learning, deep learning, recommendation systems, and predictive modeling.

    Huang Zhiqiu is a full professor at Nanjing University of Aeronautics and Astronautics. He received his B.Sc. and M.Sc. degrees in computer science from the National University of Defense Technology of China. He received his Ph.D. degree in computer science from Nanjing University of Aeronautics and Astronautics of China. His research interests include big data analysis, cloud computing, and web services.

    Yu Zhou is a full professor at Nanjing University of Aeronautics and Astronautics. He received his B.Sc. degree in 2004 and his Ph.D. degree in 2009, both in Computer Science from Nanjing University, China, and conducted his PostDoc research in software engineering at Politecnico di Milano, Italy. In 2015–2016, he visited the SEAL lab at the University of Zurich, Switzerland, where he was also an adjunct researcher. His research interests mainly include software evolution analysis, mining software repositories, software architecture, and reliability analysis.

    Senzhang Wang is a full professor at the College of Computer Science and Technology, Central South University. He received his B.Sc. degree from Southeast University, Nanjing, China, in 2009, and his Ph.D. degree from Beihang University, Beijing, China, in 2016. His main research interests include data mining, social computing, and urban computing. He has published more than 80 referred conference and journal papers.

    Editor: Lingxiao Jiang.
