Abstract
In software engineering (SE), code classification and related tasks, such as code clone detection, remain challenging problems. Because of the elusive syntax and complicated semantics of software programs, traditional SE approaches still struggle to differentiate the functionalities of code snippets at the semantic level with high accuracy. With the rapid advance of artificial intelligence (AI) techniques in recent years, exploring machine/deep learning techniques for code classification has become important. Most existing machine/deep learning-based approaches process code text with convolutional neural networks (CNNs) or recurrent neural networks (RNNs). However, both networks suffer from gradient vanishing problems and fail to capture long-distance dependencies between code statements, resulting in poor performance on downstream tasks. In this paper, we propose TBCC (Transformer-Based Code Classifier), a novel transformer-based neural network for programming language processing that avoids these two problems. Moreover, to capture important syntactical features of programming languages, we split deep abstract syntax trees (ASTs) into smaller subtrees, aiming to exploit the syntactical information in code statements. We apply TBCC to two common program comprehension tasks to verify its effectiveness: code classification for C programs and code clone detection for Java programs. The experimental results show that TBCC achieves state-of-the-art performance, outperforming the baseline methods in terms of accuracy, recall, and F1 score. To support subsequent research, the code of TBCC has been released.
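The abstract's core preprocessing idea, splitting a deep AST into smaller statement-level subtrees before feeding them to a transformer, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: Python's built-in `ast` module stands in for the C/Java parsers used in the paper, and the split granularity (one subtree per top-level statement of each function body) is an assumed rule, not TBCC's exact one.

```python
import ast


def split_into_subtrees(source: str) -> list:
    """Split a program's AST into statement-level subtrees.

    Hypothetical stand-in for TBCC's subtree splitting step:
    each statement in a function body becomes its own subtree,
    so syntactic structure is preserved at a finer granularity
    than the whole-program AST.
    """
    tree = ast.parse(source)
    subtrees = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # One subtree per statement in the function body.
            subtrees.extend(node.body)
    return subtrees


code = """
def add(a, b):
    total = a + b
    return total
"""
subs = split_into_subtrees(code)
print([type(s).__name__ for s in subs])  # ['Assign', 'Return']
```

In a pipeline like the one the abstract describes, each subtree would then be linearized (e.g., by a pre-order traversal of node types) into a token sequence for the transformer encoder.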
Acknowledgements
We would like to thank the anonymous reviewers for their helpful comments.
Ethics declarations
Conflict of interest statement
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or other personal interest of any nature in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, this manuscript. We would like to thank Dr. Yulei Sui from the University of Technology Sydney for his helpful comments. This research is supported by the National Natural Science Foundation of China (Grant Nos. 61672338 and 61373028).
Cite this article
Hua, W., Liu, G. Transformer-based networks over tree structures for code classification. Appl Intell 52, 8895–8909 (2022). https://doi.org/10.1007/s10489-021-02894-2