
Transformer-based networks over tree structures for code classification


Abstract

In software engineering (SE), code classification and related tasks, such as code clone detection, remain challenging problems. Due to the elusive syntax and complicated semantics of software programs, traditional SE approaches still have difficulty differentiating the functionality of code snippets at the semantic level with high accuracy. As artificial intelligence (AI) techniques have advanced in recent years, exploring machine/deep learning techniques for code classification has become important. However, most existing machine/deep-learning-based approaches process code text with convolutional neural networks (CNNs) or recurrent neural networks (RNNs); both networks suffer from vanishing gradients and fail to capture long-distance dependencies between code statements, resulting in poor performance on downstream tasks. In this paper, we propose TBCC (Transformer-Based Code Classifier), a novel transformer-based neural network for programming language processing that avoids both problems. Moreover, to capture important syntactic features of programming languages, we split deep abstract syntax trees (ASTs) into smaller subtrees, exploiting the syntactic information in code statements. We have applied TBCC to two common program comprehension tasks to verify its effectiveness: code classification for C programs and code clone detection for Java programs. The experimental results show that TBCC achieves state-of-the-art performance, outperforming the baseline methods in terms of accuracy, recall, and F1 score. To support subsequent research, the code of TBCC has been publicly released (http://github.com/preesee/tbcc).
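To make the pipeline sketched in the abstract concrete, the following Python fragment illustrates one plausible reading of it: parse a C snippet into an AST with pycparser (see note 2 below), split the deep tree into per-statement subtrees, and classify the resulting token sequences with a PyTorch Transformer encoder. This is a minimal sketch under our own assumptions; the splitting granularity, tokenization, and all hyperparameters are illustrative, and it is not the authors' released implementation (linked above).

```python
# Illustrative sketch only: parse C code into an AST, split it into
# per-statement subtrees, and classify with a Transformer encoder.
# The splitting granularity and all hyperparameters are assumptions,
# not TBCC's actual implementation.
import torch
import torch.nn as nn
from pycparser import c_parser


def split_into_subtrees(code: str):
    """Return one flat list of node-type tokens per statement subtree."""
    ast = c_parser.CParser().parse(code)

    def tokens(node):
        # Pre-order traversal, using pycparser node class names as tokens.
        toks = [type(node).__name__]
        for _, child in node.children():
            toks.extend(tokens(child))
        return toks

    subtrees = []
    for _, ext in ast.children():             # top-level declarations
        body = getattr(ext, "body", None)     # function bodies only
        if body is not None:
            for _, stmt in body.children():   # one subtree per statement
                subtrees.append(tokens(stmt))
    return subtrees


class TransformerCodeClassifier(nn.Module):
    """Embed subtree tokens, encode with self-attention, mean-pool, classify."""

    def __init__(self, vocab_size, num_classes, d_model=128, nhead=8, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        h = self.encoder(self.embed(token_ids))
        return self.head(h.mean(dim=1))        # logits: (batch, num_classes)


if __name__ == "__main__":
    print(split_into_subtrees("int main() { int x = 1; return x; }"))
    # e.g. [['Decl', 'TypeDecl', 'IdentifierType', 'Constant'], ['Return', 'ID']]
```

Unlike an RNN read of the token stream, every subtree token can attend to every other one here, which is the property the abstract invokes against vanishing gradients and lost long-distance dependencies.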


Notes

  1. https://sites.google.com/site/treebasedcnn

  2. https://pypi.python.org/pypi/pycparser

  3. https://github.com/clonebench/BigCloneBench

  4. https://github.com/antlr/antlr4/blob/master/doc/index.md

  5. https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html (a brief usage sketch follows these notes)
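Note 5 links imbalanced-learn's over-sampling utilities. A minimal usage sketch follows, under the assumption (not confirmed by this page) that random over-sampling was used to rebalance training labels; the data here is a toy example.

```python
# Minimal illustration of random over-sampling with imbalanced-learn,
# as linked in note 5. Whether and how TBCC's training data was
# rebalanced is an assumption; the inputs below are toy data.
from collections import Counter

from imblearn.over_sampling import RandomOverSampler

X = [[0.1], [0.2], [0.3], [0.4], [0.5]]    # toy feature vectors
y = [0, 0, 0, 0, 1]                         # imbalanced labels (4:1)

X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y_res))                       # Counter({0: 4, 1: 4})
```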


Acknowledgements

We would like to thank the anonymous reviewers for their helpful comments.

Author information


Corresponding author

Correspondence to Wei Hua.

Ethics declarations

Conflict of interest statement

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or personal interest of any nature in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, this manuscript. We thank Dr. Yulei Sui from the University of Technology Sydney for his helpful comments. This research is supported by the National Natural Science Foundation of China (Grant Nos. 61672338 and 61373028).

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The source code of TBCC is available at http://github.com/preesee/tbcc.


About this article


Cite this article

Hua, W., Liu, G. Transformer-based networks over tree structures for code classification. Appl Intell 52, 8895–8909 (2022). https://doi.org/10.1007/s10489-021-02894-2
