Abstract
Applying machine learning techniques to program analysis has attracted much attention. Recent research efforts on detecting code clones and classifying code have shown that neural models based on abstract syntax trees (ASTs) can represent source code better than other approaches. However, existing AST-based approaches do not take into account contextual information of a program, such as statement context. To address this issue, we propose a novel approach, path context, to capture the context of statements, and a path context augmented network (PCAN) to learn program representations. We evaluate PCAN on code clone detection, source code classification, and method naming. The results show that, compared to state-of-the-art approaches, PCAN performs best on code clone detection and achieves comparable performance on code classification and method naming.
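The statement-level path idea can be illustrated with a small sketch: for each statement in a program's AST, collect the root-to-leaf paths of node types beneath it, which together describe the statement's syntactic context. This is only a rough illustration of path-based context using Python's standard `ast` module, not the paper's exact construction of path context or PCAN.

```python
import ast

def statement_paths(source):
    """For every statement node in the AST of `source`, collect the
    root-to-leaf paths of AST node-type names beneath it.
    Illustrative only -- not the paper's exact path-context definition."""
    tree = ast.parse(source)
    result = []
    for node in ast.walk(tree):
        if isinstance(node, ast.stmt):
            paths = []
            stack = [(node, [])]
            while stack:
                n, prefix = stack.pop()
                prefix = prefix + [type(n).__name__]
                children = list(ast.iter_child_nodes(n))
                if children:
                    stack.extend((c, prefix) for c in children)
                else:
                    # Leaf reached: record the full type path from the statement
                    paths.append("/".join(prefix))
            result.append((type(node).__name__, sorted(paths)))
    return result

for stmt_type, paths in statement_paths("y = x + 1"):
    print(stmt_type, paths)
```

For `y = x + 1`, the single `Assign` statement yields paths such as `Assign/BinOp/Name/Load` and `Assign/Name/Store`; a learned model would embed such paths (or paths between leaves) rather than print them.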
Notes
https://docs.oracle.com/javase/tutorial/java/nutsandbolts/expressions.html
pycparser homepage, https://pypi.python.org/pypi/pycparser
javaparser homepage, https://github.com/javaparser/javaparser
We built our dataset by following the data preparation method of CDLH (Wei and Li 2017) so that we could directly use and compare against their results.
Astminer: https://github.com/JetBrains-Research/astminer
References
Ahmadi M, Farkhani RM, Williams R, Lu L (2021) Finding bugs using your own code: detecting functionally-similar yet inconsistent code. In: 30th USENIX Security Symposium (USENIX Security 21)
Allamanis M, Brockschmidt M, Khademi M (2018) Learning to represent programs with graphs. In: 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings. arXiv:1711.00740
Alon U, Zilberstein M, Levy O, Yahav E (2018) A general path-based representation for predicting program properties. SIGPLAN Not 53(4):404–419. https://doi.org/10.1145/3296979.3192412
Alon U, Levy O, Brody S, Yahav E (2019a) Code2Seq: generating sequences from structured representations of code. In: 7th International Conference on Learning Representations, ICLR 2019, pp 1–22. arXiv:1808.01400
Alon U, Zilberstein M, Levy O, Yahav E (2019b) Code2vec: learning distributed representations of code. Proceedings of the ACM on Programming Languages 3(POPL):1–29. https://doi.org/10.1145/3290353. arXiv:1803.09473
Cai D, Lam W (2020) Graph transformer for graph-to-sequence learning. arXiv:1911.07470
Cho K, van Merrienboer B, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: encoder-decoder approaches. In: Wu D, Carpuat M, Carreras X, Vecchi EM (eds) Proceedings of SSST@EMNLP 2014, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014. Association for Computational Linguistics, pp 103–111. https://doi.org/10.3115/v1/W14-4012. https://www.aclweb.org/anthology/W14-4012/
Cummins C, Fisches ZV, Ben-Nun T, Hoefler T, Leather H (2020) ProGraML: graph-based deep learning for program optimization and analysis. arXiv preprint arXiv:2003.10536
Falke R, Frenzel P, Koschke R (2008) Empirical evaluation of clone detection using syntax suffix trees. Empirical Software Engineering 13(6):601–643. https://doi.org/10.1007/s10664-008-9073-9
Fang C, Liu Z, Shi Y, Huang J, Shi Q (2020) Functional code clone detection with syntax and semantics fusion learning. Issta 20. https://doi.org/10.1145/3395363.3397362
Fernandes P, Allamanis M, Brockschmidt M (2019) Structured neural summarization. In: 7th International Conference on Learning Representations, ICLR 2019, pp 1–18. arXiv:1811.01824
Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE (2017) Neural message passing for quantum chemistry. In: 34th International Conference on Machine Learning, ICML 2017, pp 2053–2070. arXiv:1704.01212
Goffi A, Gorla A, Mattavelli A, Pezzè M, Tonella P (2014) Search-based synthesis of equivalent method sequences. In: FSE 2014
Hellendoorn VJ, Sutton C, Singh R, Maniatis P, Bieber D (2019) Global relational models of source code. In: International conference on learning representations
Hindle A, Barr ET, Gabel M, Su Z, Devanbu PT (2016) On the naturalness of software. Commun ACM 59(5):122–131. https://doi.org/10.1145/2902362
Hu X, Li G, Xia X, Lo D, Jin Z (2020) Deep code comment generation with hybrid lexical and syntactical information. Empirical Software Engineering 25(3):2179–2217. https://doi.org/10.1007/s10664-019-09730-9
Jiang L, Misherghi G, Su Z, Glondu S (2007) DECKARD: scalable and accurate tree-based detection of code clones. In: 29th International Conference on Software Engineering (ICSE’07), pp 96–105
Kamiya T, Kusumoto S, Inoue K (2002) CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans Software Eng 28:654–670
Khandelwal U, He H, Qi P, Jurafsky D (2018) Sharp nearby, fuzzy far away: how neural language models use context. In: Proceedings of the 56th annual meeting of the association for computational linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, pp 284–294. https://doi.org/10.18653/v1/P18-1027. https://www.aclweb.org/anthology/P18-1027
Kingma DP, Ba JL (2015) Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, pp 1–15. arXiv:1412.6980
Li L, Feng H, Zhuang W, Meng N, Ryder B (2017) CCLearner: a deep learning-based clone detection approach. In: Proceedings - 2017 IEEE international conference on software maintenance and evolution, ICSME 2017, pp 249–260. https://doi.org/10.1109/ICSME.2017.46
Li Y, Zemel R, Brockschmidt M, Tarlow D (2016) Gated graph sequence neural networks. 4th International Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings (1):1–20. arXiv:1511.05493v4
Li Y, Gu C, Dullien T, Vinyals O, Kohli P (2019) Graph matching networks for learning the similarity of graph structured objects. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of the 36th international conference on machine learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, PMLR, Proceedings of Machine Learning Research, vol 97, pp 3835–3845. http://proceedings.mlr.press/v97/li19d.html
Lin Z, Feng M, Dos Santos CN, Yu M, Xiang B, Zhou B, Bengio Y (2017) A structured self-attentive sentence embedding. In: 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings, pp 1–15. arXiv:1703.03130
Linares-Vásquez M, Mcmillan C, Poshyvanyk D, Grechanik M (2014) On using machine learning to automatically classify software applications into domain categories. Empirical Software Engineering 19(3):582–618. https://doi.org/10.1007/s10664-012-9230-z
Luong T, Pham H, Manning CD (2015) Effective approaches to attention-based neural machine translation. In: Màrquez L, Callison-Burch C, Su J, Pighin D, Marton Y (eds) Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, The Association for Computational Linguistics, pp 1412–1421. https://doi.org/10.18653/v1/d15-1166
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. In: 1st International Conference on Learning Representations, ICLR 2013 - Workshop Track Proceedings, pp 1–12. arXiv:1301.3781
Mou L, Li G, Zhang L, Wang T, Jin Z (2016) Convolutional neural networks over tree structures for programming language processing. In: 30th AAAI Conference on Artificial Intelligence, AAAI 2016, pp 1287–1293. arXiv:1409.5718
Nafi KW, Kar TS, Roy B, Roy CK, Schneider KA (2019) CLCDSA: Cross language code clone detection using syntactical features and API documentation. Proceedings - 2019 34th IEEE/ACM International Conference on Automated Software Engineering, ASE 2019, pp 1026–1037. https://doi.org/10.1109/ASE.2019.00099
Narayanan A, Chandramohan M, Venkatesan R, Chen L, Liu Y, Jaiswal S (2017) graph2vec: learning distributed representations of graphs. CoRR abs/1707.05005. http://arxiv.org/abs/1707.05005
Ragkhitwetsagul C, Krinke J (2019) Siamese: scalable and incremental code clone search via multiple code representations. Empirical Software Engineering 24(4):2236–2284. https://doi.org/10.1007/s10664-019-09697-7
Roy CK, Cordy JR (2008) NICAD: accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. IEEE International Conference on Program Comprehension, pp 172–181. https://doi.org/10.1109/ICPC.2008.41
Saini V, Farmahinifarahani F, Lu Y, Baldi P, Lopes CV (2018) Oreo: detection of clones in the twilight zone. In: ESEC/FSE 2018 - Proceedings of the 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp 354–365. https://doi.org/10.1145/3236024.3236026. arXiv:1806.05837
Sajnani H, Saini V, Svajlenko J, Roy CK, Lopes CV (2016) SourcererCC: scaling code clone detection to big-code. In: 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), pp 1157–1168
Socher R, Pennington J, Huang EH, Ng AY, Manning CD (2011) Semi-supervised recursive autoencoders for predicting sentiment distributions. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, A meeting of SIGDAT, a Special Interest Group of the ACL, ACL, pp 151–161, https://www.aclweb.org/anthology/D11-1014/
Tai KS, Socher R, Manning CD (2015) Improved semantic representations from tree-structured long short-term memory networks. In: ACL-IJCNLP 2015 - 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, vol 1, pp 1556–1566. https://doi.org/10.3115/v1/p15-1150. arXiv:1503.00075
Tufano M, Watson C, Bavota G, Di Penta M, White M, Poshyvanyk D (2018) Deep learning similarities from different representations of source code. In: Proceedings - International Conference on Software Engineering, IEEE Computer Society, vol 18, pp 542–553. https://doi.org/10.1145/3196398.3196431
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in Neural Information Processing Systems, pp 5999–6009. arXiv:1706.03762
Wang W, Li G, Ma B, Xia X, Jin Z (2020a) Detecting code clones with graph neural network and flow-augmented abstract syntax tree. arXiv preprint arXiv:2002.08653
Wang W, Li G, Shen S, Xia X, Jin Z (2020b) Modular tree network for source code representation learning. ACM Transactions on Software Engineering and Methodology 29(4):1–23. https://doi.org/10.1145/3409331
Wang Y, Wang K, Gao F, Wang L (2020c) Learning semantic program embeddings with graph interval neural network. Proceedings of the ACM on Programming Languages 4(OOPSLA):1–27
Wei HH, Li M (2017) Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code. In: IJCAI International Joint Conference on Artificial Intelligence, pp 3034–3040. https://doi.org/10.24963/ijcai.2017/423
White M, Tufano M, Vendome C, Poshyvanyk D (2016) Deep learning code fragments for code clone detection. ASE 2016 - Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, pp 87–98. https://doi.org/10.1145/2970276.2970326
Zhang J, Wang X, Zhang H, Sun H, Wang K, Liu X (2019) A novel neural source code representation based on abstract syntax tree. In: Proceedings - International Conference on Software Engineering, IEEE Computer Society, pp 783–794. https://doi.org/10.1109/ICSE.2019.00086
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Additional information
Communicated by Foutse Khomh, Gemma Catolino and Pasquale Salza.
This article belongs to the Topical Collection: Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE)
About this article
Cite this article
Xiao, D., Hang, D., Ai, L. et al. Path context augmented statement and network for learning programs. Empir Software Eng 27, 37 (2022). https://doi.org/10.1007/s10664-021-10098-y