Toward Interpretable Graph Tensor Convolution Neural Network for Code Semantics Embedding

Published: 21 July 2023

Abstract

Intelligent deep learning-based models have made significant progress in automated source code semantics embedding, and current research mainly leverages natural language-based and graph-based methods. However, natural language-based methods do not capture the rich structural semantics of source code, while graph-based methods fail to exploit distant information in source code because of the high cost of message-passing steps.

In this article, we propose a novel interpretable model, the graph tensor convolution neural network (GTCN), which generates accurate code embeddings by comprehensively capturing both the distant information of code sequences and the rich structural semantics of code. First, we utilize a high-dimensional tensor to integrate various heterogeneous code graphs, such as control flow and data flow, with node sequence features. Second, inspired by recent advances in graph-based deep learning and efficient tensor computation, we propose a novel interpretable graph tensor convolution neural network that learns accurate code semantics embeddings from the code graph tensor. Finally, we evaluate GTCN on three popular applications: variable misuse detection, source code prediction, and vulnerability detection. Compared with current state-of-the-art methods, our model achieves higher top-1 accuracy while requiring less training time.
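The core idea described above, stacking one adjacency matrix per edge type (e.g., control flow, data flow) into a tensor and convolving node features over each slice, can be illustrated with a minimal sketch. This is an assumption-laden toy illustration, not the authors' implementation: the function names, the per-slice degree normalization, and the per-type weight matrices are all hypothetical simplifications.

```python
import numpy as np

def build_code_graph_tensor(num_nodes, edge_lists):
    """Stack one adjacency matrix per edge type (e.g., control flow,
    data flow) into a tensor of shape (num_types, N, N)."""
    tensor = np.zeros((len(edge_lists), num_nodes, num_nodes))
    for k, edges in enumerate(edge_lists):
        for src, dst in edges:
            tensor[k, src, dst] = 1.0
    return tensor

def graph_tensor_conv(tensor, features, weights):
    """One GCN-style propagation per slice, summed across edge types.
    tensor: (K, N, N), features: (N, d_in), weights: (K, d_in, d_out).
    (A simplified stand-in for the paper's tensor convolution.)"""
    out = np.zeros((features.shape[0], weights.shape[2]))
    for k in range(tensor.shape[0]):
        # Add self-loops, then normalize each slice by node degree.
        a = tensor[k] + np.eye(tensor.shape[1])
        a_norm = a / a.sum(axis=1, keepdims=True)
        out += a_norm @ features @ weights[k]
    return np.maximum(out, 0.0)  # ReLU

# Toy program graph: 3 nodes, two control-flow edges, one data-flow edge.
cf_edges = [(0, 1), (1, 2)]
df_edges = [(0, 2)]
T = build_code_graph_tensor(3, [cf_edges, df_edges])
X = np.ones((3, 4))                                   # node sequence features
W = np.random.default_rng(0).normal(size=(2, 4, 8))   # one weight per edge type
H = graph_tensor_conv(T, X, W)
print(H.shape)  # (3, 8)
```

Each node embedding in `H` aggregates information from every edge type in a single pass, which is the motivation the abstract gives for the tensor formulation over per-graph message passing.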



Published in

ACM Transactions on Software Engineering and Methodology, Volume 32, Issue 5
September 2023, 905 pages
ISSN: 1049-331X
EISSN: 1557-7392
DOI: 10.1145/3610417
Editor: Mauro Pezzè


Publisher

Association for Computing Machinery, New York, NY, United States

              Publication History

              • Published: 21 July 2023
              • Online AM: 20 February 2023
              • Accepted: 4 January 2023
              • Revised: 17 November 2022
              • Received: 18 July 2022
