Abstract
Deep learning-based models have made significant progress in automated source code semantics embedding, and current research mainly follows natural language-based methods and graph-based methods. However, natural language-based methods do not capture the rich structural semantics of source code, while graph-based methods cannot exploit distant information in source code because of the high cost of many message-passing steps.
In this article, we propose a novel interpretable model, the graph tensor convolution neural network (GTCN), to generate accurate code embeddings that comprehensively capture both the distant information of code sequences and the rich structural semantics of code. First, we propose a high-dimensional tensor that integrates various heterogeneous code graphs, such as control flow and data flow, with node sequence features. Second, inspired by recent advances in graph-based deep learning and efficient tensor computation, we design an interpretable graph tensor convolution neural network that learns accurate code semantic embeddings from this code graph tensor. Finally, we evaluate GTCN on three popular applications: variable misuse detection, source code prediction, and vulnerability detection. Compared with current state-of-the-art methods, our model achieves higher top-1 accuracy while requiring less training time.
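As an illustrative sketch only (not the authors' implementation; all names and dimensions here are hypothetical), the idea of integrating heterogeneous code graphs into one tensor can be pictured as stacking one adjacency matrix per edge type into a third-order tensor, then aggregating node features across all relations in a single tensor contraction:

```python
import numpy as np

num_nodes = 4
edge_types = ["ast", "control_flow", "data_flow"]  # hypothetical relation set

# One n x n adjacency matrix per heterogeneous relation.
adj = {t: np.zeros((num_nodes, num_nodes)) for t in edge_types}
adj["ast"][0, 1] = adj["ast"][1, 2] = 1.0
adj["control_flow"][1, 3] = 1.0
adj["data_flow"][0, 3] = 1.0

# Stack into a third-order "code graph tensor": (relations, nodes, nodes).
graph_tensor = np.stack([adj[t] for t in edge_types])
print(graph_tensor.shape)  # (3, 4, 4)

# A minimal tensor-convolution-style step: propagate node features X
# over every relation at once, with a per-relation projection W,
# summing the per-relation messages into one embedding per node.
X = np.random.randn(num_nodes, 8)           # node sequence features
W = np.random.randn(len(edge_types), 8, 8)  # per-relation weights
H = np.einsum("rij,jk,rkl->il", graph_tensor, X, W)
print(H.shape)  # (4, 8)
```

One `einsum` contraction replaces a loop of separate per-relation message-passing rounds, which is the kind of efficiency the tensor formulation aims at.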
Toward Interpretable Graph Tensor Convolution Neural Network for Code Semantics Embedding