Abstract
Learning from source code usually requires a large amount of labeled data, which is often scarce. Moreover, even when labeled data are available, the trained model is highly task-specific and transfers poorly to other tasks. In this work, we present effective pre-training strategies on top of a novel graph-based code representation, with the goal of producing universal representations for code. Specifically, our graph-based representation captures important semantic relationships between code elements (e.g., control flow and data flow). We pre-train graph neural networks on this representation to extract universal code properties. The pre-trained model can then be fine-tuned to support various downstream applications. We evaluate our model on two real-world datasets spanning over 30M Java methods and 770K Python methods. Through visualization, we reveal discriminative properties in our universal code representation. Through comparison against multiple benchmarks, we demonstrate that the proposed framework achieves state-of-the-art results on method name prediction and code graph link prediction.
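The paper's exact graph schema and pre-training objectives are defined in the body of the paper, not reproduced here. As a rough, hypothetical sketch of the kind of input such a pipeline consumes, the snippet below builds a toy code graph from Python source using only the standard ast module: AST parent-child edges stand in for syntactic structure, and naive "previous occurrence of the same variable" edges stand in for data flow. All function and edge-type names are illustrative, not the authors'.

```python
# Illustrative sketch only; not the paper's actual graph construction.
import ast

def build_code_graph(source: str):
    """Return (nodes, edges), with edges as (src_id, dst_id, edge_type)."""
    tree = ast.parse(source)
    node_ids, nodes, edges = {}, [], []

    # One graph node per AST node, labeled with its syntactic type.
    for i, node in enumerate(ast.walk(tree)):
        node_ids[node] = i
        nodes.append(type(node).__name__)

    # Syntactic structure: parent -> child edges.
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges.append((node_ids[parent], node_ids[child], "child"))

    # Crude data flow: link each variable occurrence to its previous one.
    # (Occurrence order follows ast.walk's traversal, a simplification.)
    last_seen = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if node.id in last_seen:
                edges.append(
                    (node_ids[last_seen[node.id]], node_ids[node], "data_flow"))
            last_seen[node.id] = node

    return nodes, edges

nodes, edges = build_code_graph("def f(x):\n    y = x + 1\n    return y\n")
print(len(nodes), "nodes;",
      sum(e[2] == "data_flow" for e in edges), "data-flow edge(s)")
```

A real pipeline would additionally add control-flow edges and feed the typed edge list into a relational GNN for pre-training; this sketch shows only the graph-extraction step.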
L. Liu—Work done while the author was an intern at Amazon Web Services.
Notes
- 1.
We use simple types instead of fully qualified types, since we create graphs from source files rather than from builds; in that setting, types cannot be fully resolved (see the sketch below).
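As a hedged illustration of this distinction (using Python and its standard ast module rather than Java), the snippet below reads a type annotation exactly as it is spelled in the source. Without a build or import resolution, only this simple form is observable, e.g. List rather than java.util.List (or typing.List in Python):

```python
# Hypothetical illustration of "simple" vs. fully qualified types.
import ast

source = (
    "from typing import List\n"
    "def f(xs: List[int]) -> int:\n"
    "    return len(xs)\n"
)
tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.arg) and node.annotation is not None:
        # ast.unparse (Python 3.9+) yields the literal spelling 'List[int]',
        # not a fully qualified name such as 'typing.List[int]'.
        print(node.arg, "->", ast.unparse(node.annotation))
```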
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, L., Nguyen, H., Karypis, G., Sengamedu, S. (2021). Universal Representation for Code. In: Karlapalem, K., et al. (eds.) Advances in Knowledge Discovery and Data Mining. PAKDD 2021. Lecture Notes in Computer Science, vol. 12714. Springer, Cham. https://doi.org/10.1007/978-3-030-75768-7_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-75767-0
Online ISBN: 978-3-030-75768-7
eBook Packages: Computer Science, Computer Science (R0)