
Universal Representation for Code

Conference paper. In: Advances in Knowledge Discovery and Data Mining (PAKDD 2021).

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12714)


Abstract

Learning from source code usually requires a large amount of labeled data. Beyond the cost of obtaining such labels, the resulting model is highly task-specific and transfers poorly to other tasks. In this work, we present effective pre-training strategies on top of a novel graph-based code representation to produce universal representations for code. Specifically, our graph-based representation captures important semantic relationships between code elements (e.g., control flow and data flow). We pre-train graph neural networks on this representation to extract universal code properties. The pre-trained model can then be fine-tuned to support various downstream applications. We evaluate our model on two real-world datasets spanning over 30M Java methods and 770K Python methods. Through visualization, we reveal discriminative properties in our universal code representation. Comparing against multiple benchmarks, we demonstrate that the proposed framework achieves state-of-the-art results on method name prediction and code graph link prediction.
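To make the representation concrete, below is a minimal sketch of a heterogeneous code graph of the kind the abstract describes, built with DGL (the library referenced in footnote 2) and encoded with one relational graph convolution layer. The node type `stmt`, the edge types `control_flow` and `data_flow`, and the 16-dimensional features are illustrative assumptions, not the paper's actual schema.

```python
# Sketch: a tiny code graph for one method, with control-flow and
# data-flow relations between statement nodes, encoded by a relational
# GNN layer. Assumes DGL + PyTorch; all names here are illustrative.
import torch
import dgl
import dgl.nn as dglnn

# Three statements: 0 -> 1 -> 2 in control flow; statement 0 defines a
# variable that statement 2 later reads (a data-flow edge).
graph = dgl.heterograph({
    ('stmt', 'control_flow', 'stmt'): (torch.tensor([0, 1]), torch.tensor([1, 2])),
    ('stmt', 'data_flow', 'stmt'): (torch.tensor([0]), torch.tensor([2])),
})
graph.nodes['stmt'].data['feat'] = torch.randn(3, 16)  # e.g. token embeddings

# One relational convolution (in the spirit of R-GCN): a separate message
# function per edge type, summed per destination node.
conv = dglnn.HeteroGraphConv(
    {rel: dglnn.GraphConv(16, 16) for rel in graph.etypes},
    aggregate='sum')
h = conv(graph, {'stmt': graph.nodes['stmt'].data['feat']})
print(h['stmt'].shape)  # torch.Size([3, 16]): one embedding per statement
```

Stacking such layers and reading out per-method embeddings is one plausible way to realize the encoder; the paper's exact architecture and feature set may differ.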

L. Liu—Work done while the author was an intern at Amazon Web Services.



Notes

  1. We use simple types instead of fully qualified types since we create graphs from source files and not builds. In this case, types are not fully resolvable.

  2. https://github.com/dmlc/dgl/tree/master/examples/pytorch/rgcn-hetero (a hedged pre-training sketch follows these notes).
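Following up on footnote 2: since the paper pre-trains graph neural networks on the code graph and evaluates code graph link prediction, the sketch below shows one standard self-supervised objective for such an encoder, namely link prediction with negative sampling over dot-product edge scores. This is a generic objective under stated assumptions, not necessarily the paper's exact loss; `h` stands for node embeddings such as those produced by the encoder sketched after the abstract.

```python
# Sketch: generic link-prediction pre-training signal for a graph encoder.
# Assumes `h` holds one embedding per node (e.g. the R-GCN output above);
# the loss is a standard negative-sampling objective, not necessarily the
# one used in the paper.
import torch
import torch.nn.functional as F

def link_prediction_loss(h, pos_edges, neg_edges):
    """Binary cross-entropy over dot-product edge scores.

    h:         (num_nodes, dim) node embeddings from the GNN encoder
    pos_edges: (2, P) endpoints of edges observed in the code graph
    neg_edges: (2, N) endpoints of randomly corrupted non-edges
    """
    pos_score = (h[pos_edges[0]] * h[pos_edges[1]]).sum(dim=-1)
    neg_score = (h[neg_edges[0]] * h[neg_edges[1]]).sum(dim=-1)
    scores = torch.cat([pos_score, neg_score])
    labels = torch.cat([torch.ones_like(pos_score),
                        torch.zeros_like(neg_score)])
    return F.binary_cross_entropy_with_logits(scores, labels)

# Example with the three-node graph above: observed edges vs. one
# randomly sampled non-edge.
h = torch.randn(3, 16)
pos = torch.tensor([[0, 1, 0], [1, 2, 2]])  # control- and data-flow edges
neg = torch.tensor([[2], [0]])              # a sampled non-edge
print(link_prediction_loss(h, pos, neg).item())
```

Minimizing this loss pushes embeddings of connected code elements to score higher than corrupted pairs, which is what makes the pre-trained representation reusable for downstream link prediction.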


Author information


Corresponding author

Correspondence to Linfeng Liu.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, L., Nguyen, H., Karypis, G., Sengamedu, S. (2021). Universal Representation for Code. In: Karlapalem, K., et al. (eds.) Advances in Knowledge Discovery and Data Mining. PAKDD 2021. Lecture Notes in Computer Science, vol 12714. Springer, Cham. https://doi.org/10.1007/978-3-030-75768-7_2


  • DOI: https://doi.org/10.1007/978-3-030-75768-7_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-75767-0

  • Online ISBN: 978-3-030-75768-7

  • eBook Packages: Computer Science, Computer Science (R0)
