DOI: 10.1145/3510003.3510096
research-article

SPT-Code: sequence-to-sequence pre-training for learning source code representations

Published: 05 July 2022

ABSTRACT

Recent years have seen the successful application of large pre-trained models to code representation learning, resulting in substantial improvements on many code-related downstream tasks. However, several issues surround their application to SE tasks. First, the majority of pre-trained models focus on pre-training only the encoder of the Transformer. For generation tasks that are addressed with encoder-decoder models, however, there is no reason why the decoder should be left out during pre-training. Second, many existing pre-trained models, including state-of-the-art models such as T5-learning, simply reuse pre-training tasks designed for natural languages. Moreover, to learn the natural language description of source code that is eventually needed for code-related tasks such as code summarization, existing pre-training tasks require a bilingual corpus composed of source code and the associated natural language descriptions, which severely limits the amount of data available for pre-training. To this end, we propose SPT-Code, a sequence-to-sequence pre-trained model for source code. In order to pre-train SPT-Code in a sequence-to-sequence manner and address the aforementioned weaknesses of existing pre-training tasks, we introduce three pre-training tasks specifically designed to enable SPT-Code to learn knowledge of source code, the corresponding code structure, and a natural language description of the code without relying on any bilingual corpus, and to eventually exploit these three sources of information when applied to downstream tasks. Experimental results demonstrate that SPT-Code achieves state-of-the-art performance on five code-related downstream tasks after fine-tuning.
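
As a concrete illustration of the setup described above, the following Python sketch shows one way a sequence-to-sequence pre-training example could be assembled from a single function: the code tokens, a linearized AST, and a natural-language component recovered from the code itself (here, the method name), so that no bilingual corpus is needed. This is a minimal sketch under stated assumptions, not the authors' implementation; the separator token, the input layout, the name-based target, and the use of Python's ast module are illustrative choices only.

    import ast
    import re

    def split_identifier(name):
        """Split a snake_case or camelCase identifier into lowercase subtokens."""
        parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", name)
        return [p.lower() for p in parts if p]

    def linearize_ast(node):
        """Flatten an AST into a pre-order sequence of node-type tokens."""
        tokens = [type(node).__name__]
        for child in ast.iter_child_nodes(node):
            tokens.extend(linearize_ast(child))
        return tokens

    def build_pretraining_example(source):
        """Assemble one (encoder input, decoder target) pair from a single function."""
        tree = ast.parse(source)
        func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))

        code_tokens = source.split()             # placeholder for a real code tokenizer
        ast_tokens = linearize_ast(func)         # structural view of the same code
        nl_tokens = split_identifier(func.name)  # "natural language" mined from the code itself

        # Hypothetical encoder input: the three views joined by a separator token.
        encoder_input = code_tokens + ["<sep>"] + ast_tokens + ["<sep>"] + nl_tokens
        # Hypothetical decoder target: generate the code-derived description,
        # so no bilingual (code, comment) corpus is required.
        return encoder_input, nl_tokens

    if __name__ == "__main__":
        example = "def get_user_name(user):\n    return user.name\n"
        enc, dec = build_pretraining_example(example)
        print(enc)
        print(dec)

In a full pre-training pipeline, many such examples would be drawn from a large monolingual code corpus and fed to an encoder-decoder Transformer; the sketch only shows how the three information sources named in the abstract can be derived from the code alone.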


Published in

ICSE '22: Proceedings of the 44th International Conference on Software Engineering
May 2022, 2508 pages
ISBN: 9781450392211
DOI: 10.1145/3510003
Copyright © 2022 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 5 July 2022


Acceptance Rates

Overall acceptance rate: 276 of 1,856 submissions, 15%
