Abstract
Source code representation learning is the basis for applying artificial intelligence to many software engineering tasks, such as code clone detection, algorithm classification, and code summarization. Recently, many works have tried to improve source code representations from various perspectives, e.g., by introducing the structural information of programs into the latent representation. However, two issues remain when dealing with the rapidly expanding unlabeled cross-language source code datasets available on the Internet. First, deep learning models for many code-specific tasks still suffer from a lack of high-quality labels. Second, the structural differences among programming languages make it more difficult to process multiple languages in a single neural architecture.
To address these issues, in this article, we propose a novel
Index Terms
- XCode: Towards Cross-Language Code Representation with Large-Scale Pre-Training