Abstract
Partial code usually involves non-fully-qualified type names (non-FQNs) and undeclared receiving objects. Resolving the FQNs of these non-FQN types and undeclared receiving objects (referred to as type inference) is a prerequisite to effective search and reuse of partial code. Existing dictionary-lookup-based methods build a symbolic knowledge base of API names and code contexts; constructing such a knowledge base involves significant compilation overhead, and lookup is sensitive to unseen API names and code-context variations. In this article, we propose using a prompt-tuned language model of code as a neural knowledge base for type inference.
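To make the task concrete, here is a minimal Java illustration (our own sketch, not an example taken from the article; the class name `FqnInferenceExample` and the sample input are ours). In the partial snippet quoted in the comments, `Pattern` and `Matcher` are non-FQN type names and `sb` is an undeclared receiving object; once type inference resolves them to `java.util.regex.Pattern`, `java.util.regex.Matcher`, and `java.lang.StringBuilder`, the snippet can be completed into compilable code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Partial snippet as it might appear on Stack Overflow:
//
//     Matcher m = Pattern.compile("[0-9]+").matcher(text);
//     if (m.find()) { sb.append(m.group()); }
//
// "Pattern" and "Matcher" are non-FQNs, and "sb" is an undeclared
// receiving object. Resolving Pattern -> java.util.regex.Pattern,
// Matcher -> java.util.regex.Matcher, and sb -> java.lang.StringBuilder
// yields the compilable completion below.
public class FqnInferenceExample {
    public static void main(String[] args) {
        String text = "order 42";
        StringBuilder sb = new StringBuilder(); // inferred receiver type
        Matcher m = Pattern.compile("[0-9]+").matcher(text);
        if (m.find()) {
            sb.append(m.group());
        }
        System.out.println(sb); // prints: 42
    }
}
```

A neural approach in the spirit of the title would obtain such resolutions by prompting a code language model to fill in the missing FQNs, rather than by compiling code against a symbolic dictionary.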
Index Terms
- FQN Inference in Partial Code by Prompt-tuned Language Model of Code