ABSTRACT
Studies have substantiated the efficacy of deep learning-based models across a variety of source code modeling tasks. These models are typically trained on large corpora that are split into smaller units, known as tokens, using either an open- or closed-vocabulary system. The choice of tokenization method strongly affects the number of tokens generated, which in turn can significantly influence model performance. This study investigates the effect of different tokenization methods on source code modeling and proposes an optimized tokenizer to improve tokenization performance. The proposed tokenizer takes a hybrid approach: it initializes a global vocabulary from the most frequent unigrams and then incrementally builds an open-vocabulary system on top of it. We evaluate the proposed tokenizer against popular tokenization methods, namely Closed, Unigram, WordPiece, and BPE tokenizers, as well as the tokenizers shipped with large pre-trained models such as PolyCoder and CodeGen. The results confirm that the choice of tokenization method can significantly affect the number of sub-tokens generated and, ultimately, a model's performance. Furthermore, our empirical evaluation demonstrates that the proposed tokenizer outperforms the baselines, producing fewer sub-tokens at a lower time cost. In conclusion, this study highlights the importance of the tokenization method in source code modeling and the potential for improvement through optimized tokenization.
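The abstract does not include the tokenizer's implementation, but the hybrid idea can be illustrated. The following is a minimal Python sketch, assuming whitespace-separated code tokens, a seed vocabulary of the most frequent whole-token unigrams, and a greedy longest-match fallback for unseen tokens; the function names (`build_seed_vocab`, `tokenize`) and the fallback strategy are illustrative, not taken from the paper, whose tokenizer may instead use learned subword merges.

```python
from collections import Counter

def build_seed_vocab(corpus, top_k=1000):
    # Count whole-token (unigram) frequencies over the training corpus and
    # keep the top_k most frequent tokens as the initial global vocabulary.
    counts = Counter(tok for line in corpus for tok in line.split())
    return {tok for tok, _ in counts.most_common(top_k)}

def tokenize(token, vocab):
    # Closed-vocabulary fast path: a known token is emitted whole.
    if token in vocab:
        return [token]
    # Open-vocabulary fallback: greedy longest-match segmentation, degrading
    # to single characters so no token is ever mapped to an <unk> symbol.
    pieces, start = [], 0
    while start < len(token):
        end = len(token)
        while end - start > 1 and token[start:end] not in vocab:
            end -= 1
        pieces.append(token[start:end])
        start = end
    return pieces

# Usage: frequent identifiers stay intact; rare ones split into sub-tokens.
corpus = ["def add ( a , b ) :", "return add"]
vocab = build_seed_vocab(corpus, top_k=3)
print(tokenize("addition", vocab))  # ['add', 'i', 't', 'i', 'o', 'n']
```

Keeping frequent unigrams whole is what reduces the sub-token count relative to pure subword schemes such as BPE, while the character-level fallback preserves the open-vocabulary guarantee that any identifier can be encoded.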