Abstract
Source code is an intermediary through which humans communicate with computer systems. It contains a large amount of domain knowledge that statistical models can learn, and this knowledge can then be used to build software engineering tools. We find that the functionality of source code depends on the programming-language-specific tokens, which build its base structure, while identifiers carry natural language information. On this basis, we find that the knowledge in source code can be learned more fully when the code is modeled bimodally. This paper presents the bimodal composition language model (BCLM) for source code modeling and representation. We analyze the effectiveness of bimodal modeling, and the results show that the bimodal approach has great potential for source code modeling and program comprehension.
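The abstract does not specify how BCLM separates the two modalities; the sketch below is only a minimal illustration of the underlying idea, assuming a split between language-specific tokens (keywords, operators, punctuation) and identifiers. The helper `split_modalities` is hypothetical and is not the authors' implementation.

```python
import io
import keyword
import tokenize

def split_modalities(source: str):
    """Split Python source into two token streams: language-specific
    tokens (keywords, operators, punctuation) that carry structure,
    and identifiers that carry natural language information.
    Illustrative only; BCLM's actual tokenization may differ."""
    structural, natural = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            natural.append(tok.string)     # identifier: natural language channel
        elif tok.type in (tokenize.NAME, tokenize.OP):
            structural.append(tok.string)  # keyword/operator: structural channel
    return structural, natural

code = "def binary_search(items, target):\n    return target in items\n"
pl_tokens, id_tokens = split_modalities(code)
print(pl_tokens)  # ['def', '(', ',', ')', ':', 'return', 'in']
print(id_tokens)  # ['binary_search', 'items', 'target', 'target', 'items']
```

On this view, the structural stream alone already determines what the function does, while the identifier stream tells a human reader what it is about, which is what motivates modeling the two channels separately.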
Data availability
This article uses the publicly available dataset CodeSearchNet, which can be accessed at the following link: https://github.com/github/CodeSearchNet. Apart from this, no private dataset is used as evaluation data in this paper.
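For readers who want to reproduce the setup, the snippet below shows one way to stream code/docstring pairs from the dataset. It assumes the jsonl.gz layout and field names (`code`, `docstring`) of the released CodeSearchNet archives; the shard path is illustrative.

```python
import gzip
import json

def load_codesearchnet(path: str):
    """Yield (code, docstring) pairs from one CodeSearchNet jsonl.gz shard.
    Assumes the field names used in the released archives; adjust if
    your copy of the data differs."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["code"], record["docstring"]

# Example: iterate over one training shard of the Python subset
# (path is illustrative; see the repository's download instructions).
for code, doc in load_codesearchnet("python/final/jsonl/train/python_train_0.jsonl.gz"):
    print(doc.splitlines()[0] if doc else "<no docstring>")
    break
```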
Funding
This work is partially supported by grants from the National Natural Science Foundation of China (Nos. 62076046 and 62006130) and the Inner Mongolia Science Foundation (No. 2022MS06028). This work is also supported by the National & Local Joint Engineering Research Center of Intelligent Information Processing Technology for Mongolian, and by the 2022 Basic Scientific Research Fund for Colleges and Universities Directly Under Inner Mongolia.
Ethics declarations
Conflict of interest
All authors declare that they have no conflicts of interest.
About this article
Cite this article
Wen, D., Zhang, X., Diao, Y. et al. Modeling source code in bimodal for program comprehension. Neural Comput & Applic 36, 13815–13832 (2024). https://doi.org/10.1007/s00521-024-09498-0