Accurate module name prediction using similarity based and sequence generation models

Rai, Sawan; Belwal, Ramesh Chandra; Gupta, Atul

doi:10.1007/s12652-022-03722-2

Original Research
Published: 02 February 2022

Volume 14, pages 11531–11543, (2023)

Journal of Ambient Intelligence and Humanized Computing

Abstract

Understanding software code depends strongly on identifier names; developers therefore spend a lot of time choosing appropriate names for variables, functions, classes, and files. Coming up with a useful name by hand is time-consuming and difficult. Various techniques have been proposed for automatic identifier name recommendation, but most of that work targets method and class names. Module names play an important role when software libraries are reused to develop new source code. A good module name communicates purpose, while an inappropriate name creates ambiguity and frustration for the developer. To the best of our knowledge, no prior work addresses module name suggestion or the analysis of module names. In this paper, we focus on module names and propose a module name prediction approach. First, we extract module files from online Python projects to create a corpus. Next, we apply preprocessing steps to prepare the data for the prediction models. We construct four similarity-based models and three sequence generation models. The sequence generation models can predict the tokens of a module name in sequence, while the similarity-based models can only suggest module names already stored in the corpus. Experimental results indicate that the TF-IDF model performed best among all the models, followed by the three sequence generation models.
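
The similarity-based idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the preprocessing steps, the TF-IDF weighting, or the similarity measure, so the subtoken splitting, the smoothed IDF formula, and cosine similarity used here are all assumptions, and the toy corpus and every function name (`tokenize`, `weight`, `build_index`, `suggest`) are hypothetical. The sketch only shows why such a model can return nothing but names that are already stored.

```python
import math
import re
from collections import Counter


def tokenize(source):
    """Split source text into lowercase subtokens (snake_case and camelCase aware)."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+", source):
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", word)
        tokens.extend(part.lower() for part in parts)
    return tokens


def weight(counts, idf):
    """Turn raw token counts into an L2-normalised TF-IDF vector (a dict)."""
    vec = {tok: tf * idf[tok] for tok, tf in counts.items() if tok in idf}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {tok: v / norm for tok, v in vec.items()} if norm else {}


def build_index(corpus):
    """corpus maps each stored module name to the source text of that module."""
    counts = {name: Counter(tokenize(text)) for name, text in corpus.items()}
    doc_freq = Counter(tok for c in counts.values() for tok in c)
    n_docs = len(counts)
    # Smoothed IDF: an assumption, the paper's exact weighting is not given here.
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
           for tok, df in doc_freq.items()}
    vectors = {name: weight(c, idf) for name, c in counts.items()}
    return idf, vectors


def suggest(source, idf, vectors, top_k=3):
    """Rank stored module names by cosine similarity to the query source."""
    query = weight(Counter(tokenize(source)), idf)
    scores = {
        name: sum(w * vec.get(tok, 0.0) for tok, w in query.items())
        for name, vec in vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy corpus: module name -> module source text.
CORPUS = {
    "csv_reader": "def read_csv(path): rows = open(path).read().splitlines(); "
                  "return [r.split(',') for r in rows]",
    "http_client": "def get(url): response = send_request(url); "
                   "return response.body",
    "string_utils": "def capitalize_words(text): "
                    "return ' '.join(w.capitalize() for w in text.split())",
}

if __name__ == "__main__":
    idf, vectors = build_index(CORPUS)
    unnamed = "def load_csv_rows(path): return [line.split(',') for line in open(path)]"
    print(suggest(unnamed, idf, vectors))
```

Because `suggest` ranks only the keys of the stored corpus, it can never propose a name it has not seen. A sequence generation model, by contrast, emits name tokens one at a time and can compose new names.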

Author information

Authors and Affiliations

PDPM Indian Institute of Information Technology, Design and Manufacturing Jabalpur, Jabalpur, 482005, India
Sawan Rai, Ramesh Chandra Belwal & Atul Gupta

Corresponding author

Correspondence to Sawan Rai.

About this article

Cite this article

Rai, S., Belwal, R.C. & Gupta, A. Accurate module name prediction using similarity based and sequence generation models. J Ambient Intell Human Comput 14, 11531–11543 (2023). https://doi.org/10.1007/s12652-022-03722-2

Received: 28 December 2020
Accepted: 19 January 2022
Published: 02 February 2022
Issue Date: September 2023
DOI: https://doi.org/10.1007/s12652-022-03722-2
