Abstract
Understanding source code depends strongly on identifier names; software developers therefore spend considerable time choosing appropriate names for variables, functions, classes, and files. Choosing a useful name manually is time-consuming and difficult. Various techniques have been proposed for automatic identifier name recommendation, but most of this work targets method and class name prediction. Module names play an important role when software libraries are reused to develop new source code. A good module name communicates purpose, while an inappropriate name creates ambiguity and frustrates the developer. To the best of our knowledge, no prior work addresses module name suggestion or the analysis of module names. In this paper, we focus on module names and propose a module name prediction approach. First, we extract module files from online Python projects to create a corpus. Next, we apply preprocessing steps to prepare the data for the prediction models. We construct four similarity-based models and three sequence generation models. The sequence generation models predict a module name as a sequence of tokens, while the similarity-based models can only suggest previously stored module names. Experimental results indicate that the TF-IDF model performed best among all the models, followed by the three sequence generation models.
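The similarity-based idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the tokenizer, the smoothed IDF formula, and the function names (`tokenize`, `build_index`, `weigh`, `suggest`) are all assumptions chosen for illustration. The sketch represents each stored module by a TF-IDF vector of its source tokens and suggests the names of the stored modules most similar to a new, unnamed file.

```python
import math
import re
from collections import Counter


def tokenize(source):
    """Split text into lowercase tokens, breaking snake_case and camelCase."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+", source):
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", word)
        tokens.extend(p.lower() for p in parts)
    return tokens


def weigh(counts, idf):
    """Turn raw token counts into an L2-normalised TF-IDF vector (a dict)."""
    vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def build_index(corpus):
    """corpus maps a module name to its source text (hypothetical toy data)."""
    docs = {name: Counter(tokenize(text)) for name, text in corpus.items()}
    n = len(docs)
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())
    # Smoothed IDF; an assumption, other variants are equally plausible.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = {name: weigh(counts, idf) for name, counts in docs.items()}
    return idf, vectors


def suggest(query_source, idf, vectors, k=3):
    """Return the k stored module names with the highest cosine similarity."""
    q = weigh(Counter(tokenize(query_source)), idf)
    scores = {
        name: sum(w * vec.get(t, 0.0) for t, w in q.items())
        for name, vec in vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical corpus: module name -> simplified module content.
corpus = {
    "csv_reader": "def read_csv(path): open file parse rows delimiter csv",
    "http_client": "def get(url): send http request return response status",
    "json_utils": "def load_json(text): parse json decode dict",
}
idf, vectors = build_index(corpus)
print(suggest("def parse_csv_rows(file): split rows by delimiter", idf, vectors, k=1))
```

Because the model only ranks names already present in the index, it cannot produce a name it has never seen, which is the limitation the abstract contrasts with sequence generation models.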
About this article
Cite this article
Rai, S., Belwal, R.C. & Gupta, A. Accurate module name prediction using similarity based and sequence generation models. J Ambient Intell Human Comput 14, 11531–11543 (2023). https://doi.org/10.1007/s12652-022-03722-2
DOI: https://doi.org/10.1007/s12652-022-03722-2