Accurate module name prediction using similarity based and sequence generation models

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Understanding software code depends strongly on identifier names; developers therefore spend considerable time choosing appropriate names for variables, functions, classes, and files. Suggesting a useful name manually is time-consuming and difficult. Various techniques have been proposed for automatic identifier name recommendation, but most of this work targets method and class name prediction. Module names play an important role when software libraries are reused to develop new source code. A good module name communicates purpose, while an inappropriate name creates ambiguity and frustration for the developer. To the best of our knowledge, no prior work addresses the suggestion or analysis of module names. In this paper, we focus on module names and propose an approach for module name prediction. First, we extract module files from online Python projects to create a corpus. Next, we apply preprocessing steps to prepare the data for the prediction models. We construct four similarity-based models and three sequence generation models. The sequence generation models predict a module name as a sequence of tokens, whereas the similarity-based models can only suggest previously stored module names. Experimental results indicate that the TF-IDF model performed best among all the models, followed by the three sequence generation models.
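
To make the similarity-based idea concrete, the following is a minimal sketch of TF-IDF retrieval over module files. It is an illustration under stated assumptions, not the pipeline used in the paper: the abstract does not say which features or preprocessing steps the authors use, so the choice of identifier subtokens as features, the smoothed IDF formula, the toy corpus, and the names `identifier_tokens`, `tfidf_vectors`, and `suggest_name` are all hypothetical. Only the Python standard library is used.

```python
import ast
import math
import re
from collections import Counter


def identifier_tokens(source):
    """Return lowercase subtokens of the identifiers found in a module's source."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.append(node.name)
        elif isinstance(node, ast.Name):
            names.append(node.id)
    tokens = []
    for name in names:
        # Split snake_case and camelCase names, keeping acronyms such as HTTP whole.
        for part in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", name):
            tokens.append(part.lower())
    return tokens


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for tokenized documents, plus the IDF table."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumption of this sketch); avoids division by zero.
    idf = {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}
    vectors = [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_name(query_source, corpus):
    """Return the stored module name whose source is most similar to the query."""
    names = list(corpus)
    vectors, idf = tfidf_vectors([identifier_tokens(corpus[name]) for name in names])
    counts = Counter(identifier_tokens(query_source))
    # Tokens unseen in the corpus get zero weight.
    query = {tok: tf * idf.get(tok, 0.0) for tok, tf in counts.items()}
    scores = [cosine(query, vec) for vec in vectors]
    return names[max(range(len(names)), key=scores.__getitem__)]


# Toy corpus (hypothetical): module name -> module source.
CORPUS = {
    "json_reader": "def read_json(path):\n    data = load_file(path)\n    return parse_json(data)\n",
    "image_resize": "def resize_image(image, width, height):\n    return image.scale(width, height)\n",
}
QUERY = "def load_json(file_path):\n    return parse_json(open_file(file_path))\n"
```

Calling `suggest_name(QUERY, CORPUS)` can only return a key of `CORPUS`, which illustrates the limitation noted in the abstract: a retrieval model of this kind suggests stored names, whereas a sequence generation model can compose a new name token by token.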

Notes

  1. https://docs.python.org/3/tutorial/modules.html.

  2. https://docs.python.org/3/library/ast.html.

  3. https://scikit-learn.org/0.16/index.html.

  4. https://radimrehurek.com/gensim/index.html.

  5. https://github.com/mast-group/convolutional-attention.

  6. https://pytorch.org/.

  7. https://colab.research.google.com.

Author information

Corresponding author

Correspondence to Sawan Rai.

About this article

Cite this article

Rai, S., Belwal, R.C. & Gupta, A. Accurate module name prediction using similarity based and sequence generation models. J Ambient Intell Human Comput 14, 11531–11543 (2023). https://doi.org/10.1007/s12652-022-03722-2

  • DOI: https://doi.org/10.1007/s12652-022-03722-2
