An experimental investigation on the effects of context on source code identifiers splitting and expansion

Empirical Software Engineering

Abstract

Recent and past studies indicate that the source code lexicon plays an important role in program comprehension. Developers often compose source code identifiers with abbreviated words and acronyms, and do not always use consistent mechanisms and explicit separators when creating identifiers. Such choices and inconsistencies impede the work of developers who must understand identifiers by decomposing them into their component terms and mapping them onto dictionary, application or domain words. When software documentation is scarce, outdated or simply not available, developers must therefore use the available contextual information to understand the source code. This paper investigates how developers split and expand source code identifiers and, specifically, the extent to which different kinds of contextual information could support such a task. In particular, we consider (i) an internal context consisting of the content of functions and source code files in which the identifiers are located, and (ii) an external context involving external documentation. We conducted a family of two experiments with 63 participants, including bachelor's, master's, and Ph.D. students as well as post-docs. We randomly sampled a set of 50 identifiers from a corpus of open source C programs and asked participants to split and expand them with or without access to internal and external contexts. We report evidence on the usefulness of contextual information for identifier splitting and acronym/abbreviation expansion. We observe that source code files are more helpful than function source code alone, and that application-level contextual information does not help any further. The availability of external sources of information only helps in some circumstances. Also, in some cases, we observe that participants expanded acronyms better than abbreviations, although in most cases both exhibit the same level of accuracy. Finally, results indicate that knowledge of English has a significant effect on identifier splitting/expansion. The obtained results confirm the conjecture that contextual information is useful in program comprehension, including when developers split and expand identifiers to understand them. We hypothesize that integrating identifier splitting and expansion tools into IDEs could help to improve developers’ productivity.

Notes

  1. http://www.gnu.org

  2. http://www.linux.org

  3. http://www.freebsd.org

  4. http://www.acronymfinder.com

  5. Significant p-values are highlighted in bold face here and in all other tables.

Acknowledgements

Special thanks to all the participants, as this work would not be possible without your time. Many thanks also to all the reviewers for their thorough and well-considered reviews.

Author information

Corresponding author

Correspondence to Latifa Guerrouj.

Additional information

Communicated by: Martin Robillard

Appendix

A Detailed Study Settings

Table 20 reports the characteristics of the 34 applications from which the 50 identifiers used in our study were sampled.

Table 20 Applications from which the 50 identifiers were sampled

Table 21 reports the expansions of all the identifiers used in the experiments. The column Separator indicates whether underscore or Camel Case separators are used. The columns Abbr., Acro. and Plain report the number of abbreviations, acronyms and plain English words composing each identifier.

Table 21 Splitting/expansion oracle and kinds of terms composing identifiers
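To make the two separator styles concrete, the following is a minimal sketch of separator-based splitting. It is not the procedure or tooling used in the study: the function name `split_identifier`, the regular expression, and the example identifiers are all assumptions introduced for illustration.

```python
from __future__ import annotations

import re

# Hypothetical helper, not taken from the paper: splits an identifier on
# underscores and on Camel Case boundaries, then lowercases each term.
_TERM = re.compile(
    r"[A-Z]+(?=[A-Z][a-z])"  # capital run followed by a capitalized word ("XML" in "XMLParser")
    r"|[A-Z]?[a-z]+"         # capitalized or lowercase word
    r"|[A-Z]+"               # remaining run of capitals
    r"|\d+"                  # run of digits
)


def split_identifier(identifier: str) -> list[str]:
    terms: list[str] = []
    for chunk in identifier.split("_"):  # underscore separators
        terms.extend(_TERM.findall(chunk))  # Camel Case boundaries
    return [term.lower() for term in terms]


print(split_identifier("max_buf_size"))  # ['max', 'buf', 'size']
print(split_identifier("getXMLParser"))  # ['get', 'xml', 'parser']
```

A splitter of this kind only works when explicit separators are present. An identifier written without them stays a single term, and abbreviations such as `buf` are not expanded, which is where the contextual information studied here comes into play.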

B Detailed Results

This Appendix reports figures detailing results presented and discussed in Section 3. Specifically, Figs. 7 and 8 show boxplots of Precision and Recall for the different levels of context, respectively.
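The exact definitions of Precision and Recall are given in the body of the paper and are not reproduced here. As a rough sketch, assuming the usual set-based information retrieval definitions applied to the terms of one identifier, the two measures could be computed as follows. The function name `precision_recall` and the example answer and oracle are hypothetical.

```python
from __future__ import annotations


def precision_recall(answer: list[str], oracle: list[str]) -> tuple[float, float]:
    """Set-based precision and recall of a participant's terms against the oracle.

    Assumption: terms are compared as exact strings and duplicates are ignored.
    """
    answer_terms, oracle_terms = set(answer), set(oracle)
    correct = answer_terms & oracle_terms
    precision = len(correct) / len(answer_terms) if answer_terms else 0.0
    recall = len(correct) / len(oracle_terms) if oracle_terms else 0.0
    return precision, recall


# Hypothetical oracle for an identifier such as "cntr_ptr".
oracle = ["counter", "pointer"]
print(precision_recall(["center", "pointer"], oracle))  # (0.5, 0.5)
print(precision_recall(["pointer"], oracle))            # (1.0, 0.5)
```

Under these assumptions, a participant who proposes only terms they are sure of can reach high precision with low recall, which is why the two measures are reported separately.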

Fig. 7 Boxplots of precision for different context levels (AF = Acronym Finder)

Fig. 8 Boxplots of recall for different context levels (AF = Acronym Finder)

About this article

Cite this article

Guerrouj, L., Di Penta, M., Guéhéneuc, YG. et al. An experimental investigation on the effects of context on source code identifiers splitting and expansion. Empir Software Eng 19, 1706–1753 (2014). https://doi.org/10.1007/s10664-013-9260-1

  • DOI: https://doi.org/10.1007/s10664-013-9260-1