An empirical study of identifier splitting techniques

Empirical Software Engineering

Abstract

Researchers have shown that program analyses that drive software development and maintenance tools supporting search, traceability and other tasks can benefit from leveraging the natural language information found in identifiers and comments. Accurate natural language information depends on correctly splitting the identifiers into their component words and abbreviations. While conventions such as camel-casing can ease this task, conventions are not well-defined in certain situations and may be modified to improve readability, thus making automatic splitting more challenging. This paper describes an empirical study of state-of-the-art identifier splitting techniques and the construction of a publicly available oracle to evaluate identifier splitting algorithms. In addition to comparing current approaches, the results help to guide future development and evaluation of improved identifier splitting approaches.

Notes

  1. Annotator experience ranged from second-year students to practicing professionals with almost fifty years of experience. The average experience was 13.1 years, the median 7.0 years, and the standard deviation 12.8 years.

  2. Information concerning all of these splitters as well as how each split the identifiers in the oracle can be found in the replication package at www.cs.loyola.edu/~lawrie/id-splitting-data.

Acknowledgements

Special thanks to all the participants, as this work would not have been possible without your time, and to Chris Morrell for help with the statistics. Support for this work was provided by NSF grant CCF 0916081.

Author information

Corresponding author

Correspondence to Emily Hill.

Additional information

Communicated by: Giulio Antoniol

Appendix: Instructions to Annotators

The annotators were given the following minimal instructions when asked to provide the oracle split of each identifier:

What: Please split some program identifiers into atomic units by adding spaces. We consider atomic units to be individual words or abbreviations. Some splits are easily recognized from artifacts in the identifier. Those splits will be automatically inserted. Here are some examples:

  • “theblueHouse” → “the blue House”

  • “FDARequirement” → “FDA Requirement”

  • “unparse_voidptr” → “unparse void ptr”

Some are easy. Some are hard. So let us know when you guess.

Purpose: We are developing algorithms to automatically determine the most likely splits of program identifiers. An automatic identifier splitter is the first important step in a variety of automatic analysis of software natural language. Your splitting decisions will help to guide and evaluate our research on automatic identifier splitting. The split collection of identifiers will be made publicly available.

Disclaimer: Your identity will not be revealed.

Thanks for helping us out!

Dave, Dawn, Emily, Lori, and Vijay
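
The instructions above note that some splits are "easily recognized from artifacts in the identifier" and are inserted automatically. The following is a minimal sketch of what such a convention-based split might look like, assuming it covers only underscores (and other non-alphanumeric separators) and camel-case transitions. The function name `conservative_split`, the regular expressions, and the `ORACLE` dictionary are illustrative assumptions, not the tooling used in the study; the three identifiers and their expected splits are the examples quoted in the instructions.

```python
import re

# Camel-case boundaries (an assumed definition of "artifacts"):
#   1. a lowercase letter or digit followed by an uppercase letter (blueHouse)
#   2. the end of an acronym followed by a capitalized word (FDARequirement)
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def conservative_split(identifier: str) -> list[str]:
    """Split only where naming conventions make the boundary explicit."""
    parts = []
    # First break on underscores and any other non-alphanumeric separators.
    for chunk in re.split(r"[_\W]+", identifier):
        if chunk:
            # Then break each chunk at camel-case transitions.
            parts.extend(_CAMEL_BOUNDARY.split(chunk))
    return parts


# The three example identifiers from the annotator instructions,
# paired with the splits shown there.
ORACLE = {
    "theblueHouse": "the blue House",
    "FDARequirement": "FDA Requirement",
    "unparse_voidptr": "unparse void ptr",
}

for ident, expected in ORACLE.items():
    got = " ".join(conservative_split(ident))
    print(f"{ident}: '{got}' (matches example: {got == expected})")
```

Under these assumptions, only "FDARequirement" is fully split by conventions alone. "theblueHouse" and "unparse_voidptr" still contain the same-case runs "theblue" and "voidptr", which is the kind of boundary that needs human judgment or a more sophisticated splitting technique.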

About this article

Cite this article

Hill, E., Binkley, D., Lawrie, D. et al. An empirical study of identifier splitting techniques. Empir Software Eng 19, 1754–1780 (2014). https://doi.org/10.1007/s10664-013-9261-0

  • DOI: https://doi.org/10.1007/s10664-013-9261-0
