Configuring latent Dirichlet allocation based feature location

Empirical Software Engineering

Abstract

Feature location is a program comprehension activity, the goal of which is to identify source code entities that implement a functionality. Recent feature location techniques apply text retrieval models such as latent Dirichlet allocation (LDA) to corpora built from text embedded in source code. These techniques are highly configurable, and the literature offers little insight into how different configurations affect their performance. In this paper we present a study of an LDA based feature location technique (FLT) in which we measure the performance effects of using different configurations to index corpora and to retrieve 618 features from 6 open source Java systems. In particular, we measure the effects of the query, the text extractor configuration, and the LDA parameter values on the accuracy of the LDA based FLT. Our key findings are that exclusion of comments and literals from the corpus lowers accuracy and that heuristics for selecting LDA parameter values in the natural language context are suboptimal in the source code context. Based on the results of our case study, we offer specific recommendations for configuring the LDA based FLT.
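To make the "text extractor configuration" concrete, the following is a minimal sketch of an extractor that can include or exclude comments and string literals when turning a Java method into a bag of terms. It is an illustration only, not the tool used in the study: the names `ExtractorConfig` and `extract_terms` are hypothetical, the regular expressions are a simplification of Java lexing (character literals and text blocks are ignored), and the stop word removal and stemming that a full pipeline would typically apply are omitted.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractorConfig:
    """Hypothetical switches for what text enters the corpus."""
    include_comments: bool = True
    include_literals: bool = True


# One pass over the source: whichever lexeme starts first wins, so a "//"
# inside a string literal is not mistaken for a comment.
LEXEME = re.compile(
    r'(?P<comment>/\*.*?\*/|//[^\n]*)|(?P<literal>"(?:\\.|[^"\\])*")',
    re.DOTALL,
)
WORD = re.compile(r"[A-Za-z]+")
# Splits camelCase and keeps acronyms together: parseXMLFile -> parse, XML, File.
CAMEL = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def extract_terms(method_source, config):
    """Return lowercase terms from identifiers, plus comments and literals if enabled."""
    kept = []

    def strip(match):
        kind = match.lastgroup  # "comment" or "literal"
        if (kind == "comment" and config.include_comments) or (
            kind == "literal" and config.include_literals
        ):
            kept.append(match.group())
        return " "  # comments and literals are always removed from the code text

    code = LEXEME.sub(strip, method_source)
    terms = []
    for part in [code] + kept:
        for word in WORD.findall(part):
            terms.extend(piece.lower() for piece in CAMEL.findall(word))
    return terms


sample = (
    "/* Save the open file. */\n"
    "void saveFile(String path) {\n"
    '    log("writing buffer"); // flush to disk\n'
    "}\n"
)
identifiers_only = extract_terms(sample, ExtractorConfig(False, False))
everything = extract_terms(sample, ExtractorConfig(True, True))
print(len(identifiers_only), len(everything))
```

Comparing the two term lists for the same method shows how much vocabulary a configuration adds or removes before the corpus is indexed, which is the kind of difference the study measures.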

Notes

  1. http://gibbslda.sourceforge.net

  2. http://cran.r-project.org/web/packages/lda

  3. http://software.eng.ua.edu/data/lda-feature-location

  4. http://argouml.tigris.org

  5. http://jabref.sourceforge.net

  6. http://jedit.org

  7. http://mucommander.com

  8. http://eclipse.org/mylyn

  9. http://mozilla.org/rhino

  10. http://www.cs.columbia.edu/~eaddy/concerntagger/

  11. http://www.cs.wm.edu/semeru/data/feature-location-survey/

  12. http://software.eng.ua.edu/data/lda-feature-location

  13. http://antlr.org/grammar/1152141644268/Java.g

  14. In Java a method may contain a class (which may contain a method).

  15. http://tartarus.org/~martin/PorterStemmer/python.txt

  16. http://argouml.tigris.org/issues/show_bug.cgi?id=4019

  17. http://sf.net/tracker/?func=detail&aid=2842444&group_id=588&atid=300588

  18. http://trac.mucommander.com/ticket/311

  19. https://bugzilla.mozilla.org/show_bug.cgi?id=352319

  20. https://bugs.eclipse.org/bugs/show_bug.cgi?id=151257

  21. http://mallet.cs.umass.edu

  22. http://gibbslda.sourceforge.net

Acknowledgements

We thank the anonymous reviewers for their insightful comments and helpful suggestions. This material is based upon work supported by the National Science Foundation under Grant Nos. 0851824, 0915559, and 1156563 and by the U.S. Department of Education under Grant No. P200A100182.

Author information

Corresponding author

Correspondence to Nicholas A. Kraft.

Additional information

Editor: Andrian Marcus

About this article

Cite this article

Biggers, L.R., Bocovich, C., Capshaw, R. et al. Configuring latent Dirichlet allocation based feature location. Empir Software Eng 19, 465–500 (2014). https://doi.org/10.1007/s10664-012-9224-x

  • DOI: https://doi.org/10.1007/s10664-012-9224-x
