Configuring latent Dirichlet allocation based feature location

Biggers, Lauren R.; Bocovich, Cecylia; Capshaw, Riley; Eddy, Brian P.; Etzkorn, Letha H.; Kraft, Nicholas A.

doi:10.1007/s10664-012-9224-x

Published: 29 August 2012

Empirical Software Engineering, volume 19, pages 465–500 (2014)

Lauren R. Biggers¹,
Cecylia Bocovich²,
Riley Capshaw³,
Brian P. Eddy¹,
Letha H. Etzkorn⁴ &
Nicholas A. Kraft¹

Abstract

Feature location is a program comprehension activity, the goal of which is to identify source code entities that implement a functionality. Recent feature location techniques apply text retrieval models such as latent Dirichlet allocation (LDA) to corpora built from text embedded in source code. These techniques are highly configurable, and the literature offers little insight into how different configurations affect their performance. In this paper we present a study of an LDA based feature location technique (FLT) in which we measure the performance effects of using different configurations to index corpora and to retrieve 618 features from 6 open source Java systems. In particular, we measure the effects of the query, the text extractor configuration, and the LDA parameter values on the accuracy of the LDA based FLT. Our key findings are that exclusion of comments and literals from the corpus lowers accuracy and that heuristics for selecting LDA parameter values in the natural language context are suboptimal in the source code context. Based on the results of our case study, we offer specific recommendations for configuring the LDA based FLT.
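The configuration dimensions named in the abstract (which embedded text the extractor keeps, and which LDA parameter values are used) can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `ExtractorConfig`, `extract_terms`, and `configuration_grid`, the regular expressions, and the parameter values are hypothetical and are not the study's tooling or settings. The extractor is a rough lexical approximation, not a Java parser, and it does not train an LDA model.

```python
import re
from dataclasses import dataclass
from itertools import product

# Rough lexical patterns for Java-like text (assumption: no real parsing,
# so a "//" inside a string literal would be misread as a comment).
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.S)
LITERAL_RE = re.compile(r'"(?:\\.|[^"\\])*"')
# Splits camelCase and ALLCAPS runs into word parts.
WORD_PART_RE = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])")


@dataclass(frozen=True)
class ExtractorConfig:
    """Hypothetical switches for the text kept in a method's document."""
    identifiers: bool = True
    comments: bool = True
    literals: bool = True


def extract_terms(source, cfg):
    """Return lowercased terms from the text kinds enabled in cfg."""
    comments = COMMENT_RE.findall(source)
    code = COMMENT_RE.sub(" ", source)
    literals = LITERAL_RE.findall(code)
    code = LITERAL_RE.sub(" ", code)  # what remains: identifiers and keywords
    parts = []
    if cfg.identifiers:
        parts.append(code)
    if cfg.comments:
        parts.extend(comments)
    if cfg.literals:
        parts.extend(literals)
    return [t.lower() for t in WORD_PART_RE.findall(" ".join(parts))]


def configuration_grid(topic_counts, alphas, betas):
    """Cross every comment/literal choice with every LDA parameter triple."""
    extractors = [
        ExtractorConfig(comments=c, literals=l)
        for c, l in product([True, False], repeat=2)
    ]
    return [
        {"extractor": e, "topics": k, "alpha": a, "beta": b}
        for e, k, a, b in product(extractors, topic_counts, alphas, betas)
    ]


# Hypothetical method text and illustrative parameter values.
method = '''
// Save the document to disk
void saveFile(String path) { log("Saving file"); }
'''
full = extract_terms(method, ExtractorConfig())
ids_only = extract_terms(method, ExtractorConfig(comments=False, literals=False))
grid = configuration_grid(topic_counts=[50, 100], alphas=[0.1, 1.0], betas=[0.01])
```

In this toy input the term "document" appears only in the comment, so a query containing it has nothing to match once comments are excluded. That is one way to picture why the extractor configuration can affect accuracy; the sketch itself measures nothing.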

Acknowledgements

We thank the anonymous reviewers for their insightful comments and helpful suggestions. This material is based upon work supported by the National Science Foundation under Grant Nos. 0851824, 0915559, and 1156563 and by the U.S. Department of Education under Grant No. P200A100182.

Author information

Cecylia Bocovich
Present address: David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada

Authors and Affiliations

Department of Computer Science, The University of Alabama, Tuscaloosa, AL, USA
Lauren R. Biggers, Brian P. Eddy & Nicholas A. Kraft
Department of Mathematics, Statistics, and Computer Science, Macalester College, Saint Paul, MN, USA
Cecylia Bocovich
Department of Mathematics & Computer Science, Hendrix College, Conway, AR, USA
Riley Capshaw
Department of Computer Science, The University of Alabama in Huntsville, Huntsville, AL, USA
Letha H. Etzkorn

Corresponding author

Correspondence to Nicholas A. Kraft.

Additional information

Editor: Andrian Marcus

About this article

Cite this article

Biggers, L.R., Bocovich, C., Capshaw, R. et al. Configuring latent Dirichlet allocation based feature location. Empir Software Eng 19, 465–500 (2014). https://doi.org/10.1007/s10664-012-9224-x

Published: 29 August 2012
Issue Date: June 2014
DOI: https://doi.org/10.1007/s10664-012-9224-x
