Integrating information retrieval, execution and link analysis algorithms to improve feature location in software

Dit, Bogdan; Revelle, Meghan; Poshyvanyk, Denys

doi:10.1007/s10664-011-9194-4

Published: 04 January 2012

Volume 18, pages 277–309, (2013)

Empirical Software Engineering

Bogdan Dit¹,
Meghan Revelle¹ &
Denys Poshyvanyk¹

Abstract

Data fusion is the process of integrating multiple sources of information so that their combination yields better results than any of the sources used individually. This paper applies the idea of data fusion to feature location, the process of identifying the source code that implements specific functionality in software. It presents a data fusion model for feature location that defines new feature location techniques by combining information from textual analysis, dynamic analysis, and web mining (link analysis) algorithms applied to software. A novel contribution of the proposed model is the use of advanced web mining algorithms to analyze execution information during feature location. The results of an extensive evaluation on three Java systems indicate that the new feature location techniques based on web mining improve the effectiveness of existing approaches by as much as 87%.
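To make the idea of fusing the three information sources concrete, the following is a minimal sketch and not the authors' implementation. It assumes that textual similarity scores, the set of executed methods, and link analysis scores have already been computed. The function name `fuse_rankings`, the parameter `drop_top`, and the choice to prune the highest-scored methods are all hypothetical; the abstract does not state how the sources are combined.

```python
def fuse_rankings(text_scores, executed, link_scores, drop_top=1):
    """Rank methods by textual score after dynamic and link-analysis filtering.

    text_scores: method name -> textual similarity to the feature query
    executed:    set of method names seen in the execution trace
    link_scores: method name -> link analysis score (e.g. a hub score)
    drop_top:    number of highest link-scored methods to prune (assumption)
    """
    # Dynamic filter: keep only methods that were actually executed.
    candidates = [m for m in text_scores if m in executed]
    # Order candidates by link analysis score, highest first.
    by_link = sorted(candidates, key=lambda m: link_scores.get(m, 0.0), reverse=True)
    # Prune the top-scored methods, assuming they are generic utilities
    # rather than feature-specific code. Pruning the bottom is equally valid.
    kept = by_link[drop_top:]
    # Final ranking: remaining methods ordered by textual similarity.
    return sorted(kept, key=lambda m: text_scores[m], reverse=True)
```

Under these assumptions, a heavily called logging method would be pruned by its link analysis score even if its text matches the query well, and a textually similar method that never executed would be excluded by the trace.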

Notes

http://www.cs.wm.edu/semeru/data/emse-link-analysis/ (verified on 05/30/2011)
The HITS algorithm does not require edge weights to be normalized, so the execution frequency values are used without normalization.
http://www.eclipse.org/ (verified on 05/30/2011)
https://bugs.eclipse.org/ (verified on 05/30/2011)
http://www.mozilla.org/rhino/ (verified on 05/30/2011)
http://www.ecmascript.org/ (verified on 05/30/2011)
http://www.cs.columbia.edu/~eaddy/concerntagger/ (verified on 05/30/2011)
http://www.jedit.org/ (verified on 05/30/2011)
The online appendix has data on the performance of each technique compared to all others.
A concern is an area of interest or focus in a system. Features can be concerns, but not all concerns are features.
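
The note above on HITS and edge weights can be illustrated with a small sketch. It is a generic weighted HITS iteration, not the authors' code: the function names, the dictionary-based graph representation, and the fixed iteration count are assumptions made for illustration. Raw execution frequencies are used directly as edge weights; only the hub and authority score vectors are rescaled after each iteration.

```python
import math


def _normalize(scores):
    """Scale a score vector to unit Euclidean length (scores, not edge weights)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def weighted_hits(edges, iterations=50):
    """Run HITS on a call graph whose edges carry raw execution frequencies.

    edges: dict mapping (caller, callee) -> execution frequency (unnormalized)
    Returns (hub, authority) dictionaries keyed by method name.
    """
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: weighted sum of the hub scores of all callers.
        new_auth = {n: 0.0 for n in nodes}
        for (src, dst), weight in edges.items():
            new_auth[dst] += weight * hub[src]
        # Hub update: weighted sum of the authority scores of all callees.
        new_hub = {n: 0.0 for n in nodes}
        for (src, dst), weight in edges.items():
            new_hub[src] += weight * new_auth[dst]
        auth = _normalize(new_auth)
        hub = _normalize(new_hub)
    return hub, auth
```

In this sketch, a method that is called frequently by a strong hub receives a high authority score, and a method that frequently calls strong authorities receives a high hub score.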

Acknowledgements

We are grateful to the anonymous EMSE and ICPC 2010 reviewers for their relevant and useful comments and suggestions, which helped us significantly improve earlier versions of this paper. This work is supported in part by NSF grants CCF-0916260 and CCF-1016868. Any opinions, findings, and conclusions expressed herein are the authors’ and do not necessarily reflect those of the sponsors.

Author information

Authors and Affiliations

The College of William and Mary, Williamsburg, VA, USA
Bogdan Dit, Meghan Revelle & Denys Poshyvanyk

Corresponding author

Correspondence to Denys Poshyvanyk.

Additional information

Editors: Keith Brian Gallagher and Giulio Antoniol

About this article

Cite this article

Dit, B., Revelle, M. & Poshyvanyk, D. Integrating information retrieval, execution and link analysis algorithms to improve feature location in software. Empir Software Eng 18, 277–309 (2013). https://doi.org/10.1007/s10664-011-9194-4

Issue Date: April 2013
DOI: https://doi.org/10.1007/s10664-011-9194-4
