research-article

Retrieval from software libraries for bug localization: a comparative study of generic and composite text models

Authors:
Shivani Rao, Purdue University, West Lafayette, USA
Avinash Kak, Purdue University, West Lafayette, USA

MSR '11: Proceedings of the 8th Working Conference on Mining Software Repositories, May 2011, Pages 43–52
https://doi.org/10.1145/1985441.1985451

Published: 21 May 2011

ABSTRACT

We compare five generic text models, and certain composite variations of them, for retrieval from large software libraries for the purpose of bug localization. The generic models are the Unigram Model (UM), the Vector Space Model (VSM), the Latent Semantic Analysis Model (LSA), the Latent Dirichlet Allocation Model (LDA), and the Cluster Based Document Model (CBDM). The task is to locate the files that are relevant to a bug reported by a software developer in the form of a textual description. Our study uses iBUGS, a benchmarked bug localization dataset with 75 KLOC and a large number of bugs (291). A major conclusion of our comparative study is that simple text models such as UM and VSM are more effective at retrieving the relevant files from a library than more sophisticated models such as LDA. We measured retrieval effectiveness using two kinds of metrics: (1) Mean Average Precision and (2) rank-based metrics. Using the SCORE metric, we also compare the retrieval effectiveness of the models in our study with that of some other bug localization tools.
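
To make the retrieval task and the Mean Average Precision metric concrete, the following is a minimal sketch of a VSM-style ranker: source files and a bug description are turned into tf-idf vectors, files are ranked by cosine similarity, and the ranking is scored with average precision. The tf-idf weighting, the whitespace tokenizer, and all function names (`rank_files`, `average_precision`, `mean_average_precision`) are illustrative assumptions; the abstract does not specify the preprocessing or weighting used in the study.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vectors, idf) for a {file_name: text} mapping.

    The smoothed idf (log(N/df) + 1) is one common choice, assumed here.
    """
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    df = Counter(term for c in counts.values() for term in c)
    idf = {term: math.log(len(docs) / df[term]) + 1.0 for term in df}
    vectors = {
        name: {t: tf * idf[t] for t, tf in c.items()}
        for name, c in counts.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_files(bug_report, docs):
    """Rank file names by similarity to the bug report, best first."""
    vectors, idf = tfidf_vectors(docs)
    q_counts = Counter(tokenize(bug_report))
    # Query terms that never occur in the library carry no weight.
    q_vec = {t: tf * idf[t] for t, tf in q_counts.items() if t in idf}
    scores = {name: cosine(q_vec, vec) for name, vec in vectors.items()}
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(scores, key=lambda name: (-scores[name], name))


def average_precision(ranking, relevant):
    """Average of precision@k over the ranks k where a relevant file appears."""
    hits, total = 0, 0.0
    for rank, name in enumerate(ranking, start=1):
        if name in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(results):
    """Mean of average precision over (ranking, relevant_set) pairs, one per bug."""
    return sum(average_precision(r, rel) for r, rel in results) / len(results)
```

In this sketch each bug contributes one `(ranking, relevant_set)` pair, where the relevant set holds the files changed to fix that bug. A unigram (language-model) ranker could be swapped in by replacing `cosine` with a smoothed query-likelihood score while leaving the evaluation functions unchanged.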

Index Terms

1. Information systems
  1. Information retrieval
    1. Document representation
2. Software and its engineering
  1. Software creation and management
    1. Software development techniques
      1. Reusability
  2. Software notations and tools
    1. Software libraries and repositories

Published in
MSR '11: Proceedings of the 8th Working Conference on Mining Software Repositories
May 2011
260 pages
ISBN: 9781450305747
DOI: 10.1145/1985441
General Chair: Arie van Deursen, Delft University of Technology, The Netherlands
Program Chairs: Tao Xie, North Carolina State University, USA; Thomas Zimmermann, Microsoft Research, USA
Copyright © 2011 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
bug localization
information retrieval
latent dirichlet allocation
latent semantic analysis
software engineering