ABSTRACT
We compare five generic text models, and certain composite variations of them, for retrieval from large software libraries for the purpose of bug localization. The generic models are the Unigram Model (UM), the Vector Space Model (VSM), the Latent Semantic Analysis Model (LSA), the Latent Dirichlet Allocation Model (LDA), and the Cluster Based Document Model (CBDM). The task is to locate the files that are relevant to a bug that a software developer has reported as a textual description. Our study uses iBUGS, a benchmark bug localization dataset with 75 KLOC and a large number of bugs (291). A major conclusion of our comparative study is that simple text models such as UM and VSM are more effective at retrieving the relevant files from a library than more sophisticated models such as LDA. We measured the retrieval effectiveness of the models using two kinds of metrics: (1) Mean Average Precision; and (2) rank-based metrics. Using the SCORE metric, we also compare the retrieval effectiveness of the models in our study with that of some other bug localization tools.
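To make the first metric concrete, the following is a minimal sketch of Mean Average Precision under the standard information retrieval definition. The abstract does not state which exact variant the study used, so treat this as an assumption. The function names and file names are hypothetical, and each ranked list is assumed to contain no duplicate files.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked_files: Sequence[str], relevant_files: Set[str]) -> float:
    """Average precision for one bug report (one query).

    ranked_files: files returned by a retrieval model, best match first.
    relevant_files: files known to be relevant to the bug (ground truth).
    """
    if not relevant_files:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, name in enumerate(ranked_files, start=1):
        if name in relevant_files:
            hits += 1
            # Precision at this rank, counted only where a relevant file appears.
            precision_sum += hits / rank
    # Relevant files that were never retrieved contribute zero.
    return precision_sum / len(relevant_files)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the per-bug average precision values."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical example: two bug reports with made-up file names.
example_runs = [
    (["A.java", "B.java", "C.java", "D.java"], {"A.java", "C.java"}),
    (["X.java", "Y.java", "Z.java"], {"Y.java"}),
]
example_map = mean_average_precision(example_runs)
```

In this example the first bug has average precision (1/1 + 2/3) / 2 and the second has 1/2, so a model that ranks relevant files nearer the top scores higher.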