Learning to rank duplicate bug reports

Authors:

Hongyu Zhang

CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management

Pages 852 - 861

https://doi.org/10.1145/2396761.2396869

Published: 29 October 2012

Abstract

The project team of a large and complex software system may receive a large number of bug reports, some of which are duplicates because they report essentially the same problem. Manually checking whether a newly reported bug duplicates an already reported one is often tedious and costly. In this paper, we propose BugSim, a method that automatically retrieves duplicate bug reports for a new bug report. BugSim is based on learning-to-rank concepts. We identify textual and statistical features of bug reports and propose a similarity function for bug reports based on these features. We then construct a training set by assembling pairs of duplicate and non-duplicate bug reports, and train the feature weights by applying the stochastic gradient descent algorithm over the training set. For a new bug report, we retrieve candidate duplicate reports using the trained model. We evaluate BugSim using more than 45,100 real bug reports from twelve Eclipse projects. The evaluation results show that the proposed method is effective: on average, the recall rate for the top 10 retrieved reports is 76.11%. Furthermore, BugSim outperforms the previous state-of-the-art methods that are implemented using SVM and BM25F_ext.
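The pipeline outlined in the abstract (a feature-based similarity function, weights trained by stochastic gradient descent on duplicate and non-duplicate pairs, then top-k retrieval scored by recall rate) can be sketched roughly as follows. This is an illustrative sketch only, not BugSim itself: the abstract does not list the features or the loss, so the three features (Jaccard overlap of summaries, Jaccard overlap of descriptions, same component), the linear similarity, the pairwise logistic loss, the report fields, and the toy reports are all assumptions made for the example.

```python
import math
import random

def jaccard(a, b):
    """Token-set overlap between two strings (assumed textual feature)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def features(q, d):
    """Hypothetical feature vector for a pair of bug reports (dicts)."""
    return [
        jaccard(q["summary"], d["summary"]),
        jaccard(q["description"], d["description"]),
        1.0 if q["component"] == d["component"] else 0.0,
    ]

def similarity(w, q, d):
    """Assumed linear similarity: weighted sum of the pair features."""
    return sum(wi * fi for wi, fi in zip(w, features(q, d)))

def train_sgd(triples, n_features=3, lr=0.1, epochs=50, seed=0):
    """SGD on a pairwise logistic loss log(1 + exp(-margin)), where
    margin = sim(q, duplicate) - sim(q, non_duplicate)."""
    rng = random.Random(seed)
    w = [0.0] * n_features
    triples = list(triples)
    for _ in range(epochs):
        rng.shuffle(triples)
        for q, pos, neg in triples:
            diff = [p - n for p, n in zip(features(q, pos), features(q, neg))]
            margin = sum(wi * di for wi, di in zip(w, diff))
            g = 1.0 / (1.0 + math.exp(margin))  # sigmoid(-margin)
            # Descent step: the loss gradient w.r.t. w is -g * diff.
            w = [wi + lr * g * di for wi, di in zip(w, diff)]
    return w

def top_k(w, q, candidates, k=10):
    """Candidates ranked by learned similarity, highest first."""
    return sorted(candidates, key=lambda d: similarity(w, q, d), reverse=True)[:k]

def recall_rate_at_k(w, queries, candidates, k=10):
    """Fraction of queries whose known duplicate appears in the top k."""
    hits = sum(
        any(d["id"] == q["dup_of"] for d in top_k(w, q, candidates, k))
        for q in queries
    )
    return hits / len(queries)

# Toy data, invented for illustration (not from the Eclipse data set).
reports = [
    {"id": 1, "summary": "editor crashes on save", "component": "UI",
     "description": "null pointer exception when saving a java file"},
    {"id": 2, "summary": "slow startup on linux", "component": "Platform",
     "description": "workbench takes minutes to start"},
    {"id": 3, "summary": "debugger ignores breakpoints", "component": "Debug",
     "description": "breakpoints in inner classes are skipped"},
]
new_a = {"id": 10, "dup_of": 1, "summary": "crash when saving in editor",
         "component": "UI",
         "description": "saving a java file throws null pointer exception"}
new_b = {"id": 11, "dup_of": 2, "summary": "startup is very slow",
         "component": "Platform",
         "description": "workbench needs minutes to start on linux"}

# Each triple: (new report, a known duplicate, a known non-duplicate).
triples = [(new_a, reports[0], reports[1]), (new_a, reports[0], reports[2]),
           (new_b, reports[1], reports[0]), (new_b, reports[1], reports[2])]

weights = train_sgd(triples)
recall = recall_rate_at_k(weights, [new_a, new_b], reports, k=1)
```

Training on (report, duplicate, non-duplicate) triples is one common way to turn labeled pairs into a ranking objective; a pointwise or listwise loss could be substituted without changing the retrieval and evaluation steps.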

Index Terms

Learning to rank duplicate bug reports
1. Information systems
  1. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
      1. Search engine indexing
2. Software and its engineering
  1. Software creation and management
    1. Software post-development issues
      1. Software reverse engineering

Published In

CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management

October 2012

2840 pages

ISBN: 9781450311564

DOI: 10.1145/2396761

General Chair: Xuewen Chen (Wayne State University, USA)

Program Chairs: Guy Lebanon (Georgia Institute of Technology), Haixun Wang (Microsoft Research Asia), Mohammed J. Zaki (Rensselaer Polytechnic Institute)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CIKM'12

CIKM'12: 21st ACM International Conference on Information and Knowledge Management

October 29 - November 2, 2012

Maui, Hawaii, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

Ghosh S, Grover K, Wong J, Bansal C, Namineni R, Verma M, Rajmohan S, Chua T, Ngo C, Kumar R, Lauw H, Ka-Wei Lee R (2024). Dependency Aware Incident Linking in Large Cloud Systems. Companion Proceedings of the ACM Web Conference 2024, pp. 141-150. Online publication date: 13-May-2024.
https://dl.acm.org/doi/10.1145/3589335.3648311
Zhang T, Han D, Vinayakarao V, Irsan I, Xu B, Thung F, Lo D, Jiang L (2023). Duplicate Bug Report Detection: How Far Are We? ACM Transactions on Software Engineering and Methodology 32(4), pp. 1-32. Online publication date: 27-May-2023.
https://dl.acm.org/doi/10.1145/3576042
Yu S, Fang C, Zhang Q, Cao Z, Yun Y, Cao Z, Mei K, Chen Z (2023). Mobile App Crowdsourced Test Report Consistency Detection via Deep Image-and-Text Fusion Understanding. IEEE Transactions on Software Engineering, pp. 1-20. Online publication date: 2023.
https://doi.org/10.1109/TSE.2023.3285787
Fazzini M, Moran K, Bernal-Cardenas C, Wendland T, Orso A, Poshyvanyk D (2023). Enhancing Mobile App Bug Reporting via Real-Time Understanding of Reproduction Steps. IEEE Transactions on Software Engineering 49(3), pp. 1246-1272. Online publication date: 1-Mar-2023.
https://doi.org/10.1109/TSE.2022.3174028
Liu J, He S, Chen Z, Li L, Kang Y, Zhang X, He P, Zhang H, Lin Q, Xu Z, Rajmohan S, Zhang D, Lyu M (2023). Incident-aware Duplicate Ticket Aggregation for Cloud Systems. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 2299-2311. Online publication date: May-2023.
https://doi.org/10.1109/ICSE48619.2023.00193
Li X, Yu S, Sun L, Liu Y, Fang C (2023). Towards Effective Bug Reproduction for Mobile Applications. 2023 10th International Conference on Dependable Systems and Their Applications (DSA), pp. 114-125. Online publication date: 10-Aug-2023.
https://doi.org/10.1109/DSA59317.2023.00024
Chauhan R, Sharma S, Goyal A (2023). DENATURE: duplicate detection and type identification in open source bug repositories. International Journal of System Assurance Engineering and Management 14(S1), pp. 275-292. Online publication date: 19-Jan-2023.
https://doi.org/10.1007/s13198-023-01855-x
Ehsan O, Khomh F, Zou Y, Qiu D (2023). Ranking code clones to support maintenance activities. Empirical Software Engineering 28(3). Online publication date: 25-Apr-2023.
https://dl.acm.org/doi/10.1007/s10664-023-10292-0
Zhao G, Liu J, Alencar Da Costa D, Zou Y, Shirani P, Onut I, Ng T, Kent K, Başar A, Onut I (2022). Adopting Learning-to-rank Algorithm for Reviewer Recommendation. Proceedings of the 32nd Annual International Conference on Computer Science and Software Engineering, pp. 22-31. Online publication date: 15-Nov-2022.
https://dl.acm.org/doi/10.5555/3566055.3566059
Wang Y, Wang J, Zhang H, Ming X, Shi L, Wang Q, Dwyer M, Damian D, Zeller A (2022). Where is your app frustrating users? Proceedings of the 44th International Conference on Software Engineering, pp. 2427-2439. Online publication date: 21-May-2022.
https://dl.acm.org/doi/10.1145/3510003.3510189