DOI: 10.1145/2396761.2396869
research-article

Learning to rank duplicate bug reports

Published: 29 October 2012

Abstract

For a large and complex software system, the project team may receive a large number of bug reports. Some of these reports are duplicates, as they essentially report the same problem. Manually checking whether a newly reported bug duplicates an already reported bug is often tedious and costly. In this paper, we propose BugSim, a method that automatically retrieves duplicate bug reports given a new bug report. BugSim is based on learning-to-rank concepts. We identify textual and statistical features of bug reports and propose a similarity function for bug reports based on these features. We then construct a training set by assembling pairs of duplicate and non-duplicate bug reports, and train the feature weights by applying the stochastic gradient descent algorithm over the training set. For a new bug report, we retrieve candidate duplicate reports using the trained model. We evaluate BugSim using more than 45,100 real bug reports from twelve Eclipse projects. The evaluation results show that the proposed method is effective: on average, the recall rate for the top 10 retrieved reports is 76.11%. Furthermore, BugSim outperforms the previous state-of-the-art methods, which are implemented using SVM and BM25Fext.
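The pipeline outlined in the abstract (a feature-based similarity function, weights trained by stochastic gradient descent on labeled pairs, top-k retrieval, and a recall-rate measure) can be illustrated with a minimal sketch. This is not the BugSim implementation: the two features (token Jaccard overlap and a same-component flag), the logistic loss, the report fields `id`, `text`, and `component`, and all function names are assumptions chosen for illustration only.

```python
import math
import random


def features(a, b):
    """Hypothetical feature vector for a pair of reports (not BugSim's feature set)."""
    ta = set(a["text"].lower().split())
    tb = set(b["text"].lower().split())
    union = ta | tb
    jaccard = len(ta & tb) / len(union) if union else 0.0
    same_component = 1.0 if a["component"] == b["component"] else 0.0
    return [1.0, jaccard, same_component]  # leading 1.0 acts as a bias term


def similarity(w, a, b):
    """Weighted sum of the features, squashed into (0, 1) by a sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, features(a, b)))
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(pairs, epochs=200, lr=0.5, seed=0):
    """Fit weights with SGD on (report_a, report_b, label) triples.

    label is 1 for a duplicate pair and 0 for a non-duplicate pair.
    The logistic loss is an assumption; the abstract does not name the loss.
    """
    rng = random.Random(seed)
    pairs = list(pairs)
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for a, b, y in pairs:
            x = features(a, b)
            err = similarity(w, a, b) - y  # d(log loss)/dz for one pair
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def retrieve(w, new_report, repository, k=10):
    """Return the k stored reports most similar to new_report."""
    ranked = sorted(repository,
                    key=lambda r: similarity(w, new_report, r),
                    reverse=True)
    return ranked[:k]


def recall_rate_at_k(w, queries, repository, k=10):
    """Fraction of queries whose known duplicate appears in the top k.

    queries is a list of (new_report, id_of_known_duplicate) tuples.
    """
    hits = sum(
        any(r["id"] == dup_id for r in retrieve(w, q, repository, k))
        for q, dup_id in queries
    )
    return hits / len(queries)
```

In this sketch, a recall rate of 76.11% at k = 10 would mean that for roughly three out of four new duplicate reports, the known original appears among the ten candidates returned by `retrieve`. The sigmoid does not change the ranking, since it is monotonic; it only makes the per-pair loss convenient to differentiate.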

Published In

CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge management
October 2012
2840 pages
ISBN: 9781450311564
DOI: 10.1145/2396761

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. bug reports
  2. duplicate bug retrieval
  3. duplicate documents
  4. learning to rank
  5. software maintenance

Qualifiers

  • Research-article

Conference

CIKM '12

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Article Metrics

  • Downloads (Last 12 months): 14
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 20 Feb 2025

Cited By

  • (2024) Dependency Aware Incident Linking in Large Cloud Systems. Companion Proceedings of the ACM Web Conference 2024, pp. 141-150. DOI: 10.1145/3589335.3648311. Online publication date: 13-May-2024
  • (2023) Duplicate Bug Report Detection: How Far Are We? ACM Transactions on Software Engineering and Methodology 32:4, pp. 1-32. DOI: 10.1145/3576042. Online publication date: 27-May-2023
  • (2023) Mobile App Crowdsourced Test Report Consistency Detection via Deep Image-and-Text Fusion Understanding. IEEE Transactions on Software Engineering, pp. 1-20. DOI: 10.1109/TSE.2023.3285787. Online publication date: 2023
  • (2023) Enhancing Mobile App Bug Reporting via Real-Time Understanding of Reproduction Steps. IEEE Transactions on Software Engineering 49:3, pp. 1246-1272. DOI: 10.1109/TSE.2022.3174028. Online publication date: 1-Mar-2023
  • (2023) Incident-aware Duplicate Ticket Aggregation for Cloud Systems. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 2299-2311. DOI: 10.1109/ICSE48619.2023.00193. Online publication date: May-2023
  • (2023) Towards Effective Bug Reproduction for Mobile Applications. 2023 10th International Conference on Dependable Systems and Their Applications (DSA), pp. 114-125. DOI: 10.1109/DSA59317.2023.00024. Online publication date: 10-Aug-2023
  • (2023) DENATURE: duplicate detection and type identification in open source bug repositories. International Journal of System Assurance Engineering and Management 14:S1, pp. 275-292. DOI: 10.1007/s13198-023-01855-x. Online publication date: 19-Jan-2023
  • (2023) Ranking code clones to support maintenance activities. Empirical Software Engineering 28:3. DOI: 10.1007/s10664-023-10292-0. Online publication date: 25-Apr-2023
  • (2022) Adopting Learning-to-rank Algorithm for Reviewer Recommendation. Proceedings of the 32nd Annual International Conference on Computer Science and Software Engineering, pp. 22-31. DOI: 10.5555/3566055.3566059. Online publication date: 15-Nov-2022
  • (2022) Where is your app frustrating users? Proceedings of the 44th International Conference on Software Engineering, pp. 2427-2439. DOI: 10.1145/3510003.3510189. Online publication date: 21-May-2022
