Abstract
The issue-tracking systems used by software projects contain issues, bugs, or tickets written by a wide variety of bug reporters, with different levels of training and knowledge about the system under development. Typically, reporters lack the skills or the time to search the issue-tracking system for similar issues already reported. As a result, many reports end up referring to the same issue, which makes the bug-report triaging process time-consuming and error-prone. Many researchers have approached the bug-deduplication problem using off-the-shelf information-retrieval (IR) tools. In this work, we extend the state of the art by investigating how contextual information about software-quality attributes, software-architecture terms, and system-development topics can be exploited to improve bug deduplication. We demonstrate the effectiveness of our contextual bug-deduplication method at ranking duplicates on the bug repositories of the Android, Eclipse, Mozilla, and OpenOffice software systems. Based on this experience, we conclude that taking into account domain-specific context can improve IR methods for bug deduplication.
References
Karan A, Rutgers T, Timbers F, Hindle A, Greiner R, Stroulia E (2015) Detecting duplicate bug reports with software engineering domain knowledge. In: Guéhéneuc Y-G, Adams B, Serebrenik A (eds) 22nd IEEE International Conference on Software Analysis, Evolution, and Reengineering, SANER 2015, Montreal, QC, Canada, March 2-6, 2015, pp 211–220. IEEE
Alipour A, Hindle A, Stroulia E (2013) A contextual approach towards more accurate duplicate bug report detection. In: Proceedings of the Tenth International Workshop on Mining Software Repositories, pp 183–192. IEEE Press
Anvik J, Hiew L, Murphy GC (2005) Coping with an open bug repository. In: Proceedings of the 2005 OOPSLA workshop on Eclipse technology eXchange, pp 35–39. ACM
Anvik J, Hiew L, Murphy GC (2006) Who should fix this bug? In: Proceedings of the 28th international conference on Software engineering, pp 361–370. ACM
Ayewah N, Pugh W (2010) The google findbugs fixit. In: Proceedings of the 19th international symposium on Software testing and analysis, pp 241–252. ACM
Bettenburg N, Premraj R, Zimmermann T, Kim S (2008) Duplicate bug reports considered harmful really? In: 2008 IEEE International Conference on Software Maintenance, ICSM 2008, pp 337–345. IEEE
Brown A, Wilson G (2011) The Architecture Of Open Source Applications. lulu.com
Buckley C, Voorhees EM (2000) Evaluating evaluation measure stability. ACM, New York, pp 33–40
Android Community (2013) Android Technical Information. http://source.android.com/tech/security/
Ernst NA, Mylopoulos J (2010) On the perception of software quality requirements during the project lifecycle. In: Wieringa R, Persson A (eds) Requirements Engineering: Foundation for Software Quality, volume 6182 of Lecture Notes in Computer Science, pp 143–157. Springer, Berlin
Grosskurth A, Godfrey MW (2006) Architecture and evolution of the modern web browser. Preprint submitted to Elsevier Science
Guana V, Rocha F, Hindle A, Stroulia E (2012) Do the stars align? Multidimensional analysis of Android’s layered architecture. In: 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), pp 124–127. IEEE
Han D, Zhang C, Fan X, Hindle A, Wong K, Stroulia E (2012) Understanding android fragmentation with topic analysis of vendor-specific bugs. IEEE
Hangal S, Lam MS (2002) Tracking down software bugs using automatic anomaly detection. In: Proceedings of the 24th international conference on Software engineering, pp 291–301. ACM
Hiew L (2006) Assisted detection of duplicate bug reports. PhD thesis, The University of British Columbia
Hindle A, Ernst NA, Godfrey MW, Mylopoulos J (2011) Automated topic naming to support cross-project analysis of software maintenance activities. In: Proceedings of the 8th Working Conference on Mining Software Repositories, pp 163–172. ACM
Holmes G, Donkin A, Witten IH (1994) Weka: A machine learning workbench. In: Proceedings of the 1994 Second Australian and New Zealand Conference on Intelligent Information Systems, 1994, pp 357–361. IEEE
Jalbert N, Weimer W (2008) Automated duplicate detection for bug tracking systems. In: 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC, DSN 2008, pp 52–61. IEEE
Kayed A, Hirzalla N, Samhan AA, Alfayoumi M (2009) Towards an ontology for software product quality attributes. In: ICIW’09 Fourth International Conference on Internet and Web Applications and Services, 2009, pp 200–204. IEEE
Langford J, Li L, Strehl A (2007) Vowpal wabbit online learning project
Sun Microsystems (2000) The openoffice.org source project: Technical overview. http://www.immagic.com/eLibrary/ARCHIVES/GENERAL/SUN/OPENOFCT.pdf
Monard MC, Batista GE (2002) Learning with skewed class distributions. Advances in Logic, Artificial Intelligence, and Robotics: LAPTEC 2002 85:173
Nagwani NK, Singh P (2009) Weight similarity measurement model based, object oriented approach for bug databases mining to detect similar and duplicate bugs. In: Proceedings of the International Conference on Advances in Computing, Communication and Control, pp 202–207. ACM
Nakashima T, Oyama M, Hisada H, Ishii N (1999) Analysis of software bug causes and its prevention. Inf Softw Technol 41(15):1059–1068
Nguyen AT, Nguyen TT, Nguyen TN, Lo D, Sun C (2012) Duplicate bug report detection with a combination of information retrieval and topic modeling. In: Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, pp 70–79. ACM
Ramage D, Hall D, Nallapati R, Manning CD (2009) Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora
Robertson SE, Walker S, Jones S, Hancock-Beaulieu MM, Gatford M et al (1995) Okapi at trec-3. NIST SPECIAL PUBLICATION SP:109–109
Runeson P, Alexandersson M, Nyholm O (2007) Detection of duplicate defect reports using natural language processing. In: 2007 29th International Conference on Software Engineering, ICSE 2007, pp 499–510. IEEE
Serrano N, Ciordia I (2005) Bugzilla, ITracker, and other bug trackers. IEEE Softw 22(2):11–13
Sun C, Lo D, Khoo SC, Jiang J (2011) Towards more accurate retrieval of duplicate bug reports. In: Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering, pp 253–262. IEEE Computer Society
Sun C, Lo D, Wang X, Jiang J, Khoo SC (2010) A discriminative model approach for accurate duplicate bug report retrieval. In: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 1, pp 45–54. ACM
Sureka A, Jalote P (2010) Detecting duplicate bug report using character n-gram-based features. In: 2010 17th Asia Pacific Software Engineering Conference (APSEC), pp 366–374. IEEE
Hulse JV, Khoshgoftaar TM, Napolitano A (2007) Experimental perspectives on learning from imbalanced data. In: Proceedings of the 24th international conference on Machine learning, pp 935–942. ACM
Wallace BC, Dahabreh IJ (2012) Class probability estimates are unreliable for imbalanced data (and how to fix them). In: ICDM, pp 695–704
Wallace BC, Small K, Brodley CE, Trikalinos TA (2011) Class imbalance, redux. In: 2011 IEEE 11th International Conference on Data Mining (ICDM), pp 754–763. IEEE
Wang X, Zhang L, Xie T, Anvik J, Sun J (2008) An approach to detecting duplicate bug reports using natural language and execution information. In: Proceedings of the 30th international conference on Software engineering, pp 461–470. ACM
Zaragoza H, Craswell N, Taylor MJ, Saria S, Robertson SE (2004) Microsoft cambridge at trec 13: Web and hard tracks. In: TREC, vol 4, pp 1–1. Citeseer
Acknowledgments
We would like to thank Sun et al. (2011) for sharing their Eclipse, OpenOffice, and Mozilla datasets with us. Abram Hindle and Eleni Stroulia were supported by NSERC Discovery Grants.
Additional information
Communicated by: Massimiliano Di Penta and Sung Kim
About this article
Cite this article
Hindle, A., Alipour, A. & Stroulia, E. A contextual approach towards more accurate duplicate bug report detection and ranking. Empir Software Eng 21, 368–410 (2016). https://doi.org/10.1007/s10664-015-9387-3
DOI: https://doi.org/10.1007/s10664-015-9387-3