Preventing duplicate bug reports by continuously querying bug reports

Hindle, Abram; Onuczko, Curtis

doi:10.1007/s10664-018-9643-4

Published: 20 August 2018

Volume 24, pages 902–936, (2019)

Empirical Software Engineering

Abstract

Bug deduplication, or duplicate bug report detection, is a hot topic in software engineering information retrieval research, but it is often not deployed. Typically, to de-duplicate bug reports, developers rely upon the search capabilities of the bug report software they employ, such as Bugzilla, Jira, or GitHub Issues. These search capabilities range from simple SQL string search to the IR-based word indexing methods employed by search engines. Yet too often these searches do very little to stop the creation of duplicate bug reports. Some bug trackers have more than 10% of their bug reports marked as duplicate. Perhaps these bug tracker search engines are not enough? In this paper we propose a method of attempting to prevent duplicate bug reports before they start: continuously querying. That is, as the bug reporter types in their bug report, their text is used to query the bug database to find duplicate or related bug reports. Continuously querying bug reports allows the reporter to be alerted to duplicate bug reports as they report the bug, rather than having to formulate queries to find the duplicate bug report. Thus this work ushers in a new way of evaluating bug report deduplication techniques, as well as a new kind of bug deduplication task. We show that simple IR measures can address this problem, but also that further research is needed to refine this novel process, which can be integrated into modern bug report systems.
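
The continuous querying idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes TF-IDF weighting with cosine similarity as one example of a simple IR measure, and the names `BugIndex`, `continuous_query`, the `step` parameter, and the sample reports are all hypothetical. Each time the draft grows by a few words, the text typed so far is re-issued as a query against the existing reports.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BugIndex:
    """Tiny in-memory TF-IDF index over existing bug reports (hypothetical)."""

    def __init__(self, reports):
        # reports: dict mapping bug id -> report text
        self.docs = {bug_id: Counter(tokenize(text)) for bug_id, text in reports.items()}
        n = len(self.docs)
        df = Counter(term for counts in self.docs.values() for term in counts)
        # Smoothed inverse document frequency; always positive.
        self.idf = {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}
        self.vectors = {bug_id: self._weigh(counts) for bug_id, counts in self.docs.items()}

    def _weigh(self, counts):
        # Terms never seen in the index are dropped; the result is unit length.
        vec = {t: c * self.idf[t] for t, c in counts.items() if t in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def query(self, text, top_k=3):
        """Return up to top_k (bug_id, cosine similarity) pairs, best first."""
        q = self._weigh(Counter(tokenize(text)))
        scores = []
        for bug_id, vec in self.vectors.items():
            score = sum(w * vec.get(t, 0.0) for t, w in q.items())
            if score > 0:
                scores.append((bug_id, score))
        return sorted(scores, key=lambda pair: -pair[1])[:top_k]


def continuous_query(index, draft, step=3, top_k=3):
    """Simulate typing: re-query the index each time the draft grows by `step` words."""
    words = draft.split()
    results = []
    for end in range(step, len(words) + step, step):
        prefix = " ".join(words[:end])
        results.append((prefix, index.query(prefix, top_k)))
    return results


# Made-up example reports, for illustration only.
reports = {
    101: "Application crashes when saving a file to a network drive",
    102: "Dark theme makes the toolbar icons unreadable",
    103: "Login page rejects valid passwords after update",
}
index = BugIndex(reports)
for prefix, matches in continuous_query(index, "crashes when saving file on network share"):
    print(repr(prefix), "->", [bug_id for bug_id, _ in matches])
```

In a real bug tracker the loop over prefixes would be replaced by events from the report form, and the candidate list would be shown to the reporter as they type; the word-count trigger used here is only one possible choice.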

Notes

Average mean average precision rolls off the tongue, but perhaps triple mean precision sounds better.
https://archive.org/details/2016-04-09ContinuousQueryData
https://bitbucket.org/abram/continuous-query
To install DüpeBuster visit https://bitbucket.org/abram/bugparty-docker and https://bitbucket.org/abram/bugparty/.
Datasets https://archive.org/details/2016-04-09ContinuousQueryData. Code: https://bitbucket.org/abram/continuous-query

Acknowledgments

This work was funded by an NSERC Discovery Grant, an NSERC Engage Grant, and a MITACS Accelerate Cluster Grant in conjunction with BioWare ULC. We would also like to thank prior reviewers and Ahmed Hassan.

Author information

Authors and Affiliations

Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada
Abram Hindle
BioWare ULC, Edmonton, Alberta, Canada
Curtis Onuczko

Corresponding author

Correspondence to Abram Hindle.

Additional information

Communicated by: David Lo

About this article

Cite this article

Hindle, A., Onuczko, C. Preventing duplicate bug reports by continuously querying bug reports. Empir Software Eng 24, 902–936 (2019). https://doi.org/10.1007/s10664-018-9643-4

Published: 20 August 2018
Issue Date: 15 April 2019
DOI: https://doi.org/10.1007/s10664-018-9643-4
