Skip to main content
Log in

Preventing duplicate bug reports by continuously querying bug reports

  • Published:
Empirical Software Engineering Aims and scope Submit manuscript

Abstract

Bug deduplication or duplicate bug report detection is a hot topic in software engineering information retrieval research, but it is often not deployed. Typically to de-duplicate bug reports developers rely upon the search capabilities of the bug report software they employ, such as Bugzilla, Jira, or Github Issues. These search capabilities range from simple SQL string search to IR-based word indexing methods employed by search engines. Yet too often these searches do very little to stop the creation of duplicate bug reports. Some bug trackers have more than 10% of their bug reports marked as duplicate. Perhaps these bug tracker search engines are not enough? In this paper we propose a method of attempting to prevent duplicate bug reports before they start: continuously querying. That is as the bug reporter types in their bug report their text is used to query the bug database to find duplicate or related bug reports. This continuously querying bug reports allows the reporter to be alerted to duplicate bug reports as they report the bug, rather than formulating queries to find the duplicate bug report. Thus this work ushers in a new way of evaluating bug report deduplication techniques, as well as a new kind of bug deduplication task. We show that simple IR measures can address this problem but also that further research is needed to refine this novel process that is integrate-able into modern bug report systems.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

Notes

  1. average mean average precision rolls off the tongue but perhaps triple mean precision sounds better.

  2. https://archive.org/details/2016-04-09ContinuousQueryData

  3. https://bitbucket.org/abram/continuous-query

  4. To install DüpeBuster visit https://bitbucket.org/abram/bugparty-docker and https://bitbucket.org/abram/bugparty/.

  5. Datasets https://archive.org/details/2016-04-09ContinuousQueryData. Code: https://bitbucket.org/abram/continuous-query

References

  • Aggarwal K, Timbers F, Rutgers T, Hindle A, Stroulia E, Greiner R (2017) Detecting duplicate bug reports with software engineering domain knowledge. Journal of Software: Evolution and Process 29:1–15. https://doi.org/10.1002/smr.1821 http://softwareprocess.ca/pubs/aggarwal2017JSEP.pdf E1821 smr.1821

    Article  Google Scholar 

  • Aggarwal K, Rutgers T, Timbers F, Hindle A, Greiner R, Stroulia E (2015) Detecting duplicate bug reports with software engineering domain knowledge. In: 22nd international conference on software analysis, evolution and reengineering (SANER), 2015 IEEE, pp 211–220. IEEE

  • Alipour A (2013) A contextual approach towards more accurate duplicate bug report detection. Master’s thesis University of Alberta

  • Alipour A, Hindle A, Stroulia E (2013) A contextual approach towards more accurate duplicate bug report detection. In: Proceedings of the Tenth International Workshop on Mining Software Repositories, pp 183–192. IEEE Press

  • Arasu A, Babu S, Widom J (2006) The cql continuous query language: semantic foundations and query execution. VLDB J 15(2):121–142

    Article  Google Scholar 

  • Asaduzzaman M, Roy CK, Schneider KA, Hou D (2014) Cscc: Simple, efficient, context sensitive code completion. In: 2014 IEEE International conference on software maintenance and evolution (ICSME), pp 71–80. IEEE

  • Bettenburg N, Premraj R, Zimmermann T, Kim S (2008) Duplicate bug reports considered harmful really?. In: IEEE international conference on software maintenance, 2008. ICSM 2008, pp 337–345. IEEE

  • Campbell JC, Santos EA, Hindle A (2016) The unreasonable effectiveness of traditional information retrieval in crash report deduplication. In: International Working Conference on Mining Software Repositories (MSR 2016), pp 269–280. https://doi.org/10.1145/2901739.2901766. http://softwareprocess.ca/pubs/campbell2016MSR-partycrasher.pdf

  • Chandrasekaran S, Cooper O, Deshpande A, Franklin MJ, Hellerstein JM, Hong W, Krishnamurthy S, Madden SR, Reiss F, Shah MA (2003) Telegraphcq: Continuous dataflow processing. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD ’03. ACM, New York, pp 668–668. http://doi.acm.org/10.1145/872757.872857

  • Chandrasekaran S, Franklin MJ (2002) Streaming queries over streaming data. In: Proceedings of the 28th International Conference on Very Large Data Bases, VLDB ’02. VLDB Endowment, pp 203–214. http://dl.acm.org/citation.cfm?id=1287369.1287388

  • Deshmukh JMAK, Podder S, Sengupta S, Dubash N (2017) Towards accurate duplicate bug retrieval using deep learning techniques. In: 2017 IEEE International conference on software maintenance and evolution (ICSME), pp 115–124. https://doi.org/10.1109/ICSME.2017.69

  • Google (2016) Google suggestion service https://goo.gl/4sFq8n

  • Haiduc S (2014) Supporting query formulation for text retrieval applications in software engineering. In: 30th IEEE International Conference on Software Maintenance and Evolution, Victoria, BC, Canada, September 29 - October 3, 2014, pp 657–662. IEEE Computer Society. https://doi.org/10.1109/ICSME.2014.117

  • Harman M, Mansouri SA, Zhang Y (2012) Search-based software engineering: Trends, techniques and applications. ACM Comput Surv 45(1):11:1–11:61. https://doi.org/10.1145/2379776.2379787. http://doi.acm.org/10.1145/2379776.2379787

    Article  Google Scholar 

  • Jalbert N, Weimer W (2008) Automated duplicate detection for bug tracking systems. In: IEEE International Conference on dependable systems and networks with FTCS and DCC, 2008. DSN 2008, pp 52–61. IEEE

  • Kao B, Garcia-Molina H (1994) An overview of real-time database systems. In: Real time computing, pp 261–282. Springer

  • Klein N, Corley CS, Kraft NA (2014) New features for duplicate bug detection. In: MSR, pp 324–327

  • Lazar A, Ritchey S, Sharif B (2014) Improving the accuracy of duplicate bug report detection using textual similarity measures. In: Proceedings of the 11th Working Conference on Mining Software Repositories, pp 308–311. ACM

  • Lukins SK, Kraft NA, Etzkorn LH (2008) Source code retrieval for bug localization using latent dirichlet allocation. In: Proceedings of the 2008 15th Working Conference on Reverse Engineering, WCRE ’08. IEEE Computer Society, Washington, pp 155–164. https://doi.org/10.1109/WCRE.2008.33

  • Manning CD, Schütze H (1999) Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge. http://nlp.stanford.edu/fsnlp/

    MATH  Google Scholar 

  • Nguyen AT, Nguyen TT, Nguyen TN, Lo D, Sun C (2012) Duplicate bug report detection with a combination of information retrieval and topic modeling. In: Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, pp 70–79. ACM

  • Panichella A, Dit B, Oliveto R, Penta MD, Poshyvanyk D, Lucia AD (2016) Parameterizing and assembling ir-based solutions for SE tasks using genetic algorithms. In: IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering, SANER 2016, Suita, Osaka, Japan, March 14-18, 2016, pp 314–325. IEEE Computer Society. https://doi.org/10.1109/SANER.2016.97

  • Ponzanelli L, Bacchelli A, Lanza M (2013) Seahawk: Stack overflow in the ide. In: Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13. IEEE Press, Piscataway, pp 1295–1298. http://dl.acm.org/citation.cfm?id=2486788.2486988

  • Ponzanelli L, Bavota G, Di Penta M, Oliveto R, Lanza M (2014) Mining stackoverflow to turn the ide into a self-confident programming prompter. In: Proceedings of the 11th Working Conference on Mining Software Repositories, MSR 2014. ACM, New York, pp 102–111. http://doi.acm.org/10.1145/2597073.2597077

  • Porter M (1980) An algorithm for suffix stripping. Program 14(3):130–137. https://doi.org/10.1108/eb046814 http://www.emeraldinsight.com/doi/abs/10.1108/eb046814

    Article  Google Scholar 

  • Rakha MS, Bezemer CP, Hassan AE (2017) Revisiting the performance evaluation of automated approaches for the retrieval of duplicate issue reports. IEEE Trans Softw Eng PP(99):1–1. https://doi.org/10.1109/TSE.2017.2755005

    Google Scholar 

  • Rakha MS, Bezemer CP, Hassan AE (2018) Revisiting the performance of automated approaches for the retrieval of duplicate reports in issue tracking systems that perform just-in-time duplicate retrieval Empirical Software Engineering

  • Rakha MS, Shang W, Hassan AE (2015) Studying the needed effort for identifying duplicate issues. Empirical Software Engineering pp 1–30. https://doi.org/10.1007/s10664-015-9404-6

  • Řehůřek R, Sojka P (2010) Software Framework for Topic Modelling with Large Corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, pp 45–50. http://is.muni.cz/publication/884893/en

  • Řehůřek R, Sojka P (2018) models.tfidfmodel — TF-IDF model. https://radimrehurek.com/gensim/models/tfidfmodel.html (retrieved, March 2018)

  • Robertson S, Walker S, Jones S, Hancock-Beaulieu MM, Gatford M (1995) Okapi at trec–3. In: Overview of the Third Text REtrieval Conference (TREC–3), pp 109–126. Gaithersburg, MD: NIST. https://www.microsoft.com/en-us/research/publication/okapi-at-trec-3/

  • Rocha H, De Oliveira G, Marques-Neto H, Valente MT (2015) Nextbug: a bugzilla extension for recommending similar bugs. Journal of Software Engineering Research and Development 3(1):1–14

    Article  Google Scholar 

  • Runeson P, Alexandersson M, Nyholm O (2007) Detection of duplicate defect reports using natural language processing. In: 29th international conference on Software engineering, 2007. ICSE 2007, pp 499–510. IEEE

  • Sabor KK, Hamou-Lhadj A, Larsson A (2017) Durfex: a feature extraction technique for efficient detection of duplicate bug reports. In: 2017 IEEE international conference on software quality, reliability and security (QRS), pp 240–250. IEEE

  • Shah MA, Hellerstein JM, Chandrasekaran S, Franklin MJ (2003) Flux: an adaptive partitioning operator for continuous query systems. In: Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405), pp 25–36. https://doi.org/10.1109/ICDE.2003.1260779

  • Sun C, Lo D, Khoo SC, Jiang J (2011) Towards more accurate retrieval of duplicate bug reports. In: Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering, pp 253–262. IEEE Computer Society

  • Sun C, Lo D, Wang X, Jiang J, Khoo SC (2010) A discriminative model approach for accurate duplicate bug report retrieval. In: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 1, pp 45–54. ACM

  • Sureka A, Jalote P (2010) Detecting duplicate bug report using character n-gram-based features. In: Software engineering conference (APSEC), 2010 17th asia pacific, pp 366–374. IEEE

  • Tange O (2011) Gnu parallel - the command-line power tool. ;login: The USENIX Magazine 36(1), pp 42–47. http://www.gnu.org/s/parallel

  • Thung F, Kochhar PS, Lo D (2014) Dupfinder: Integrated tool support for duplicate bug report detection. In: Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, ASE ’14. http://doi.acm.org/10.1145/2642937.2648627. ACM, New York, pp 871–874

  • Wang S, Lo D, Lawall J (2014) Compositional vector space models for improved bug localization. In: 2014 IEEE international conference on software maintenance and evolution (ICSME), pp 171–180. IEEE

  • Wang X, Zhang L, Xie T, Anvik J, Sun J (2008) An approach to detecting duplicate bug reports using natural language and execution information. In: Proceedings of the 30th international conference on Software engineering, pp 461–470. ACM

  • White RW, Marchionini G (2007) Examining the effectiveness of real-time query expansion. Inf Process Manag 43(3):685–704. https://doi.org/10.1016/j.ipm.2006.06.005. http://www.sciencedirect.com/science/article/pii/S0306457306000951. Special Issue on Heterogeneous and Distributed IR

    Article  Google Scholar 

  • Zhang Y, Lo D, Xia X, Sun JL (2015) Multi-factor duplicate question detection in stack overflow. J Comput Sci Technol 30(5):981–997. https://doi.org/10.1007/s11390-015-1576-4

    Article  Google Scholar 

Download references

Acknowledgments

This work was funded by an NSERC Discovery Grant, NSERC Engage Grant, and a MITACS Accelerate Cluster Grant in conjunction with Bioware ULC. We would also like to thank prior reviewers and Ahmed Hassan.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Abram Hindle.

Additional information

Communicated by: David Lo

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Hindle, A., Onuczko, C. Preventing duplicate bug reports by continuously querying bug reports. Empir Software Eng 24, 902–936 (2019). https://doi.org/10.1007/s10664-018-9643-4

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10664-018-9643-4

Keywords

Navigation