Studying the needed effort for identifying duplicate issues

Rakha, Mohamed Sami; Shang, Weiyi; Hassan, Ahmed E.

doi:10.1007/s10664-015-9404-6

Published: 04 November 2015

Volume 21, pages 1960–1989, (2016)

Empirical Software Engineering

Abstract

Many recent software engineering papers have examined duplicate issue reports. Thus far, duplicate reports have been considered a hindrance to developers and a drain on their resources. As a result, prior research in this area focuses on proposing automated approaches to accurately identify duplicate reports. However, no studies have attempted to quantify the actual effort that is spent on identifying duplicate issue reports. In this paper, we empirically examine the effort that is needed for manually identifying duplicate reports in four open source projects, i.e., Firefox, SeaMonkey, Bugzilla and Eclipse-Platform. Our results show that: (i) More than 50% of the duplicate reports are identified within half a day. Most of the duplicate reports are identified without any discussion and with the involvement of very few people; (ii) A classification model built using a set of factors that are extracted from duplicate issue reports classifies duplicates according to the effort that is needed to identify them with a precision of 0.60 to 0.77, a recall of 0.23 to 0.96, and an ROC area of 0.68 to 0.80; and (iii) Factors that capture the developer awareness of the duplicate issue’s peers (i.e., other duplicates of that issue) and textual similarity of a new report to prior reports are the most influential factors in our models. Our findings highlight the need for effort-aware evaluation of approaches that identify duplicate issue reports, since the identification of a considerable number of duplicate reports (over 50%) appears to be a relatively trivial task for developers. To better assist developers, research on identifying duplicate issue reports should put greater emphasis on assisting developers in identifying effort-consuming duplicate issues.
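
For readers unfamiliar with the three measures quoted in the abstract, the following is a minimal sketch of how precision, recall, and ROC area can be computed for a binary classifier. It is not the authors' pipeline: the labels, scores, and the 0.5 threshold are hypothetical toy values, and the function names `precision_recall` and `roc_area` are chosen here for illustration only.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_area(y_true, scores):
    """ROC area computed as the fraction of (positive, negative) pairs
    in which the positive example gets the higher score (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (assumption, not from the paper):
# 1 = effort-consuming duplicate, 0 = easily identified duplicate.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

precision, recall = precision_recall(y_true, y_pred)
auc = roc_area(y_true, scores)
print(precision, recall, auc)
```

Precision and recall depend on the chosen threshold, whereas the ROC area summarizes ranking quality across all thresholds, which is one reason a study may report all three.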

Notes

https://bugzilla.mozilla.org/show_bug.cgi?id=312782
https://bugzilla.mozilla.org/show_bug.cgi?id=65305
Issue triaging is the task of determining if an issue report describes a meaningful new problem or enhancement, so it can be assigned to an appropriate developer for further handling (Anvik et al. 2006).
Replication package: http://sailhome.cs.queensu.ca/replication/EMSE2015_DuplicateReports/
Release notes for Bugzilla 4.0: https://www.bugzilla.org/releases/4.0/release-notes.html

References

Aggarwal K, Rutgers T, Timbers F, Hindle A, Greiner R, Stroulia E (2015) Detecting duplicate bug reports with software engineering domain knowledge. In: SANER 2015: International conference on software analysis, evolution and reengineering. IEEE, pp 211–220
Alipour A, Hindle A, Stroulia E (2013) A contextual approach towards more accurate duplicate bug report detection. In: MSR 2013: Proceedings of the 10th working conference on mining software repositories, pp 183–192
Angrist JD, Pischke JS (2008) Mostly harmless econometrics: An empiricist’s companion. Princeton University Press, Princeton
Anvik J, Hiew L, Murphy GC (2005) Coping with an open bug repository. In: Eclipse 2005: Proceedings of the 2005 OOPSLA Workshop on Eclipse Technology eXchange. ACM, pp 35–39
Anvik J, Hiew L, Murphy GC (2006) Who should fix this bug? In: ICSE 2006: Proceedings of the 28th international conference on software engineering. ACM, pp 361–370
Bertram D, Voida A, Greenberg S, Walker R (2010) Communication, collaboration, and bugs: The social nature of issue tracking in small, collocated teams. In: CSCW 2010: Proceedings of the ACM conference on computer supported cooperative work. ACM, pp 291–300
Bettenburg N, Just S, Schröter A, Weiss C, Premraj R, Zimmermann T (2007) Quality of bug reports in eclipse. In: Eclipse 2007: Proceedings of the 2007 OOPSLA Workshop on Eclipse Technology eXchange. ACM, New York, pp 21–25
Bettenburg N, Just S, Schröter A, Weiss C, Premraj R, Zimmermann T (2008a) What makes a good bug report? In: SIGSOFT ’08/FSE-16: Proceedings of the 16th ACM SIGSOFT international symposium on foundations of software engineering. ACM, New York, pp 308–318
Bettenburg N, Premraj R, Zimmermann T, Kim S (2008b) Duplicate bug reports considered harmful really? In: ICSM 2008: Proceedings of the IEEE international conference on software maintenance. IEEE, pp 337–345
Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Cavalcanti YC, Da Mota Silveira Neto PA, de Almeida ES, Lucrédio D, da Cunha CEA, de Lemos Meira SR (2010) One step more to understand the bug report duplication problem. In: SBES 2010: Brazilian symposium on software engineering. IEEE, pp 148–157
Cavalcanti YC, Neto PAdMS, Lucrédio D, Vale T, de Almeida ES, de Lemos Meira SR (2013) The bug report duplication problem: an exploratory study. Softw Qual J 21(1):39–66
Chavent M, Kuentz V, Liquet B, Saracco J (2015) Variable Clustering. http://svitsrv25.epfl.ch/R-doc/library/Hmisc/html/varclus.html
Davidson JL, Mohan N, Jensen C (2011) Coping with duplicate bug reports in free/open source software projects. In: VL/HCC 2011: IEEE symposium on visual languages and Human-Centric computing. IEEE, pp 101–108
Deerwester SC, Dumais ST, Landauer TK, Furnas GW, Harshman RA (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407
Ghotra B, McIntosh S, Hassan AE (2015) Revisiting the impact of classification techniques on the performance of defect prediction models. In: ICSE 2015: Proceedings of the 37th international conference on software engineering
Jalbert N, Weimer W (2008) Automated duplicate detection for bug tracking systems. In: DSN 2008: Proceedings of the IEEE international conference on dependable systems and networks with FTCS and DCC. IEEE, pp 52–61
Jiang Y, Cukic B, Menzies T (2008) Can data transformation help in the detection of fault-prone modules? In: Proceedings of the 2008 workshop on Defects in large software systems. ACM, pp 16–20
Kamei Y, Matsumoto S, Monden A, Matsumoto Ki, Adams B, Hassan AE (2010) Revisiting common bug prediction findings using effort-aware models. In: ICSM 2010: IEEE international conference on software maintenance. IEEE, pp 1–10
Kampstra P, et al. (2008) Beanplot: A boxplot alternative for visual comparison of distributions
Kanaris I, Kanaris K, Houvardas I, Stamatatos E (2007) Words versus character n-grams for anti-spam filtering. Int J Artif Intell Tools 16(06):1047–1067
Kanerva P, Kristofersson J, Holst A (2000) Random indexing of text samples for latent semantic analysis. In: Proceedings of the 22nd annual conference of the cognitive science society, vol 1036. Citeseer
Kaushik N, Tahvildari L (2012) A comparative study of the performance of ir models on duplicate bug detection. In: CSMR 2012: Proceedings of the 16th European conference on software maintenance and reengineering. IEEE Computer Society, pp 159–168
Koponen T (2006) Life cycle of defects in open source software projects. In: Open Source Systems. Springer, pp 195–200
Lazar A, Ritchey S, Sharif B (2014) Improving the accuracy of duplicate bug report detection using textual similarity measures. In: MSR 2014: Proceedings of the 11th working conference on mining software repositories. ACM, pp 308–311
Lerch J, Mezini M (2013) Finding duplicates of your yet unwritten bug report. In: CSMR 2013: 17th European conference on software maintenance and reengineering. IEEE, pp 69–78
Lessmann S, Baesens B, Mues C, Pietsch S (2008) Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Trans Softw Eng 34(4):485–496
Liaw A, Wiener M (2014) Random Forest R package. http://cran.r-project.org/web/packages/randomForest/randomForest.pdf
McIntosh S, Kamei Y, Adams B, Hassan AE (2015) An empirical study of the impact of modern code review practices on software quality. Empirical Software Engineering 1–44
Mitchell MW (2011) Bias of the random forest out-of-bag (oob) error for certain input parameters. Open J Stat 1(03):205
Mockus A, Weiss DM (2000) Predicting risk of software changes. Bell Labs Tech J 5(2):169–180
Nagwani NK, Singh P (2009) Weight similarity measurement model based, object oriented approach for bug databases mining to detect similar and duplicate bugs. In: ICAC 2009: Proceedings of the international conference on advances in computing, communication and control. ACM, pp 202–207
Neter J, Kutner MH, Nachtsheim CJ, Wasserman W (1996) Applied linear statistical models, vol. 4. Irwin, Chicago
Prifti T, Banerjee S, Cukic B (2011) Detecting bug duplicate reports through local references. In: PROMISE 2011: Proceedings of the 7th international conference on predictive models in software engineering. ACM, pp 8:1–8:9
Robertson S, Zaragoza H, Taylor M (2004) Simple bm25 extension to multiple weighted fields. In: CIKM 2004: Proceedings of the Thirteenth ACM international conference on information and knowledge management. ACM, pp 42–49
Robnik-Šikonja M (2004) Improving random forests. In: Machine Learning: ECML 2004. Springer, pp 359–370
Runeson P, Alexandersson M, Nyholm O (2007) Detection of duplicate defect reports using natural language processing. In: ICSE 2007: Proceedings of the 29th international conference on software engineering. IEEE Computer Society, pp 499–510
Scott A, Knott M (1974) A cluster analysis method for grouping means in the analysis of variance. Biometrics 507–512
Sun C, Lo D, Khoo SC, Jiang J (2011) Towards more accurate retrieval of duplicate bug reports. In: ASE 2011: Proceedings of the 26th IEEE/ACM international conference on automated software engineering. IEEE, pp 253–262
Sun C, Lo D, Wang X, Jiang J, Khoo SC (2010) A discriminative model approach for accurate duplicate bug report retrieval. In: ICSE 2010: Proceedings of the 32nd ACM/IEEE international conference on software engineering. ACM, pp 45–54
Sureka A, Jalote P (2010) Detecting duplicate bug report using character n-gram-based features. In: APSEC 2010: Proceedings of the Asia Pacific software engineering conference. IEEE Computer Society, pp 366–374
Tantithamthavorn C, McIntosh S, Hassan AE, Ihara A, Matsumoto Ki, Ghotra B, Kamei Y, Adams B, Morales R, Khomh F, et al. (2015) The impact of mislabelling on the performance and interpretation of defect prediction models. In: ICSE 2015: Proceedings of the 37th international conference on software engineering
Wang X, Zhang L, Xie T, Anvik J, Sun J (2008) An approach to detecting duplicate bug reports using natural language and execution information. In: ICSE 2008: Proceedings of the 30th international conference on software engineering. ACM, pp 461–470
Xavier R, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, Müller M (2015) pROC R package. http://cran.r-project.org/web/packages/pROC/pROC.pdf

Author information

Authors and Affiliations

Software Analysis and Intelligence Lab (SAIL), School of Computing, Queen’s University, Kingston, Ontario, Canada
Mohamed Sami Rakha, Weiyi Shang & Ahmed E. Hassan

Corresponding author

Correspondence to Mohamed Sami Rakha.

Additional information

Communicated by: Emerson Murphy-Hill

About this article

Cite this article

Rakha, M.S., Shang, W. & Hassan, A.E. Studying the needed effort for identifying duplicate issues. Empir Software Eng 21, 1960–1989 (2016). https://doi.org/10.1007/s10664-015-9404-6

Published: 04 November 2015
Issue Date: October 2016
DOI: https://doi.org/10.1007/s10664-015-9404-6
