Abstract
In crowdsourced mobile application testing, crowd workers perform test tasks for developers and submit test reports describing the abnormal behaviors they observe. These test reports usually provide important information for improving software quality. However, due to workers' limited expertise and the inconvenience of editing on mobile devices, test reports often lack the information necessary for understanding and reproducing the revealed bugs. Developers sometimes have to spend a significant share of their available resources handling low-quality test reports, which severely reduces inspection efficiency. In this paper, to help developers decide whether a test report should be selected for inspection within limited resources, we introduce a new problem: test report quality assessment. To model the quality of test reports, we propose a new framework named TERQAF. First, we systematically summarize a set of desirable properties that characterize expected test reports and define measurable indicators to quantify these properties. Then, we determine the numerical values of the indicators from the contents of each test report. Finally, we train a classifier using logistic regression to predict test report quality. To validate the effectiveness of TERQAF, we conduct extensive experiments on five crowdsourced test report datasets. The experimental results show that TERQAF achieves, on average, 85.18% Macro-average Precision (MacroP), 75.87% Macro-average Recall (MacroR), and 80.01% Macro-average F-measure (MacroF) in test report quality assessment. The empirical results also demonstrate that test report quality assessment can help developers handle test reports more efficiently.
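The pipeline sketched in the abstract (quantify indicators per report, train a logistic-regression classifier, evaluate with macro-averaged metrics) can be illustrated as follows. This is a minimal sketch, not the authors' implementation: the indicator names, feature values, and labels below are hypothetical, standing in for TERQAF's actual measurable indicators.

```python
# Illustrative sketch of test report quality assessment, assuming
# each report is already reduced to a vector of indicator values.
# Indicator semantics here are hypothetical, e.g.
# [readability, has_reproduction_steps, has_screenshot, length_score].
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score

X_train = [
    [0.9, 1, 1, 0.8],   # detailed, reproducible report
    [0.8, 1, 0, 0.7],
    [0.2, 0, 0, 0.1],   # sparse report lacking key information
    [0.3, 0, 1, 0.2],
]
y_train = [1, 1, 0, 0]  # 1 = high quality, 0 = low quality

# Train the quality classifier, as the framework does with
# logistic regression over indicator values.
clf = LogisticRegression().fit(X_train, y_train)

X_test = [[0.85, 1, 1, 0.9], [0.25, 0, 0, 0.15]]
y_test = [1, 0]
y_pred = clf.predict(X_test)

# Macro-averaged metrics, matching the paper's MacroP/MacroR/MacroF.
macro_p = precision_score(y_test, y_pred, average="macro")
macro_r = recall_score(y_test, y_pred, average="macro")
macro_f = f1_score(y_test, y_pred, average="macro")
print(macro_p, macro_r, macro_f)
```

Macro-averaging computes each metric per class and then takes the unweighted mean, so the reported scores are not dominated by whichever quality class happens to be more frequent in a dataset.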
Acknowledgments
We are grateful to the developers who devoted their precious time to evaluating and inspecting the quality of test reports. We also thank José M. Fuentes for his help in conducting this work. This work is partially supported by the National Key Research and Development Program of China under grant no. 2018YFB1003900, and by the National Natural Science Foundation of China under grants no. 61902096, 61972359, 61370144, 61722202, 61403057, and 61772107.
Additional information
Communicated by: Massimiliano Di Penta and David D. Shepherd
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article belongs to the Topical Collection: Software Analysis, Evolution and Reengineering (SANER)
Cite this article
Chen, X., Jiang, H., Li, X. et al. A systemic framework for crowdsourced test report quality assessment. Empir Software Eng 25, 1382–1418 (2020). https://doi.org/10.1007/s10664-019-09793-8