Abstract
Various software-engineering problems have been solved by crowdsourcing. In many projects, the software outsourcing process is streamlined on cloud-based platforms. Among software engineering tasks, test-case development is particularly suitable for crowdsourcing, because a large number of test cases can be generated at little monetary cost. However, test cases harvested from crowdsourcing vary widely in quality. Owing to the large volume, distinguishing the high-quality tests by traditional techniques is computationally expensive. Therefore, crowdsourced testing would benefit from an efficient mechanism that distinguishes the qualities of the test cases. This paper introduces an automated approach — TCQA — to evaluate the quality of test cases based on the onsite coding history. Quality assessment by TCQA proceeds through three steps: (1) modeling the code history as a time series, (2) extracting multiple relevant features from the time series, and (3) building a model that classifies the test cases based on their qualities. Step (3) is accomplished by feature-based machine-learning techniques. By leveraging the onsite coding history, TCQA can assess the test-case quality without performing expensive source-code analysis or executing the test cases. Using the data of nine test-development tasks involving more than 400 participants, we evaluated TCQA from multiple perspectives. The TCQA approach assessed the quality of the test cases with higher precision, faster speed, and lower overhead than conventional test-case quality-assessment techniques. Moreover, TCQA provided real-time insights on test-case quality before the assessment was finished.
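The three steps above can be sketched in code. The snippet below is a minimal illustration, not the authors' implementation: the feature set, the synthetic coding histories, and the function names are hypothetical, and a random forest stands in for the unspecified "feature-based machine-learning techniques".

```python
# Hypothetical TCQA-style pipeline: (1) model each participant's coding
# history as a time series of code-size snapshots, (2) extract simple
# statistical features, (3) classify test-case quality with a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_features(series):
    """Step (2): summarize one coding-history time series as a feature vector.
    The specific features here are illustrative, not from the paper."""
    s = np.asarray(series, dtype=float)
    diffs = np.diff(s)
    return np.array([
        s.mean(), s.std(), s.max() - s.min(),          # distribution stats
        np.abs(diffs).mean() if diffs.size else 0.0,   # mean edit-to-edit change
        len(s),                                        # number of snapshots
    ])

# Step (1), with synthetic data: "high-quality" histories grow steadily,
# "low-quality" histories are short and erratic (1 = high, 0 = low).
rng = np.random.default_rng(0)
histories, labels = [], []
for _ in range(50):
    histories.append(np.cumsum(rng.integers(1, 5, size=30)))  # steady growth
    labels.append(1)
    histories.append(rng.integers(0, 40, size=8))             # short, erratic
    labels.append(0)

X = np.vstack([extract_features(h) for h in histories])
clf = RandomForestClassifier(n_estimators=50, random_state=0)  # step (3)
clf.fit(X, labels)
print(clf.score(X, labels))  # training accuracy on the synthetic data
```

Because no test case is executed and no source code is parsed, such a pipeline runs in time proportional to the length of the coding histories, which is what makes the real-time assessment described in the abstract plausible.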
Acknowledgements
This work was partly supported by National Key Research and Development Program of China (Grant No. 2018YFB1403400) and National Natural Science Foundation of China (Grant Nos. 61690201, 61772014).
Cite this article
Zhao, Y., Feng, Y., Wang, Y. et al. Quality assessment of crowdsourced test cases. Sci. China Inf. Sci. 63, 190102 (2020). https://doi.org/10.1007/s11432-019-2859-8