Crowdsourced test report prioritization considering bug severity
Introduction
Crowdsourced testing has become a popular research field in software engineering [1], [2], [3], [4]. It recruits global online labor through an open call to accomplish various types of software testing tasks, including functional testing, usability testing, security testing, and user experience testing. Unlike conventional software testing, crowdsourced testing can rapidly obtain large volumes of test feedback to accommodate rapid iterative software development. Recently, many commercial testing platforms have emerged, such as TestIO, Crowdsprint, Rainforest, Testbirds, Baidu Crowd Test, and Alibaba Crowd Test.
In crowdsourced testing, crowd workers who submit test reports usually receive rewards based on the quality or quantity of bugs revealed by a report. Driven by these financial and other incentives, each test project receives a massive number of test reports, and effectively inspecting them becomes an unavoidable problem for task requesters [5]. The difficulties include a high duplication rate, low quality, and language incompatibility, so identifying duplicate and low-quality test reports by inspecting all reports takes considerable time and cost. To alleviate these challenges, researchers have proposed solutions such as report clustering [4], [6], [7], prioritization [8], [9], and classification [1], [10], [11]. With clustering, test reports are assigned to clusters containing representative information according to the similarity of their text and image content. With prioritization, we can repeatedly pick a test report dissimilar to the one currently under review, improving the rate of detecting unique bugs. Moreover, classification techniques have been presented to identify whether a test report actually reveals a fault. Based on the above schemes, the most useful information in the reports can be delivered to developers quickly.
In practice, the information in test reports spans multiple input modalities, such as text descriptions, screenshots, short operation videos, and voice messages. Different modalities typically carry different kinds of information: textual descriptions usually record the detailed operations and bug behavior observed during testing, whereas images generally illustrate software activity views. Most researchers leverage the text and image modalities to process test reports, and they prioritize a test report according to its text description [4], [7], [8]. Images are considered one of the most fundamental information carriers on the mobile platform [12], [13]. Intuitively, screenshots serve as additional information that demonstrates erroneous behavior more directly, reflecting the currently problematic activity view, so our prioritization technique mainly depends on image information.
Furthermore, existing prioritization methods merely focus on the rate of fault detection while neglecting the severity of individual faults. Bug severity is an important attribute that refers to the degree of impact a defect has on the development or operation of a component or system [14]. Generally, bugs can be classified into different severity levels based on their impact on the system, and for developers and end-users it matters that critical bugs are fixed on a priority basis. Prior researchers have proposed several models to identify bug severity and processing priority. These models are based on textual processing of historical bug reports and classification techniques, most of which combine feature selection methods with machine learning algorithms [15], [16], [17]. Even though these techniques perform well on severity assessment, they may continuously recommend test reports that reveal the same bug categories. Moreover, most existing severity prediction approaches focus on prioritizing critical bugs while ignoring the rate of detecting all bugs throughout the inspection process [16], [17].
To overcome these limitations, we propose a novel crowdsourced test report prioritization method that considers bug severity (CTRP). First, we use a natural language processing tool to extract keywords from the text information. For the image information, we employ the spatial pyramid matching (SPM) method to extract screenshot features and a distance matrix, and then use locality-sensitive hashing (LSH) to map each image feature vector to an index value, where reports sharing an index value have similar images. For the remaining test reports not yet added to the hash table, we compare the Jaccard distance between the keyword sets of their texts; when two reports have the smallest Jaccard distance, we assign them the same index. Finally, in the ranking phase, we iteratively choose a different group in the hash table as candidates, and then recommend the candidate test report with the largest text information entropy.
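The grouping and ranking steps can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names, the toy reports, and the round-robin traversal of hash buckets are our assumptions.

```python
import math
from collections import Counter

def jaccard_distance(a, b):
    """Jaccard distance between two keyword sets: 1 - |A ∩ B| / |A ∪ B|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def text_entropy(keywords):
    """Shannon entropy of the keyword frequency distribution of one report."""
    counts = Counter(keywords)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def prioritize(buckets):
    """Iterate over hash buckets round-robin; within each bucket,
    recommend the report with the largest keyword entropy first."""
    ranked = [sorted(b, key=lambda r: text_entropy(r["keywords"]), reverse=True)
              for b in buckets]
    order = []
    while any(ranked):
        for bucket in ranked:
            if bucket:
                order.append(bucket.pop(0)["id"])
    return order

# Toy hash table: two buckets of already-grouped reports.
buckets = [
    [{"id": 1, "keywords": ["crash", "crash", "login"]},
     {"id": 2, "keywords": ["crash"]}],
    [{"id": 3, "keywords": ["ui", "color", "font"]}],
]
order = prioritize(buckets)  # → [1, 3, 2]
```

`jaccard_distance` would be used to place reports without a matching image index into the bucket of their nearest text neighbor; alternating between buckets approximates the paper's goal of surfacing dissimilar reports early.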
The main contributions of this paper are:
- We present a novel approach to prioritize crowdsourced test reports. Based on the classification of the test reports, the approach builds a hash table over all processed reports and then applies a prioritization algorithm based on the information entropy of text keywords.
- We conduct rigorous experiments on six industrial crowdsourcing project datasets containing more than 1600 test reports and 1400 screenshots. Compared with existing techniques, our technique performs well on the rate of detecting faults and also prioritizes high-severity bug reports.
- To assist users in applying our technique, we analyze the sensitivity of its critical parameters and, based on the results, give a guideline on optimal parameter settings.
The remainder of this paper is organized as follows: In Section 2, we introduce the technical background and research motivation. In Section 3, we present the details of our technique framework. In Section 4, we evaluate our technical framework and introduce the experimental setup. In Section 5, we analyze the experiment results to answer research questions and discuss threats to the validity of our technical framework. In Section 6, we review prior related research on this topic. In Section 7, we draw conclusions and outline some directions for future work.
Section snippets
Background
In this section, we introduce the background of crowdsourced testing and the motivation that inspired us to conduct this study.
Approach
This section elaborates on the details of our method. Fig. 1 shows the framework of our approach, which mainly contains three phases: (1) feature extraction, (2) hash table building, and (3) prioritization. In the process of feature extraction, we assume the test reports only consist of two parts: text description and screenshots, which will be handled separately. Then we build a hash table to store all test reports depending on a distance measure. Finally, we devise a prioritization strategy
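The hash-table-building phase can be illustrated with a small sketch. Random-hyperplane hashing here stands in for the paper's LSH over SPM screenshot features; the function name, the number of hyperplanes, and the toy feature vectors are assumptions for illustration only.

```python
import numpy as np

def lsh_index(features, n_planes=8, seed=0):
    """Random-hyperplane LSH: map each row (feature vector) to an
    integer bucket index; vectors pointing in similar directions
    tend to share an index."""
    rng = np.random.default_rng(seed)
    dim = features.shape[1]
    planes = rng.standard_normal((n_planes, dim))   # random hyperplanes
    bits = (features @ planes.T) > 0                # sign pattern per vector
    powers = 1 << np.arange(n_planes)               # 1, 2, 4, ...
    return bits.astype(int) @ powers                # pack bits into an index

# Toy screenshot features: rows 0 and 1 point the same way, row 2 opposite.
feats = np.array([[1.0, 0.0],
                  [2.0, 0.0],
                  [-1.0, 0.0]])
idx = lsh_index(feats)
```

Reports whose screenshot features land in the same bucket are treated as revealing similar activity views, which is what lets the ranking phase alternate between buckets instead of comparing every pair of images.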
Experiments
To validate our approach, we conduct extensive experiments on real data. All experiments are run on a personal computer with a 2.80 GHz Intel processor, 24 GB of memory, and an Nvidia GTX-1060-3G GPU. In this section, we first present the research questions, then give a detailed introduction to the experimental subjects, baselines, and metrics, and finally analyze the parameter settings of our technique.
Results analysis
In this section, we analyze the experimental results to answer the research questions.
Related works
In this section, we summarize the primary relevant research in the context of processing crowdsourced test reports.
Conclusion
In this paper, we proposed a novel prioritization technique (CTRP) to help developers review the overwhelming number of reports in crowdsourced testing. Our preliminary investigation found that image information is more representative in identifying crowdsourced test reports, and almost all existing prioritization techniques only adopt the most straightforward random strategy for sampling reports. This motivates us to design an approach that considers image information as possible by grouping
CRediT authorship contribution statement
Yao Tong: Conceptualization, Methodology, Software, Data curation, Writing - original draft. Xiaofang Zhang: Investigation, Validation, Resources, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is partially supported by the National Natural Science Foundation of China (61772263, 61872177, 61972289, 61832009), the National Key R&D Program of China (Grant No. 2018YFB1003901), the Collaborative Innovation Center of Novel Software Technology and Industrialization, China, and the Priority Academic Program Development of Jiangsu Higher Education Institutions, China.
References (52)
- et al., A survey of the use of crowdsourcing in software engineering, J. Syst. Softw. (2017)
- et al., Domain adaptation for test report classification in crowdsourced testing
- et al., CTRAS: crowdsourced test report aggregation and summarization
- et al., Automatically translating bug reports into test cases for mobile apps
- et al., Clustering crowdsourced test reports of mobile applications using image understanding, IEEE Trans. Softw. Eng. (2020)
- et al., Research progress of crowdsourced software testing, J. Softw. (2018)
- et al., Fuzzy clustering of crowdsourced test reports for apps, ACM Trans. Internet Technol. (2018)
- et al., Clustering study of crowdsourced test report with multi-source heterogeneous information
- et al., Test report prioritization to assist crowdsourced testing
- et al., Multi-objective test report prioritization using image understanding
- Towards effectively test report classification to assist crowdsourced testing
- Local-based active classification of test report to assist crowdsourced testing
- Visual attention based image browsing on mobile devices
- Hessian regularized support vector machines for mobile image annotation on the cloud, IEEE Trans. Multimedia
- Determining bug severity using machine learning techniques
- An empirical study on improving severity prediction of defect reports using feature selection
- Predicting the severity of bug reports based on feature selection, Int. J. Softw. Eng. Knowl. Eng.
- SMOTE and feature selection for more effective bug severity prediction, Int. J. Softw. Eng. Knowl. Eng.
- Bug reports for desktop software and mobile apps in GitHub: What’s the difference?, IEEE Softw.
- Coverage-based clustering and scheduling approach for test case prioritization, IEICE Trans. Inf. Syst.
- Test case prioritization using requirements-based clustering
- A mathematical theory of communication, ACM SIGMOBILE Mob. Comput. Commun. Rev.
- Automatic transition of natural language software requirements specification into formal presentation
- An approach to detecting duplicate bug reports using natural language and execution information
- Metaphor identification using verb and noun clustering
- Verb noun construction MWE token classification