Crowdsourced test report prioritization considering bug severity

https://doi.org/10.1016/j.infsof.2021.106668

Abstract

In crowdsourced testing, a large number of test reports are generated in a short time, so efficiently inspecting these reports becomes one of the critical steps in the testing process. In recent years, many automated techniques such as clustering, classification, and prioritization have emerged to provide an automated inspection order over test reports. Even though these methods achieve good performance, they do not consider the relative importance of image and text information. At the same time, existing prioritization approaches focus only on the rate of detecting faults and ignore the severity of those faults. In fact, bug severity is a vital indicator, provided by users, that flags the criticality of a bug, and developers can use it to set their priorities in the resolution process. For these reasons, this paper presents a novel prioritization approach for crowdsourced test reports. It extracts features from the text and screenshot information of the test reports, uses a hashing technique to index the reports, and finally applies a dedicated prioritization algorithm. To validate our approach, we conducted experiments on six industrial projects. The results and the hypothesis analysis show that, compared with existing methods, our approach detects all faults faster within a limited time and prioritizes reports that reveal higher-severity faults.

Introduction

Crowdsourced testing has become a popular research field in software engineering [1], [2], [3], [4]. It recruits global online labor to accomplish various types of software testing tasks through an open call, including functional testing, usability testing, security testing, and user experience testing. Unlike conventional software testing, crowdsourced testing can rapidly obtain a large amount of test feedback to accommodate rapid iterative software development. Recently, many commercial testing platforms have emerged, such as TestIO, Crowdsprint, Rainforest, Testbirds, Baidu Crowd Test, and Alibaba Crowd Test.

In crowdsourced testing, crowd workers who submit test reports usually receive rewards based on the quality or quantity of the bugs a report reveals. Consequently, each testing project receives a massive number of test reports, driven by financial and other reward incentives. How to effectively inspect these test reports becomes an unavoidable problem for task requesters [5]. The difficulties include a high duplication rate, low report quality, and language incompatibility, so identifying duplicate and low-quality reports while inspecting all test reports usually costs considerable time and effort. To alleviate these challenges, researchers have proposed many solutions such as report clustering [4], [6], [7], prioritization [8], [9], and classification [1], [10], [11]. With clustering, test reports are assigned to different clusters containing representative information according to the similarity of their text and image information. With prioritization, we can repeatedly find a test report that is dissimilar to the report currently under review, improving the rate of detecting unique bugs. Moreover, classification techniques have been presented to identify whether a test report actually reveals a fault. Based on the above schemes, developers can quickly obtain the most useful information from the reports.

In practice, the information in test reports spans multiple input modalities such as text descriptions, screenshots, short operation videos, and voice messages. Different modalities typically carry different kinds of information: textual descriptions usually contain detailed operations and buggy behavior observed in the testing process, whereas images generally illustrate software activity views. Most researchers leverage the text and image modalities to process test reports, and they also prioritize test reports according to their text descriptions [4], [7], [8]. Images are considered one of the most fundamental information carriers on mobile platforms [12], [13]. Intuitively, screenshots demonstrate erroneous behavior more directly and, as additional information, reflect the currently problematic activity view, so we design our prioritization technique to depend mainly on image information.

Further, existing prioritization methods merely focus on the rate of fault detection while neglecting the severity of individual faults. Bug severity is an important attribute that refers to the degree of impact a defect has on the development or operation of a component or system [14]. Generally, bugs can be classified into different severity levels based on their impact on the system, and for both developers and end-users it is important that critical bugs be fixed first. Prior research has proposed several models to identify bug severity and its processing priority. These models are based on textual processing of historical bug reports and classification techniques, most of which combine feature selection methods with machine learning algorithms [15], [16], [17]. Even though these techniques perform well on severity assessment, they may repeatedly recommend test reports that reveal the same bug categories. Moreover, most traditional severity prediction approaches focus on prioritizing critical bugs while ignoring the rate of detecting all bugs throughout the inspection process [16], [17].
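To make that pipeline style concrete, the following is a minimal sketch in the spirit of [15], [16], [17]: TF-IDF text features, chi-squared feature selection, and a linear classifier. It is an illustration only; the sample summaries, labels, and parameter choices below are hypothetical and do not reproduce the cited models' configurations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data: report summaries with historical severity labels.
summaries = ["app crashes on login", "minor typo in settings label"]
severities = ["critical", "trivial"]

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),  # textual processing
    SelectKBest(chi2, k=2),                 # feature selection
    LinearSVC(),                            # classification
)
model.fit(summaries, severities)
print(model.predict(["app crashes when opening the camera"]))  # -> ['critical']
```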

To overcome these limitations, we propose a novel crowdsourced test report prioritization method that considers bug severity (CTRP). First, we use a natural language processing tool to extract keywords from the text descriptions. For image information, we employ the spatial pyramid matching (SPM) method to extract screenshot features and compute a distance matrix. We then apply locality-sensitive hashing (LSH) to map each image feature vector to an index value, where identical index values indicate similar images. For the remaining test reports that are not yet in the hash table, we compare the Jaccard distances between their text keyword sets: each such report receives the same index as the report whose keyword set is at the smallest Jaccard distance from its own. Finally, in the ranking phase, we iteratively choose a different group in the hash table as the candidate set, and from the candidates we recommend the test report with the largest text information entropy.
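As an illustration of this ranking phase, the sketch below implements the Jaccard distance, the text information entropy, and the round-robin selection over hash buckets. It assumes a hypothetical in-memory representation: `buckets` maps each hash index to a list of report dicts, and each report carries a `keywords` list. This is one reading of the description above, not the authors' implementation.

```python
import math
from collections import Counter

def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance between two keyword sets: 1 - |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def text_entropy(keywords: list) -> float:
    """Shannon entropy of a report's keyword frequency distribution."""
    counts = Counter(keywords)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def prioritize(buckets: dict) -> list:
    """Visit a different bucket on each step (round-robin) and recommend
    the not-yet-inspected report with the largest text entropy."""
    ordered = {
        idx: sorted(reports, key=lambda r: text_entropy(r["keywords"]), reverse=True)
        for idx, reports in buckets.items()
    }
    ranking = []
    while any(ordered.values()):
        for idx in list(ordered):
            if ordered[idx]:
                ranking.append(ordered[idx].pop(0))
    return ranking

# Hypothetical usage: two buckets produced by the hashing step.
buckets = {
    0: [{"id": "r1", "keywords": ["crash", "login", "crash"]}],
    1: [{"id": "r2", "keywords": ["layout", "overlap"]},
        {"id": "r3", "keywords": ["layout", "layout"]}],
}
print([r["id"] for r in prioritize(buckets)])  # -> ['r1', 'r2', 'r3']
```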

The main contributions of this paper are:

  • We present a novel approach to prioritize crowdsourced test reports. According to the classification results of test reports, the approach builds a hash table for all processed test reports. Then we design a prioritization algorithm based on the information entropy of text keywords.

  • We conduct rigorous experiments on six industrial crowdsourcing project data sets, which contain more than 1600 test reports and 1400 screenshots. The results show that, compared with other existing techniques, our technique performs well in the rate of detecting faults and also prioritizes high-severity bug reports.

  • To assist users in applying our technique, we analyze the sensitivity of its critical parameters. Based on the results, we give guidelines for optimal parameter settings.

The remainder of this paper is organized as follows: In Section 2, we introduce the technical background and research motivation. In Section 3, we present the details of our technique framework. In Section 4, we evaluate our technical framework and introduce the experimental setup. In Section 5, we analyze the experiment results to answer research questions and discuss threats to the validity of our technical framework. In Section 6, we review prior related research on this topic. In Section 7, we draw conclusions and outline some directions for future work.

Section snippets

Background

In this section, we introduce the background of crowdsourced testing and the motivation that inspired us to conduct this study.

Approach

This section elaborates on the details of our method. Fig. 1 shows the framework of our approach, which mainly contains three phases: (1) feature extraction, (2) hash table building, and (3) prioritization. In the feature extraction phase, we assume that test reports consist of only two parts, text descriptions and screenshots, which are handled separately. Then we build a hash table to store all test reports based on a distance measure. Finally, we devise a prioritization strategy …
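The paper applies LSH to the SPM image features, but the snippet above does not pin down a specific hash family. Below is a minimal random-hyperplane (SimHash-style) sketch, assuming `features` is a NumPy array with one feature vector per screenshot; the resulting integer keys define the buckets used by the ranking sketch shown earlier.

```python
import numpy as np
from collections import defaultdict

def lsh_index(features: np.ndarray, n_bits: int = 16, seed: int = 0) -> list:
    """Map each row (one feature vector per screenshot) to an integer hash
    key; rows whose keys collide are treated as similar."""
    rng = np.random.default_rng(seed)
    # One random hyperplane per bit; the sign of each projection gives a bit.
    planes = rng.standard_normal((n_bits, features.shape[1]))
    bits = features @ planes.T > 0
    return [int("".join("1" if b else "0" for b in row), 2) for row in bits]

def build_buckets(keys: list) -> dict:
    """Group report indices by their hash key."""
    buckets = defaultdict(list)
    for report_id, key in enumerate(keys):
        buckets[key].append(report_id)
    return dict(buckets)

# Hypothetical usage with random stand-in features (one row per screenshot).
feats = np.random.default_rng(1).standard_normal((100, 512))
print(build_buckets(lsh_index(feats, n_bits=8)))
```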

Experiments

To validate our approach, we conduct extensive experiments on real data. All experiments are run on a personal computer with a 2.80 GHz Intel processor, 24 GB of memory, and an Nvidia GTX-1060-3G GPU. In this section, we first present the research questions of our experiments. We then give a detailed introduction to the experimental subjects, baselines, and metrics. Finally, we analyze the parameter settings of our technique.

Results analysis

In this section, we analyze the experimental results to answer the research questions.

Related works

In this section, we summarize the primary relevant research in the context of processing crowdsourced test reports.

Conclusion

In this paper, we proposed a novel prioritization technique (CTRP) to help developers review the overwhelming number of reports in crowdsourced testing. Our preliminary investigation found that image information is more representative for identifying crowdsourced test reports, and that almost all existing prioritization techniques adopt only the most straightforward random strategy for sampling reports. This motivated us to design an approach that considers image information as much as possible by grouping …

CRediT authorship contribution statement

Yao Tong: Conceptualization, Methodology, Software, Data curation, Writing - original draft. Xiaofang Zhang: Investigation, Validation, Resources, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (61772263, 61872177, 61972289, 61832009), the National Key R&D Program of China (Grant No. 2018YFB1003901), the Collaborative Innovation Center of Novel Software Technology and Industrialization, China, and the Priority Academic Program Development of Jiangsu Higher Education Institutions, China.

References (52)

  • Mao, K., et al. A survey of the use of crowdsourcing in software engineering. J. Syst. Softw. (2017).

  • Wang, J., et al. Domain adaptation for test report classification in crowdsourced testing.

  • Hao, R., et al. CTRAS: crowdsourced test report aggregation and summarization.

  • Fazzini, M., et al. Automatically translating bug reports into test cases for mobile apps.

  • Liu, D., et al. Clustering crowdsourced test reports of mobile applications using image understanding. IEEE Trans. Softw. Eng. (2020).

  • Zhang, X., et al. Research progress of crowdsourced software testing. J. Softw. (2018).

  • Jiang, H., et al. Fuzzy clustering of crowdsourced test reports for apps. ACM Trans. Internet Technol. (2018).

  • Yang, Y., et al. Clustering study of crowdsourced test report with multi-source heterogeneous information.

  • Feng, Y., et al. Test report prioritization to assist crowdsourced testing.

  • Feng, Y., et al. Multi-objective test report prioritization using image understanding.

  • Wang, J., et al. Towards effectively test report classification to assist crowdsourced testing.

  • Wang, J., et al. Local-based active classification of test report to assist crowdsourced testing.

  • Fan, X., et al. Visual attention based image browsing on mobile devices.

  • Tao, D., et al. Hessian regularized support vector machines for mobile image annotation on the cloud. IEEE Trans. Multimedia (2013).

  • Chaturvedi, K.K., et al. Determining bug severity using machine learning techniques.

  • Yang, C., et al. An empirical study on improving severity prediction of defect reports using feature selection.

  • Liu, W., et al. Predicting the severity of bug reports based on feature selection. Int. J. Softw. Eng. Knowl. Eng. (2018).

  • Hamdy, A., et al. SMOTE and feature selection for more effective bug severity prediction. Int. J. Softw. Eng. Knowl. Eng. (2019).

  • Zhang, T., et al. Bug reports for desktop software and mobile apps in GitHub: What’s the difference? IEEE Softw. (2019).

  • Fu, W., et al. Coverage-based clustering and scheduling approach for test case prioritization. IEICE Trans. Inf. Syst. (2017).

  • Arafeen, M.J., et al. Test case prioritization using requirements-based clustering.

  • Shannon, C.E. A mathematical theory of communication. ACM SIGMOBILE Mob. Comput. Commun. Rev. (2001).

  • Ilieva, M.G., et al. Automatic transition of natural language software requirements specification into formal presentation.

  • Wang, X., et al. An approach to detecting duplicate bug reports using natural language and execution information.

  • Shutova, E., et al. Metaphor identification using verb and noun clustering.

  • Diab, M.T., et al. Verb noun construction MWE token classification.
