Elsevier

Journal of Systems and Software

Volume 147, January 2019, Pages 172-214

Controversy corner
Robustness of spectrum-based fault localisation in environments with labelling perturbations

https://doi.org/10.1016/j.jss.2018.09.091

Highlights

  • This is the first work to explore the robustness of SBFL under labelling perturbations.

  • Influence of labelling perturbations on three relations is theoretically analysed.

  • The effect of multiple mislabelled cases on the output ranking is theoretically analysed.

  • The robustness of 23 classes of risk evaluation formulas is empirically studied.

  • A new evaluation metric for SBFL is proposed.

Abstract

Most fault localisation techniques take as inputs a faulty program and a test suite, and produce as output a ranked list of suspicious code locations at which the program may be defective. If only a small portion of the executions are labelled erroneously, we expect a fault localisation technique to be robust to these errors. However, it is not known which fault localisation techniques with high accuracy are robust and which techniques are best at finding faults under the trade-off between accuracy and robustness.

In this paper, we first present a theoretical analysis of the impacts of labelling perturbations on spectrum-based fault localisation (SBFL) techniques. We theoretically analyse the influence of labelling perturbations on three relations among risk evaluation formulas and the effect of mislabelled cases on the ranking of faulty statements. Then, we conduct controlled experiments on 18 programs with 3079 faulty versions from different domains to compare the robustness of 23 classes of risk evaluation formulas. In addition, experiments are conducted to evaluate the robustness of two neural network-based techniques. The impacts of perturbation degrees, number of faults and types of labelling perturbation on the robustness of formulas are empirically studied, and several interesting findings are obtained.

Introduction

Software systems are found everywhere in daily life, and software failures are frequently encountered. Many failures are due to faults (or bugs) that are introduced into programs during development, and debugging is a very effective way of identifying their presence. It is commonly recognized that debugging is important but resource-consuming, and fault localisation is one of its essential activities. Because it involves a substantial amount of manual work, fault localisation is a very resource-consuming task in the software development lifecycle. Therefore, many researchers have proposed automatic and effective techniques for fault localisation to decrease its cost and to increase software quality.

One promising approach towards fault localisation is spectrum-based fault localisation (referred to as SBFL in this article). SBFL refers to the automatic mechanism for predicting fault positions in a faulty program by analysing the dynamic program spectra that are captured in program runs. Typically, a program element that is always exercised in failed runs and never exercised in passed runs has a high chance of explaining the observed failures and is deemed very suspicious, in terms of relations to one or more faults. Many heuristics (Jones and Harrold, 2005, Abreu et al., 2007, Gong et al., 2012, Steimann et al., 2013, Neelofar et al., 2017) and mathematical models (Liblit et al., 2005, Liu et al., 2006, Zhang et al., 2011, Gore and Reynolds, 2012, Tang et al., 2017) have been proposed.
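The mechanism described above can be illustrated with a minimal sketch. It assumes a statement-level coverage matrix and pass/fail labels, and uses the Jaccard risk evaluation formula, a_ef / (a_ef + a_nf + a_ep). The toy data and the names `spectrum_counts`, `rank_statements`, `COVERAGE` and `FAILED` are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Tuple

Counts = Tuple[int, int, int, int]  # (a_ef, a_ep, a_nf, a_np)


def spectrum_counts(coverage: List[List[int]], failed: List[bool]) -> List[Counts]:
    """For each statement, count failed/passed runs that do or do not execute it."""
    counts = []
    for s in range(len(coverage[0])):
        a_ef = a_ep = a_nf = a_np = 0
        for row, is_failed in zip(coverage, failed):
            if row[s] and is_failed:
                a_ef += 1          # executed in a failed run
            elif row[s]:
                a_ep += 1          # executed in a passed run
            elif is_failed:
                a_nf += 1          # not executed in a failed run
            else:
                a_np += 1          # not executed in a passed run
        counts.append((a_ef, a_ep, a_nf, a_np))
    return counts


def jaccard(a_ef: int, a_ep: int, a_nf: int, a_np: int) -> float:
    """Jaccard risk evaluation formula; 0.0 when the denominator is zero."""
    denom = a_ef + a_nf + a_ep
    return a_ef / denom if denom else 0.0


def rank_statements(coverage: List[List[int]], failed: List[bool],
                    formula: Callable[..., float]) -> Tuple[List[float], List[int]]:
    """Return the suspiciousness scores and statement indices, most suspicious first."""
    scores = [formula(*c) for c in spectrum_counts(coverage, failed)]
    order = sorted(range(len(scores)), key=lambda s: scores[s], reverse=True)
    return scores, order


# Hypothetical toy data: rows are test runs, columns are statements.
# Statement 2 is assumed to be the faulty one.
COVERAGE = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
]
FAILED = [False, True, False, True, False]

scores, order = rank_statements(COVERAGE, FAILED, jaccard)
print(order)  # statement indices, most suspicious first
```

In this toy spectrum the statement executed by both failed runs and by only one passed run receives the highest score, matching the intuition stated in the paragraph.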

SBFL has received substantial attention due to its simplicity and effectiveness. It takes as inputs a faulty program and a test suite and produces as output a ranked list of suspicious code locations at which the program may be defective. Ideally, the input of SBFL would be free of perturbations, so that the output of an SBFL technique would not be distorted by them. Unfortunately, perturbations can occur in a real testing process. They may introduce errors into the obtained test information, which are propagated by SBFL techniques and cause unexpected results. This raises a question: will a small perturbation of the input cause a large variation in the output?

Various types of perturbations can arise in the process of testing and debugging. As a first step towards studying the above problem, we focus on labelling perturbations in this work. This type of perturbation is caused by the incorrect labelling of a small number of test cases in a test suite, for example mislabelling a test case as passed although it actually failed, or vice versa. It could be very common due to factors such as human error, imperfect development of test systems, and differences between the test environment and the actual execution environment. We expect a fault localisation technique to be robust to such perturbations. However, it has been observed that even a small labelling perturbation may have a substantial impact on the results of fault localisation. In the example shown in Table 1, although only one test case is mislabelled as failed, the faulty statement is given the lowest suspiciousness degree by Naish1, which has been evaluated as one of the “maximal” risk evaluation formulas under the single-fault scenario. The ranking of the faulty statement drops from the first to almost the last. For other formulas, for example Jaccard, a similar situation can be seen in Table 1.
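The effect described in this paragraph can be reproduced on a small hypothetical spectrum (this is not the data of Table 1). The sketch below assumes the usual definition of Naish1, which scores a statement -1 unless every failed run executes it and otherwise returns a_np, and it assumes a worst-case tie-breaking rule for the rank. The names `counts`, `worst_case_rank`, `rank_of_fault`, `COVERAGE` and `FAULTY` are illustrative only.

```python
def counts(coverage, failed, s):
    """Return (a_ef, a_ep, a_nf, a_np) for statement s."""
    a_ef = sum(1 for row, f in zip(coverage, failed) if row[s] and f)
    a_ep = sum(1 for row, f in zip(coverage, failed) if row[s] and not f)
    a_nf = sum(1 for row, f in zip(coverage, failed) if not row[s] and f)
    a_np = sum(1 for row, f in zip(coverage, failed) if not row[s] and not f)
    return a_ef, a_ep, a_nf, a_np


def naish1(a_ef, a_ep, a_nf, a_np):
    # -1 if some failed run does not execute the statement, otherwise a_np
    return -1 if a_nf > 0 else a_np


def jaccard(a_ef, a_ep, a_nf, a_np):
    denom = a_ef + a_nf + a_ep
    return a_ef / denom if denom else 0.0


def worst_case_rank(scores, faulty):
    # Assumption: ties are broken pessimistically, so every statement whose
    # score is at least that of the faulty statement is examined first.
    return sum(1 for x in scores if x >= scores[faulty])


def rank_of_fault(coverage, failed, formula, faulty):
    n = len(coverage[0])
    scores = [formula(*counts(coverage, failed, s)) for s in range(n)]
    return worst_case_rank(scores, faulty)


# Hypothetical single-fault spectrum: statement 2 is the faulty one.
COVERAGE = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
]
FAULTY = 2
correct = [False, True, False, True, False]
# One passed run (the first) is mislabelled as failed.
perturbed = [True, True, False, True, False]

for name, formula in (("Naish1", naish1), ("Jaccard", jaccard)):
    before = rank_of_fault(COVERAGE, correct, formula, FAULTY)
    after = rank_of_fault(COVERAGE, perturbed, formula, FAULTY)
    print(f"{name}: rank {before} -> {after}")
```

Because the mislabelled run does not execute the faulty statement, the statement is no longer covered by every "failed" run, which is the condition Naish1 relies on in the single-fault setting.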

In this paper, we first provide a theoretical investigation of the impacts of labelling perturbations on the accuracy of risk evaluation formulas. The preservation of three relations among the formulas, namely, strict equivalence relations (Naish et al., 2011), Xie and Chen's equivalence relations (Xie et al., 2013a), and Xie and Chen's order relations (Xie et al., 2013a), is proved under labelling perturbations. In addition, we theoretically study how labelling perturbations influence the ranked list output by the formulas, both in the scenario in which all mislabelling activities are in the same direction and in the scenario in which they are in different directions.
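The two mislabelling scenarios can be made concrete with a small sketch that flips a chosen number of labels in each direction. This is only an illustration under assumed conventions (labels stored as booleans, `True` meaning failed); the helper `perturb_labels` and its parameters are hypothetical and do not reproduce the procedure used in the paper's experiments.

```python
import random
from typing import List


def perturb_labels(failed: List[bool], pass_to_fail: int = 0,
                   fail_to_pass: int = 0, seed: int = 0) -> List[bool]:
    """Return a copy of the labels with the requested number of flips.

    pass_to_fail: passed runs to mislabel as failed.
    fail_to_pass: failed runs to mislabel as passed.
    Raises ValueError if more flips are requested than runs are available.
    """
    rng = random.Random(seed)  # seeded so that a perturbation is repeatable
    passed_idx = [i for i, f in enumerate(failed) if not f]
    failed_idx = [i for i, f in enumerate(failed) if f]
    flipped = rng.sample(passed_idx, pass_to_fail) + rng.sample(failed_idx, fail_to_pass)
    perturbed = list(failed)
    for i in flipped:
        perturbed[i] = not perturbed[i]
    return perturbed


labels = [False] * 8 + [True] * 2  # 8 passed runs, 2 failed runs

# Same direction: two passed runs mislabelled as failed.
same_direction = perturb_labels(labels, pass_to_fail=2)

# Different directions: one flip each way.
mixed_direction = perturb_labels(labels, pass_to_fail=1, fail_to_pass=1)

print(sum(same_direction), sum(mixed_direction))  # number of runs labelled failed
```

Note that a mixed-direction perturbation can leave the total number of failed labels unchanged while still altering which runs carry them.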

To further explore the impacts of labelling perturbations on different risk evaluation formulas, we conducted controlled experiments using the Siemens suite, UNIX utility software, space and Defects4J. The robustness of 23 classes of risk evaluation formulas and the factors that affect it, including perturbation degrees, number of faults and types of labelling perturbation, are empirically studied. We observe that (1) different risk evaluation formulas usually have different robustness values, and these values are neither positively nor negatively correlated with their Expense; (2) the robustness of most risk evaluation formulas decreases as the perturbation degree increases; (3) most formulas show an increasing trend of robustness as the number of faults increases; (4) on average, the impact of mislabelling passed cases as failed cases is greater than that of mislabelling failed cases as passed cases. Based on these findings, a new metric is proposed for evaluating risk evaluation formulas by jointly considering their robustness and accuracy. Experiments show the rationality of the metric. We also perform experiments to evaluate the robustness of two neural network-based fault localisation techniques.

The rest of this article is organized as follows: Section 2 provides the problem description and motivation of this work. Section 3 introduces the theoretical analysis from two aspects. Section 4 presents an empirical study on 18 programs with 3079 faulty versions from different domains. Section 5 discusses the threats to the validity of this work. In Section 6, a review of previous theoretical and empirical studies is presented. Finally, the conclusions of this work are presented in Section 7.

Section snippets

Spectrum-based fault localisation and its labelling perturbations

Spectrum-based fault localisation (SBFL) refers to the automatic mechanism for predicting potential fault positions in a faulty program by analysing the dynamic program spectra that are captured in program runs. With the approach, each structural element in the program is assigned a suspiciousness value that corresponds to the relative likelihood of the element containing one or more of the faults (Liblit et al., 2005, Jones and Harrold, 2005, Abreu et al., 2007). The concept of spectrum-based

Theoretical analysis

In this section, we present mathematical proofs for the cases in which the accuracy of formulas was observed to have deteriorated, improved or been preserved by considering the perturbations. We carry out the theoretical analysis from two main aspects. First, we analyse the influence of labelling perturbations on three relations among the formulas: strict equivalence relations (Naish et al., 2011), Xie and Chen's equivalence relations (Xie et al., 2013a), and Xie and Chen's order relations (

Controlled experiments

To verify the results of the theoretical analysis and to further analyse the impacts of parameters, in this section we design a number of controlled experiments on 18 programs, 23 classes of risk evaluation formulas and 2 neural network-based techniques.

Threats to validity

The discussion regarding threats to validity focuses on internal, external and construct validity. The primary threat to the internal validity involves the correctness of our techniques, which includes the implementation of the fault localisation techniques with or without considering labelling perturbations. The implementation of these techniques was manually evaluated by applying them to small programs; the data that were collected in the experiment, however, were manually evaluated by

Related work

Spectrum-based fault localisation (SBFL) has undergone long-term development and evolution and some work was carried out over a decade ago (Jones et al., 2005). To date, many SBFL techniques have been proposed based on various granularities of program components, including predicate-based techniques (Liblit et al., 2005, Liu et al., 2006), statement-based techniques (Jones and Harrold, 2005, Abreu et al., 2007), and path-oriented techniques (Chilimbi et al., 2009). Moreover, different

Conclusions and future work

No matter how careful testers and debuggers are, how effective and efficient a test system is, and how close the test environment is to reality, we cannot avoid the mislabelling of test cases. Therefore, for a fault localisation technique, it is necessary not only to have a high accuracy in a perfect environment but also to maintain high accuracy in environments with labelling perturbations. In this paper, we theoretically analyse and experimentally study the impacts of labelling perturbations

Yanhong Xu is an M.Sc. student at the School of Automation Science and Electrical Engineering, Beihang University. She obtained her bachelor's degree in 2016 from Beihang University. Her research interest is program debugging.

References (77)

  • R. Abreu et al.

    An evaluation of similarity coefficients for software fault localization

  • R. Abreu et al.

    On the accuracy of spectrum-based fault localization

  • M.R. Anderberg

    Cluster analysis for applications

    NY Publication: Probability and Mathematical Statistics

    (1973)
  • A. Bandyopadhyay et al.

    Tester feedback driven fault localization

  • J. Campos et al.

    Entropy-based test generation for improved fault localization

  • N.V. Chawla et al.

    SMOTE: synthetic minority over-sampling technique

    J. Artificial Intelligence Res.

    (2002)
  • M.Y. Chen et al.

    Pinpoint: problem determination in large, dynamic internet services

  • T.M. Chilimbi et al.

    HOLMES: effective statistical debugging via efficient path profiling

  • J. Cohen

    A coefficient of agreement for nominal scales

    Educ. Psychol. Measur.

    (1960)
  • V. Debroy et al.

    Insights on fault interference for programs with multiple bugs

  • E. Derouin et al.

    Neural network training on unequally represented classes

    Intell. Eng. Syst. Through Artif. Neural Netw.

    (2012)
  • L.R. Dice

    Measures of the amount of ecologic association between species

    Ecology

    (1945)
  • H. Do et al.

    Supporting controlled experimentation with testing techniques: an infrastructure and its potential impact

    Emp. Softw. Eng.

    (2005)
  • J.M. Duarte et al.

    Comparison of similarity coefficients based on RAPD markers in the common bean

    Gen. Mol. Biol.

    (1999)
  • B.S. Everitt

    Graphical Techniques for Multivariate Data

    (1978)
  • J.L. Fleiss

    Estimating the accuracy of dichotomous judgments

    Psychometrika

    (1965)
  • C. Gong et al.

    Effects of class imbalance in test suites: an empirical study of spectrum-based fault localization

  • A. Gonzalez

    Automatic error detection techniques based on dynamic invariants

    (2007)
  • L.A. Goodman et al.

    Measures of association for cross classifications

    J. Am. Statist. Assoc.

    (1954)
  • R. Gore et al.

    Reducing confounding bias in predicate-level statistical debugging metrics

  • M. Harman et al.

    A Comprehensive Survey of Trends in Oracles for Software Testing

    (2013)
  • H. He et al.

    Learning from imbalanced data

    IEEE Trans. Knowl. Data Eng.

    (2008)
  • J.A. Jones et al.

    Empirical evaluation of the tarantula automatic fault-localization technique

  • J.A. Jones et al.

    Visualization of test information to assist fault localization

  • R. Just et al.

    Defects4J: A database of existing faults to enable controlled testing studies for Java programs

  • E.F. Krause

    Taxicab geometry

    Math. Teacher

    (1973)
  • T.D.B. Le et al.

    Theory and practice, do they match? A case with spectrum-based fault localization

  • H.J. Lee et al.

    Study of the relationship of bug consistency with respect to performance of spectra metrics

    Beibei Yin is currently a lecturer at Beihang University of China. She received the Ph.D. degree from Beihang University, China, in 2010. She was a research scholar in the Department of Electrical and Computer Engineering at Duke University in 2016. Her research interests include software testing, software reliability, and software cybernetics. She has published research results in venues such as TSE, T-Rel, Inf. Sci. and ISSRE.

    Zheng Zheng is a Professor at Beihang University of China. He received his Ph.D. degree in computer software and theory from the Chinese Academy of Sciences. In 2014 he was with the Department of Electrical and Computer Engineering at Duke University, working as a research scholar. His research interests include software fault localization and software dependability modeling. He has published research results in venues such as TDSC, TSC, T-Rel, JSS, COR and ISSRE.

    Xiaoyi Zhang is a Ph.D. candidate at the School of Automation Science and Electrical Engineering, Beihang University. He obtained his bachelor's degree from Beihang University in 2011. His research interests include program debugging and program repair.

    Chenglong Li is a Ph.D. candidate at the School of Automation Science and Electrical Engineering, Beihang University. He obtained his bachelor's degree in 2018 from Beihang University. His research interest is program debugging.

    Shunkun Yang is an Associate Professor at Beihang University of China. He received his Ph.D. degree from Beihang University. His research interests include software testing and software reliability.

    This work was supported by the National Natural Science Foundation of China (Grant Nos. 61772055, 61872169), the Equipment Preliminary R&D Project of China (No. 41402020102), and the Technical Foundation Project of the Ministry of Industry and Information Technology of China (JSZL2016601B003).
