On the estimation of adequate test set size using fault failure rates

https://doi.org/10.1016/j.jss.2010.07.025

Abstract

Test set size, in terms of the number of test cases, is an important consideration when testing software systems. Using too few test cases might result in poor fault detection, while using too many might be very expensive and introduce redundancy. We define the failure rate of a program as the fraction of test cases in an available test pool that result in execution failure on that program. This paper investigates the relationship between failure rates and the number of test cases required to detect the faults. Our experiments, based on faulty versions of 11 C programs, suggest that an accurate estimation of the failure rates of potential fault(s) in a program can provide a reliable estimate of adequate test set size with respect to fault detection, and should therefore be one of the factors kept in mind during test set construction. Furthermore, the model proposed herein is fairly robust to incorrect estimations of failure rates and can still provide good predictive quality. Experiments are also performed to observe the relationship between multiple faults present in the same program using the concept of a failure rate. When predicting effectiveness for a program with multiple faults, the results indicate that not knowing the number of faults in the program is not a significant concern, as the predictive quality is typically not affected adversely.

Introduction

Fundamental to the software testing process is the use of good test cases to detect faults. The number of such test cases (i.e., the size of the test set) used to test software has a significant impact on the quality of testing. If a test set consists of too few test cases, it may not detect faults effectively. On the other hand, if a test set has too many test cases, the cost of testing increases and some test cases may be redundant. Test set size is therefore one of the primary determinants of the overall cost and effectiveness of the software testing process. Consequently, selecting an appropriate test set size is an important concern for testers and managers who want to improve the effectiveness of software testing while reducing its cost.

Test sets may be constructed to satisfy one or more criteria, such as various forms of code coverage, based on certain aspects of the software under test. A test set is adequate for a selected criterion if it covers the program according to that criterion (Harrold, 1991, Weyuker, 1986). For the purposes of this paper, test set adequacy is discussed in terms of the test set's fault detection effectiveness (formally defined later in this paper). Recognizing that large, comprehensive sets of test cases are rarely available and are often impractical to use because of their expense, several studies have explored the link between the size of a test set and its corresponding fault detection effectiveness (Rothermel and Harrold, 1998, Wong et al., 1998, Wong et al., 1999). In contrast, this study explores the link from the opposite direction: it proposes a model to estimate the number of test cases (test set size) that must be used in order to reach a certain level of fault detection effectiveness.

In order to build such a predictive model, this paper makes assumptions similar to those made in random testing scenarios. In random testing, test cases are selected randomly from the entire input domain (Chen and Yu, 1996, Leung et al., 2000), under the assumption that each test case may be as good (or bad) as the next in terms of its ability to detect a fault (result in failure). Random testing also assumes that the cost of executing one test case is the same as that of executing another. Since our empirical studies are based on subject programs with pre-existing test cases, we restrict these random testing assumptions to test cases from the available test pool rather than the entire input domain. Stated differently, for the purposes of this paper and the experiments described herein, when constructing a test set of a certain size, say x test cases, those x test cases are selected randomly (without replacement) from the entire available pool of test cases.
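As a minimal sketch of this construction (written in Python; the pool contents and names below are illustrative, not taken from the subject programs), a test set of size x can be drawn without replacement from the available pool as follows:

```python
import random

def draw_test_set(test_pool, x, seed=None):
    """Select x test cases uniformly at random, without replacement,
    from the available pool of test cases."""
    rng = random.Random(seed)
    return rng.sample(test_pool, x)

# Illustrative usage: draw 5 of the 20 test cases in a small pool.
pool = ["t%d" % i for i in range(1, 21)]
print(draw_test_set(pool, 5, seed=42))
```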

The model is developed using failure rates, informally interpreted as the hardness of detection of the fault(s) in a program (and formally defined in Section 2.3). Our approach uses these failure rates to make a probabilistic estimation of the expected number of faults (faulty programs1) detected by a test set. Given a total of n different faults, we can reverse this process to identify the test set size (number of test cases) at which all n faults are expected to be detected, i.e., at which we expect 100% fault detection effectiveness. Reaching 100% fault detection effectiveness is possible because we know exactly how many single-fault versions there are (n is known), and in our case studies we restrict ourselves to faults that result in at least one test case failure. The quality of the prediction is then evaluated by comparing the expected number of faults detected (as predicted by the model for a test set of a certain size) to the actual number of faults detected by real test sets of the same size.
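In symbols, the estimation step can be read as follows (a sketch under the stated assumptions; the detection probability $p_i^F(\alpha)$ and the model itself are defined formally in Section 3). By linearity of expectation, for a randomly drawn test set of size $\alpha$ the expected number of the $n$ faults detected is

$$\mathbb{E}[D(\alpha)] \;=\; \sum_{i=1}^{n} p_i^{F}(\alpha),$$

and the reversal amounts to searching for the smallest $\alpha$ at which this expectation comes as close to $n$ as the desired precision requires.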

Thus, the contributions made by this paper may be summarized as follows:

  • (1) We posit the use of fault failure rates as useful estimators of adequate test set size with respect to fault detection effectiveness (to be detailed in Section 2.1).

  • (2) To this effect, a predictive model is presented to illustrate how the above might be achieved.

  • (3) We present empirical data to illustrate the effectiveness and predictive quality of the proposed model via the seven programs of the Siemens suite, and the space, grep, gzip, and make programs (i.e., a total of 11 programs are experimented upon).

  • (4) The interference (or lack of independence) between multiple faults present in the same program is also investigated via the notion of failure rates.

The remainder of the paper is organized as follows. Section 2 describes the fundamental concepts needed to understand this paper and its discussions. Section 3 presents the probabilistic model that estimates adequate test set size using failure rates, and Section 4 evaluates the model against 11 C programs. Section 5 then identifies two relevant issues and provides further insights on the proposed model and the derived data. Section 6 assesses the quality of the proposed model in the presence of incorrect failure rates, for which three different perturbation schemes are presented. Section 7 evaluates the relationship between multiple faults in the same program from the point of view of the failure rate. Section 8 presents relevant discussions and details the threats to the validity of the approach, and Section 9 gives an overview of related work. Finally, we present our conclusions and ideas for future work in Section 10.

Section snippets

Preliminaries

In this section we present concepts that are fundamental to the understanding of the model and corresponding discussions presented in this paper.

The proposed model

Let there be a set of n faults: f1, f2, …, fn. Let there also be a test set (without loss of generality, let us assume that this is our test pool) T that contains m test cases: t1, t2, …, tm. Let each fault fi be associated with a failure rate θi such that

θi = βi / m

where βi represents the number of test cases out of m that fail on fault fi. Suppose we wish to use a subset of T, say Tα, with α test cases such that α ≤ m. We define p_i^F(α) as the probability of detecting fault fi by a test case in Tα
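As a concrete illustration only (the paper's own derivation continues in Section 3 and may differ in its details), if the α test cases are drawn without replacement as described in the Introduction, then p_i^F(α) takes a hypergeometric form, and the expected number of detected faults and a candidate adequate size can be computed as in the following Python sketch. The 0.99 target, the function names, and the example numbers are illustrative assumptions, not values from the paper.

```python
from math import comb

def p_detect(alpha, beta_i, m):
    """Probability that a randomly chosen size-alpha subset of the m-test
    pool contains at least one of the beta_i test cases that fail on
    fault f_i (hypergeometric form; assumes selection without replacement)."""
    if beta_i == 0:
        return 0.0
    return 1.0 - comb(m - beta_i, alpha) / comb(m, alpha)

def expected_detected(alpha, betas, m):
    """Expected number of faults detected by a size-alpha test set,
    by linearity of expectation over the individual faults."""
    return sum(p_detect(alpha, b, m) for b in betas)

def adequate_size(betas, m, target=0.99):
    """Smallest alpha whose expected detections reach target * n.
    The 0.99 target is an illustrative threshold, not a value from the paper."""
    n = len(betas)
    for alpha in range(1, m + 1):
        if expected_detected(alpha, betas, m) >= target * n:
            return alpha
    return m

# Illustrative example: a pool of m = 1000 test cases and three faults with
# failure rates 0.05, 0.01, and 0.002 (i.e., beta values 50, 10, and 2).
print(adequate_size([50, 10, 2], 1000))
```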

Case studies

The model proposed in Section 3 was evaluated via case studies conducted on a total of 11 programs: the seven programs of the Siemens suite, the space program, the grep program, the gzip program, and the make program. A discussion regarding the choice of programs and why we feel they are ideal for our case studies is presented in the Threats to Validity discussion in Section 8.5 of this paper. Additionally, detailed information on each program is provided as follows.

Further exploring the results

While the results of the case studies are indeed encouraging (as the error between observed effectiveness and predicted effectiveness is generally low), they raise two intriguing issues. First, we note that the predictions (when not perfectly accurate) are at times higher than the observed effectiveness and at times lower. Are we therefore more likely to over-predict than under-predict, or vice versa? Second, while the data presented in Fig. 1 may not be as clear (and therefore not as

On the incorrect estimation of failure rates

It is arguable that, since the proposed model is based on fault failure rates, the accuracy of the model depends in turn on how accurately the failure rates have been estimated. For the purposes of this study, failure rates are directly obtained by running the entire set of available test cases against a faulty version of a program and observing the fraction of failures. We realize that in practice it is very difficult, if not impossible, to obtain failure rates with such a high degree of
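To make the sensitivity question concrete, the sketch below shows one generic way to inject a bounded relative error into each estimated failure rate before it is fed to the model. This is an illustrative scheme written for this summary, not necessarily one of the three perturbation schemes presented in Section 6.

```python
import random

def perturb_rates(rates, rel_error=0.2, seed=None):
    """Scale each failure rate by a random factor drawn uniformly from
    [1 - rel_error, 1 + rel_error], clamping the result to (0, 1].
    Illustrative only; Section 6 of the paper defines its own schemes."""
    rng = random.Random(seed)
    return [min(1.0, max(1e-9, r * rng.uniform(1.0 - rel_error, 1.0 + rel_error)))
            for r in rates]

# Illustrative usage: perturb three estimated failure rates by up to +/- 20%.
print(perturb_rates([0.05, 0.01, 0.002], seed=7))
```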

Programs with multiple faults

Thus far, when we have used the term ‘fault’, we have referred to a faulty version of a program that contains exactly one fault. It has therefore been quite convenient to refer to the failure rate used in this paper as the ‘fault failure rate’. However, this clearly changes if there are multiple faults present in the same program. To continue to refer to the proportion of test cases that fail as the ‘fault failure rate’ would lead to ambiguity, as we do not know which fault in the program we
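For exposition only (the formula below is an assumption made here for illustration, not a definition taken from the paper), a natural reference point is the failure rate a multi-fault program would exhibit if each of its faults caused a given test case to fail independently; Section 7 studies how observed multi-fault behavior relates to expectations of this kind.

```python
def independent_combined_rate(thetas):
    """Failure rate of a multi-fault program under the illustrative
    assumption that each fault causes a given test case to fail
    independently: 1 - product(1 - theta_i) over the program's faults."""
    surviving = 1.0
    for theta in thetas:
        surviving *= (1.0 - theta)
    return 1.0 - surviving

# Example: two faults with individual failure rates 0.05 and 0.01.
print(independent_combined_rate([0.05, 0.01]))  # ~0.0595
```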

Discussion

In this section we discuss some important aspects of the proposed model, address key concerns, and examine the threats to the validity of the approach and its derived results.

Related work

We now overview work that is related to this paper and the ideas expressed herein. Failure rates have been used extensively in the area of adaptive random testing (Cangussu et al., 2009, Chen et al., 2004, Kuo et al., 2007) which seeks to distribute test cases more evenly within the input space, as an enhanced form of random testing. As per the description in Chen et al. (2004), given a faulty program for an input domain D, let d, m and n denote the size, number of failure-causing inputs and

Conclusions and future work

This paper explores the link between adequate test set size (in terms of fault detection effectiveness) and fault failure rates. A probabilistic model is developed that makes use of failure rate information to predict an adequate test set size with respect to fault detection. Via case studies performed on 11 C programs, the usefulness and validity of failure rates is established. The results indicate that a precise estimation of fault failure rates can

References (40)

  • T.Y. Chen et al., An upper bound on software testing effectiveness, ACM Transactions on Software Engineering and Methodology (2008)

  • T.Y. Chen et al., On the expected number of failures detected by subdomain testing and random testing, IEEE Transactions on Software Engineering (1996)

  • H. Cleve et al., Locating causes of program failures

  • V. Debroy et al., Insights on fault interference for programs with multiple bugs

  • V. Debroy et al., Using mutation to automatically suggest fixes for faulty programs

  • R.A. DeMillo et al., Hints on test data selection: help for the practicing programmer, IEEE Computer (1978)

  • H. Do et al., On the use of mutation faults in empirical assessments of test case prioritization techniques, IEEE Transactions on Software Engineering (2006)

  • M.J. Harrold et al., A methodology for controlling the size of a test suite, ACM Transactions on Software Engineering and Methodology (1993)

  • M.J. Harrold, The effects of optimizing transformations on data-flow adequate test sets

  • J.A. Jones et al., Empirical evaluation of the Tarantula automatic fault-localization technique

Vidroha Debroy received his BS in software engineering and his MS in computer science from the University of Texas at Dallas. He is currently a PhD student in computer science at UT-Dallas. His interests include software testing and fault localization, program debugging and automated and semi-automated ways to repair software faults. He is a student member of the IEEE and the ACM.

W. Eric Wong received his MS and PhD in computer science from Purdue University, West Lafayette, Indiana. He is currently an associate professor in the Department of Computer Science at the University of Texas at Dallas. Dr. Wong is a recipient of the Quality Assurance Special Achievement Award from Johnson Space Center, NASA (1997). Prior to joining UT-Dallas, he was with Telcordia Technologies (formerly Bell Communications Research a.k.a. Bellcore) as a Senior Research Scientist and as the project manager in charge of the initiative for Dependable Telecom Software Development. Dr. Wong's research focus is on the technology to help practitioners produce high quality software at low cost. In particular, he is doing research in the areas of software testing, debugging, safety, reliability, and metrics. He has very strong experience in applying his research results to real-life industry projects. Dr. Wong has received funding from such organizations as NSF, NASA, Avaya Research, Texas Instruments, and EDS/HP among others. He has published over 120 refereed papers in journals and conference/workshop proceedings. Dr. Wong has served as special issue guest editor for six journals and as general or program chair for many international conferences. He also serves as the Secretary of ACM SIGAPP and is on the Administrative Committee of the IEEE Reliability Society.
