On the estimation of adequate test set size using fault failure rates
Introduction
Fundamental to the software testing process is the use of good test cases to detect faults. The number of such test cases (i.e., the size of the test set) used to test software has a significant impact on the quality of testing. If a test set consists of too few test cases, it may not detect faults effectively. If it contains too many, testing becomes more expensive and some test cases may be redundant. Test set size is therefore one of the primary determinants of the overall cost and effectiveness of the software testing process. Selecting an appropriate test set size is consequently an important concern for testers and managers who want to improve the effectiveness of software testing while reducing its cost.
Test sets may be constructed to satisfy one or more criteria, such as various forms of code coverage, based on certain aspects of the software under test. A test set is adequate for a selected criterion if it covers the program according to that criterion (Harrold, 1991; Weyuker, 1986). For the purposes of this paper, test set adequacy is discussed in terms of the test set's fault detection effectiveness (formally defined later in this paper). Recognizing that large, comprehensive sets of test cases are rarely available and are impractical to use due to their expense, several studies have explored the link between the size of a test set and its fault detection effectiveness (Rothermel and Harrold, 1998; Wong et al., 1998; Wong et al., 1999). This study, in contrast, explores the link from the opposite direction: it proposes a model to estimate the number of test cases (test set size) that must be used in order to reach a given level of fault detection effectiveness.
To build such a predictive model, we make assumptions similar to those made in random testing scenarios. In random testing, test cases are selected randomly from the entire input domain (Chen and Yu, 1996; Leung et al., 2000), the assumption being that each test case is as likely as any other to detect a fault (result in a failure). Random testing also assumes that every test case costs the same to execute. Since our empirical studies are based on subject programs with pre-existing test cases, we restrict these random testing assumptions to test cases from the available test pool rather than the entire input domain. Stated differently, for the purposes of this paper and the experiments described herein, when constructing a test set of a certain size – say x test cases – those x test cases are selected randomly (without replacement) from the entire available pool of test cases.
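Under these assumptions, constructing a size-x test set amounts to uniform sampling without replacement from the pool. A minimal sketch (the function name and pool representation are ours, not from the paper):

```python
import random

def build_test_set(pool, x, seed=None):
    """Draw x test cases uniformly at random, without replacement,
    from the available test pool (not the entire input domain)."""
    rng = random.Random(seed)
    return rng.sample(pool, x)
```

Because sampling is without replacement, the resulting set never contains duplicate test cases, matching the construction used in the experiments.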
The model is developed by making use of failure rates – informally, a measure of how hard the fault(s) in a program are to detect (formally defined subsequently in Section 2.3). Our approach uses these failure rates to make a probabilistic estimate of the expected number of faults (faulty programs) detected by a test set. Given a total of n different faults, we can reverse this process to identify the test set size (number of test cases) such that all n faults are expected to be detected, i.e., the size at which we expect 100% fault detection effectiveness. 100% fault detection effectiveness is attainable because we know exactly how many single-fault versions there are (n is known), and in our case studies we restrict ourselves to faults that result in at least one test case failure. The quality of the prediction is then evaluated by comparing the expected number of faults detected (as predicted by the model for a test set of a certain size) to the actual number of faults detected by real test sets of the same size.
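The reverse process described above can be illustrated with a small Monte Carlo sketch: given the failing test cases of each single-fault version, estimate the expected number of faults a random size-x set detects, and scan x upward until all n faults are expected to be detected. The helper names and the rounding threshold below are illustrative assumptions, not the paper's exact procedure:

```python
import random

def simulate_detected(failing_sets, m, x, trials=2000, seed=0):
    """Average number of faults detected by a random size-x test set
    drawn without replacement from a pool of m test cases.
    failing_sets[i] is the set of test-case indices that fail on fault i."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        chosen = set(rng.sample(range(m), x))
        # A fault is detected if the set contains at least one failing test case.
        total += sum(1 for fs in failing_sets if fs & chosen)
    return total / trials

def adequate_size(failing_sets, m, trials=2000):
    """Smallest x whose expected number of detections rounds to all n faults."""
    n = len(failing_sets)
    for x in range(1, m + 1):
        if simulate_detected(failing_sets, m, x, trials) >= n - 0.5:
            return x
    return m
```

Note that faults with lower failure rates (fewer failing test cases) dominate the size needed, since they are the hardest to detect by random selection.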
Thus, the contributions made by this paper may be summarized as follows:
- (1)
We posit the use of fault failure rates as useful estimators of adequate test set size with respect to fault detection effectiveness (to be detailed in Section 2.1).
- (2)
To this effect, a predictive model is presented to illustrate how the above might be achieved.
- (3)
We present empirical data to illustrate the effectiveness and predictive quality of the proposed model via the seven programs of the Siemens suite, and the space, grep, gzip, and make programs (i.e., a total of 11 programs are experimented upon).
- (4)
The interference (or lack of independence) between multiple faults present in the same program is also investigated via the notion of failure rates.
The remainder of the paper is organized in the following manner. Section 2 first describes the fundamental concepts that help facilitate better understanding of this paper and its discussions. Section 3 then presents the probabilistic model that estimates adequate test set size using the failure rates, followed by Section 4, which evaluates the model against 11 C programs. Subsequently, Section 5 identifies two relevant issues and provides more insights on the proposed model and derived data. Section 6 assesses the quality of the proposed model in the presence of incorrect failure rates, and three different perturbation schemes are presented. Section 7 then evaluates the relationship between multiple faults in the same program from the point of view of the failure rate. Section 8 presents relevant discussions and details the threats to the validity of the approach, and Section 9 is an overview of related work. Finally, we present our conclusions and ideas for future work in Section 10.
Preliminaries
In this section we present concepts that are fundamental to the understanding of the model and corresponding discussions presented in this paper.
The proposed model
Let there be a set of n faults: f1, f2, …, fn. Let there also be a test set (without loss of generality, let us assume that this is our test pool) T that contains m test cases: t1, t2, …, tm. Let each fault fi be associated with a failure rate θi such that θi = βi/m, where βi represents the number of test cases out of m that fail on fault fi. Suppose we wish to use a subset of T, say Tα, with α test cases such that α ≤ m. We define Pi(α) as the probability of detecting fault fi by a test case in Tα.
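One concrete way to instantiate the model, assuming test cases are drawn without replacement as described in the Introduction, follows from a hypergeometric argument: a size-α subset misses fault fi only if all α test cases come from the m − βi passing ones. The function names below are ours, a sketch rather than the paper's exact formulation:

```python
from math import comb

def detection_prob(m, beta_i, alpha):
    """Probability that a random size-alpha subset of the m-test pool
    detects fault i, where beta_i tests fail on that fault:
    1 - C(m - beta_i, alpha) / C(m, alpha)."""
    if alpha > m - beta_i:
        return 1.0  # the subset cannot avoid every failing test case
    return 1.0 - comb(m - beta_i, alpha) / comb(m, alpha)

def expected_faults_detected(m, betas, alpha):
    """Expected number of faults detected: sum of per-fault probabilities
    (linearity of expectation; no independence assumption needed)."""
    return sum(detection_prob(m, b, alpha) for b in betas)
```

For example, with m = 10 and βi = 5, a single test case detects the fault with probability 0.5, which agrees with θi = βi/m for α = 1.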
Case studies
The model proposed in Section 3 was evaluated via case studies conducted on a total of 11 programs: the seven programs of the Siemens suite, the space program, the grep program, the gzip program, and the make program. A discussion regarding the choice of programs and why we feel they are ideal for our case studies is presented in the Threats to Validity discussion in Section 8.5 of this paper. Additionally, detailed information on each program is provided as follows.
Further exploring the results
While the results of the case studies are indeed encouraging (the error between observed effectiveness and predicted effectiveness is generally low), they raise two intriguing issues. First, we note that the predictions (when not perfectly accurate) are sometimes higher than the observed effectiveness and sometimes lower. Are we more likely to over-predict than to under-predict, or vice versa? Second, while the data presented in Fig. 1 may not be as clear (and therefore not as
On the incorrect estimation of failure rates
It is arguable that since the proposed model is based on fault failure rates, the accuracy of the model depends in turn on how accurately the failure rates have been estimated. For the purposes of this study, failure rates are directly obtained by running the entire set of available test cases against a faulty version of a program and observing the fraction of failures. We realize that in practice it is very difficult, if not entirely impossible, to obtain failure rates with such a high degree of accuracy.
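The three perturbation schemes are presented in the remainder of this section. Purely as an illustrative stand-in, one can multiply each estimated rate by a bounded random factor and observe how the predicted effectiveness shifts; the function names and the multiplicative noise model here are our assumptions, not the paper's schemes:

```python
import random
from math import comb

def detection_prob(m, beta, alpha):
    """P(detect fault) for a random size-alpha subset of an m-test pool."""
    if alpha > m - beta:
        return 1.0
    return 1.0 - comb(m - beta, alpha) / comb(m, alpha)

def perturb_rates(thetas, rel_error, seed=0):
    """Multiply each failure rate by a random factor in
    [1 - rel_error, 1 + rel_error], clamped to (0, 1]."""
    rng = random.Random(seed)
    return [min(1.0, max(1e-9, t * rng.uniform(1 - rel_error, 1 + rel_error)))
            for t in thetas]

def predicted_effectiveness(thetas, m, alpha):
    """Fraction of the n faults the model expects a size-alpha set to detect,
    given (possibly perturbed) failure rate estimates."""
    betas = [round(t * m) for t in thetas]
    return sum(detection_prob(m, b, alpha) for b in betas) / len(thetas)
```

Comparing `predicted_effectiveness` on the true versus the perturbed rates gives a simple measure of the model's sensitivity to estimation error.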
Programs with multiple faults
Thus far, when we have used the term ‘fault’, we have referred to the faulty version of a program which contains exactly one fault. It has therefore been quite convenient to refer to the failure rate used in this paper as the ‘fault failure rate’. However, this clearly changes if there are multiple faults present in the same program. To continue to refer to the proportion of test cases that fail as the ‘fault failure rate’ would lead to ambiguity, as we do not know which fault in the program we are referring to.
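If the faults did fail independently, the failure rate of the multi-fault program would follow directly from the single-fault rates: a test case passes only if it avoids every fault. Deviations of the observed rate from this baseline signal interference between faults. A sketch of the baseline (function name ours):

```python
def independent_combined_rate(thetas):
    """Failure rate of a multi-fault program if faults failed independently:
    1 minus the probability that a test case passes on every fault."""
    p_pass = 1.0
    for t in thetas:
        p_pass *= (1.0 - t)
    return 1.0 - p_pass
```

For two faults with rates 0.5 each, the independent baseline is 0.75; an observed rate below this would suggest the faults mask each other on some test cases.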
Discussion
In this section we present discussions on some of the important aspects of the model proposed herein, address some important concerns, and discuss the threats to the validity of the approach and its derived results.
Related work
We now overview work that is related to this paper and the ideas expressed herein. Failure rates have been used extensively in the area of adaptive random testing (Cangussu et al., 2009, Chen et al., 2004, Kuo et al., 2007) which seeks to distribute test cases more evenly within the input space, as an enhanced form of random testing. As per the description in Chen et al. (2004), given a faulty program for an input domain D, let d, m and n denote the size, number of failure-causing inputs and
Conclusions and future work
This paper explores the link between adequate test set size (in terms of fault detection effectiveness) and fault failure rates. A probabilistic model is developed that uses failure rate information to predict an adequate test set size with respect to fault detection. Via case studies performed on 11 programs, the usefulness and validity of failure rates are established. The results indicate that a precise estimation of fault failure rates can lead to an accurate prediction of adequate test set size.
References (40)
- et al., 2000. Test case selection with and without replacement. Journal of Information Sciences.
- et al., 2003. Reliability prediction for component-based software architectures. Journal of Systems and Software.
- et al., 1999. Test set size minimization and fault detection effectiveness: a case study in a space application. Journal of Systems and Software.
- et al. Is mutation an appropriate tool for testing experiments?
- et al., 2004. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing.
- Budd, T.A., 1980. Mutation analysis of program test data. Ph.D. Dissertation, Yale...
- Cancellieri, A., Giorgi, A., 1994. Array preprocessor user manual. Technical Report...
- et al., 2009. A segment-based approach for the reduction of the number of test cases for performance evaluation of components. International Journal of Software Engineering and Knowledge Engineering.
- et al. Adaptive random testing.
- An upper bound on software testing effectiveness. ACM Transactions on Software Engineering and Methodology.
- On the expected number of failures detected by subdomain testing and random testing. IEEE Transactions on Software Engineering.
- Locating causes of program failures.
- Insights on fault interference for programs with multiple bugs.
- Using mutation to automatically suggest fixes for faulty programs.
- Hints on test data selection: help for the practicing programmer. IEEE Computer.
- On the use of mutation faults in empirical assessments of test case prioritization techniques. IEEE Transactions on Software Engineering.
- A methodology for controlling the size of a test suite. ACM Transactions on Software Engineering and Methodology.
- The effects of optimizing transformations on data-flow adequate test sets.
- Empirical evaluation of the Tarantula automatic fault-localization technique.
Vidroha Debroy received his BS in software engineering and his MS in computer science from the University of Texas at Dallas. He is currently a PhD student in computer science at UT-Dallas. His interests include software testing and fault localization, program debugging and automated and semi-automated ways to repair software faults. He is a student member of the IEEE and the ACM.
W. Eric Wong received his MS and PhD in computer science from Purdue University, West Lafayette, Indiana. He is currently an associate professor in the Department of Computer Science at the University of Texas at Dallas. Dr. Wong is a recipient of the Quality Assurance Special Achievement Award from Johnson Space Center, NASA (1997). Prior to joining UT-Dallas, he was with Telcordia Technologies (formerly Bell Communications Research a.k.a. Bellcore) as a Senior Research Scientist and as the project manager in charge of the initiative for Dependable Telecom Software Development. Dr. Wong's research focus is on the technology to help practitioners produce high quality software at low cost. In particular, he is doing research in the areas of software testing, debugging, safety, reliability, and metrics. He has very strong experience in applying his research results to real-life industry projects. Dr. Wong has received funding from such organizations as NSF, NASA, Avaya Research, Texas Instruments, and EDS/HP among others. He has published over 120 refereed papers in journals and conference/workshop proceedings. Dr. Wong has served as special issue guest editor for six journals and as general or program chair for many international conferences. He also serves as the Secretary of ACM SIGAPP and is on the Administrative Committee of the IEEE Reliability Society.