How to judge testing progress

https://doi.org/10.1016/j.infsof.2003.09.008

Abstract

It is usual to base the assessment of software testing progress on a coverage measure such as code coverage or specification coverage, or on the percentage of the input domain exercised. In this paper it is argued that these characteristics do not provide good indications of the degree to which the software has been tested. Instead we propose that the assessment of testing progress be based on the total percentage of the probability mass that corresponds to the test cases selected and run. To do this, it is necessary to collect data that profiles how the software will be used once it is operational in the field. By so doing, we are able to accurately determine how much testing has been done, and whether it has met the standards of completeness for the product under consideration.

Introduction

The world runs on software. Not just any software, but software that works correctly, efficiently, and, in many cases, continuously. This means that we need a way of convincing ourselves that the software behaves as it should, and that, in turn, implies that we need to test the software comprehensively. Therefore, we spend a lot of time and money testing software, and consequently software testing research provides an important way of optimizing the process, leading to the development of systems with enhanced dependability. We devise test case selection algorithms, develop testing tools to automate the process, and invent ways of determining when to stop testing. Some of our testing methods are based on the characteristics of code, others require that we write formal specifications. Most techniques used to test large industrial software systems have to rely on informally written, natural language specifications.

Because of software's importance, there is a large software testing literature beyond the obvious one describing new test case selection algorithms. One area of research interest focuses on ways of comparing test case selection methods either analytically or empirically. Research in this area typically compares two testing criteria in terms of some characteristic that relates to their ‘goodness’, including how difficult the criteria are to satisfy, the probability that a fault will be exposed, or how many faults are likely to be exposed. The comparison could also be based on the cost of satisfying the criteria. This sort of research is very important because it allows us to say, in a concrete way, that one testing method is better than another.

Another type of software testing research that has gained importance in recent years involves empirical studies to identify properties of files of a software system that make them particularly likely to be fault-prone. Once we are able to identify these properties, testers should be able to prioritize their effort, concentrating primarily on those files that are most likely to contain substantial numbers of faults. Knowing these characteristics should also help developers ensure that the software systems they produce do not have the undesirable traits, and consequently that they have low fault densities. There are many other areas of related research, including algorithms for performance testing, security testing, and regression testing, as well as the use of formal specifications and models to select test cases, to name a few.

Although all of these types of research are important, the issue we focus on in this paper is determining the extent to which testing has been done, measured in a precise manner that can be used for assessing large industrial software systems. Whether we consider stopping because we believe that the software has been comprehensively tested, because we have run some pre-determined number of test cases, or simply because we have used up all the allotted time, money, or ideas for what to test, eventually the time comes to decide whether it is wise and safe to stop. To make that decision, we have to determine how comprehensively the software has been tested, and unfortunately, there is very little useful guidance available in the research literature to help us make this determination.

There have been a number of papers written about so-called test data adequacy criteria, which purport to tell testers when sufficient testing has been done so that it is safe to stop. The majority of these criteria fall into at least one of two categories. The first is a predicate that reports a simple ‘yes’ or ‘no’ rather than how much testing progress has been made. The second is some sort of coverage measure; such criteria rarely, if ever, come with empirical or analytic evidence, or even a compelling argument, that they represent more than a set of necessary conditions.

For example, it is easy to argue that if there are statements in the software that have never been exercised by any test case, we have no basis for being confident that those statements are correct. However, it is also easy to devise examples in which the test suite was designed so that every statement in the software was exercised, yet faults remain in the code because some other, unselected input exercises the same statements and yields faulty output. The simple act of exercising a program statement does not guarantee that the statement behaves correctly for all possible inputs, which makes this sort of coverage a poor indication of how thoroughly the software has been tested. The same is true when adequacy is based on other code coverage measures or on the coverage of functionality units in the specification.
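To make this argument concrete, consider the following Python sketch; the function, its intended behaviour, and the test suite are hypothetical illustrations rather than material from the paper. The two tests exercise every statement and every branch, and both assertions hold, yet the implementation remains faulty for inputs the suite never selects.

    def bucket(x: int) -> str:
        """Intended behaviour: return 'small' for x < 100 and 'large' otherwise."""
        if x < 10:            # faulty guard: should be x < 100
            return "small"
        return "large"

    # The suite below achieves full statement (and branch) coverage and passes ...
    assert bucket(5) == "small"      # exercises the true branch
    assert bucket(500) == "large"    # exercises the false branch

    # ... yet the implementation is still wrong for unselected inputs:
    # bucket(50) returns "large", although the intended result is "small".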

For the reasons outlined earlier, we propose an alternative to code or specification coverage as a way of assessing testing progress. In particular, we focus on the use of the input domain to determine whether testing can be stopped. We want to be able to say that n% of the testing is complete. When n is sufficiently close to 100, then we can have confidence that the software is dependable. The required degree of closeness to 100% will, of course, depend on the nature of the system being produced and the criticality of the application.

Section snippets

Operational distributions

An operational distribution is a probability distribution describing a software system's operational use in the field; it is sometimes also referred to as an operational profile [5]. For each possible input, the probability of occurrence is determined by monitoring the system in the field. Occasionally, that is not feasible, and so a system expert may have to approximate the frequency of occurrence for some of the elements of the input domain. In that case, it might be necessary
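As an illustrative sketch only, the following Python fragment shows one way such a distribution might be estimated by counting how often each element (or class) of the input domain is observed during field monitoring; the function and the operation names are hypothetical rather than taken from the paper.

    from collections import Counter

    def estimate_operational_distribution(field_log):
        # field_log: an iterable of input-domain elements (or names of input
        # classes) observed while monitoring the deployed system in the field.
        counts = Counter(field_log)
        total = sum(counts.values())
        # Map each observed element to its estimated probability of occurrence.
        return {element: n / total for element, n in counts.items()}

    # Hypothetical field log for an order-processing system.
    log = ["place_order"] * 700 + ["query_status"] * 250 + ["cancel_order"] * 50
    profile = estimate_operational_distribution(log)
    # profile == {"place_order": 0.7, "query_status": 0.25, "cancel_order": 0.05}

When field monitoring is not feasible, the same mapping can instead be populated with a system expert's approximations of the frequencies of occurrence.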

Assessing test progress

In order to quantitatively assess software testing progress, we need a meaningful way of determining the percentage of testing that has been done, and of deciding the minimal level of testing that is acceptable for the project under test. There are many non-meaningful ways of deciding to stop testing, and, unfortunately, they are used all too often. The most frequently used criteria appear to be time, money, code coverage, and specification coverage. One could also use the
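The following minimal Python sketch illustrates the proposed measure, reusing the hypothetical profile above and assuming a project-specific threshold chosen before testing begins; the function names and the threshold value are illustrative assumptions. Progress is the total probability mass of the distinct input-domain elements exercised by the test cases run so far.

    def testing_progress(operational_distribution, tested_inputs):
        # Sum the probability mass of the distinct input-domain elements that the
        # test cases have exercised; inputs outside the distribution contribute 0.
        covered = set(tested_inputs)
        return sum(operational_distribution.get(x, 0.0) for x in covered)

    def may_stop_testing(operational_distribution, tested_inputs, threshold):
        # The threshold (for example 0.98, or higher for critical applications)
        # should be fixed before testing begins.
        return testing_progress(operational_distribution, tested_inputs) >= threshold

    # Continuing the hypothetical profile above: roughly 95% of the probability
    # mass has been covered, so a 0.98 threshold says testing is not yet complete.
    progress = testing_progress(profile, ["place_order", "query_status"])    # ~0.95
    done = may_stop_testing(profile, ["place_order", "query_status"], 0.98)  # False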

Conclusions

This paper has proposed a new way of determining the progress of software testing by first determining the operational distribution that describes how the project will be used, and then using the probability of occurrence associated with each element of the input domain. As testing proceeds, it is then possible to assess what percentage of testing is complete by summing the probabilities associated with each test case that has been run. A threshold value should have been determined before testing has begun that

References (10)

[1] A. Avritzer et al., Reliability testing of rule-based systems, IEEE Software, 1996.
[2] A. Avritzer et al., The automatic generation of load test suites and the assessment of the resulting software, IEEE Transactions on Software Engineering, 1995.
[3] A. Avritzer et al., Deriving workloads for performance testing, Software: Practice and Experience, 1996.
[4] A. Avritzer et al., Software performance testing based on workload characterization, Proceedings of the ACM Third International Workshop on Software and Performance (WOSP 2002), Rome, Italy, July 2002.
[5] J.D. Musa, Operational profiles in software reliability engineering, IEEE Software, 1993.
