Examining random and designed tests to detect code mistakes in scientific software

https://doi.org/10.1016/j.jocs.2010.12.002

Abstract

Successfully testing computational software to detect code mistakes is affected by multiple factors. One factor is the tolerance accepted in test output. Other factors are the nature of the code mistake, the characteristics of the code structure, and the choice of test input. We have found that randomly generated test input is a viable approach to testing for code mistakes and that simple structural metrics have little power to predict the type of testing required. We provide further evidence that reducing the tolerance in expected test output has a much larger impact on discovering code mistakes than running many more tests.

Research highlights

  • From our research and that of others, the practice of testing computational software almost always focuses on validation testing. The assumption is that code mistakes will be obvious when the entire software model is executed. There have been glaring examples of code mistakes going undetected and causing grave problems. Yet, scientists seldom test specifically for code mistakes. Part of the reason may be that the software engineering community has not offered testing approaches specifically aimed at computational software and its particular set of problems.

  • The work described in this paper is part of a research program aimed at providing scientists and engineers with advice on improving their software. Specifically, this work looks at the effectiveness of random testing and the conditions under which it is useful. The goal of this testing is finding code mistakes. We have found that randomly generated test data can be a good “first” approach to testing, followed by designed tests targeting boundary conditions related both to the physical phenomena being modeled and to the structure of condition statements. As part of our general program, we are advising scientists to test for code mistakes first, before doing verification or validation testing. Our work is helping fill the gap both in testing activities for computational software and in software engineering research.

Introduction

In scientific software, any number of sources [10] can contribute to error in output. Typical contributors are round-off error from finite representations in the hardware, truncation error from computational techniques, measurement error and approximations in empirically based models and input data, and simplifications to realise the computational solution. All these contributors can readily mask an unintended source of error: a mistake in the source code. We define a mistake in the source code (code mistake) as any divergence from what was intended.

The most common software assessment activity is testing. Scientists routinely test their software to validate their scientific models (e.g. [3], [23]). The goal is to ensure the science is correct. As well, there is considerable work on code verification, where the goal is to ensure the solution technique is correct (e.g. [15], [22]). However, as far as we are aware, there is no work on identifying techniques for testing scientific software specifically for code mistakes. This is despite such examples as [8], where nine different companies wrote software based on the same algorithm and produced nine significantly different answers. Most of the discrepancies came from code mistakes, particularly what are called “one-off” mistakes. In another example, the discovery of a code mistake in analysis software for molecular structures caused the retraction of five journal papers [18]. In our own research into industrial software, we have come across code mistakes that have reduced the ability of analysis software to reliably identify artifacts in images [12]. In other examples, researchers incorrectly altered their scientific models to compensate for an undetected code mistake [5], [13].
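
For illustration (our own toy example, not drawn from [8]), a “one-off” mistake can be as simple as a MATLAB loop bound that stops one iteration short:

    % Trapezoidal integration of y over x; the loop should run to numel(x)-1.
    x = linspace(0, pi, 101);
    y = sin(x);
    h = x(2) - x(1);
    s = 0;
    for i = 1:numel(x)-2        % one-off mistake: upper bound should be numel(x)-1
        s = s + 0.5*h*(y(i) + y(i+1));
    end

The correct value is close to 2; the mistaken loop changes the result by only a fraction of a percent, which is exactly why such mistakes are easy to miss.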

In the work described in this paper, we confine ourselves to code mistakes and finding them by testing.

We suggest that searching for code mistakes should precede both validation testing and verification testing as commonly carried out by scientists. Preferably, the search should be carried out at the module level, where executions are shorter and less code is exercised at once. This also removes some of the masking of error from sources other than the code mistake. However, even at this smaller scope of testing, deciding exactly what the “right” answer is may be difficult, given the nature of most computational software.

Successfully testing computational application software is confounded by what we call the “oracle assumption”. Weyuker [24] describes an oracle as the mechanism used to determine the correctness of a program for given test data. Weyuker further explains that the “… belief that the tester is routinely able to determine whether or not the test output is correct is the oracle assumption.” Determining whether or not the results of a test are correct for computational software is hampered by unexpected interactions between the underlying hardware, software code, input data, scientific model, and solution technique [3]. Weyuker [24] called such software “untestable” and suggested special considerations were needed.

Many testing techniques described in the software engineering literature assume that accurate oracles are always available, even for copious numbers of tests.

Related to the oracle assumption, Hook [9] describes the tolerance problem. This deals with the fact that scientists often have an incomplete oracle (i.e., the mechanism to check test results only exists for a limited number of possible cases) as well as an approximate oracle. An approximate oracle is a mechanism that provides only approximate checks for tests, or in other words, checks with large tolerances. Work by Hook and Kelly [11] showed that reducing the tolerances in oracles could significantly improve the ability of the scientist to notice increased and unexpected error in test output values.
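
To illustrate what an approximate oracle means in practice, the following MATLAB sketch (our own illustration, not Hook's implementation; the function name and signature are hypothetical) accepts a test whenever the largest relative discrepancy between computed and reference output stays within a tolerance gamma_d:

    function pass = approx_oracle(y_test, y_ref, gamma_d)
    % APPROX_ORACLE  Sketch of an approximate oracle: a test "passes" when the
    % largest relative discrepancy between the computed output and the
    % reference output is within tolerance gamma_d.
        denom   = max(abs(y_ref), eps);            % guard against division by zero
        rel_err = max(abs(y_test - y_ref) ./ denom);
        pass    = rel_err <= gamma_d;              % a large gamma_d gives a loose check
    end

With y_ref = [1.0 2.0], y_test = [1.02 2.0], and gamma_d = 0.05, the check passes even though the first component is off by 2%: a loose tolerance can absorb the symptom of a code mistake.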

The work described in this paper extends work started by Hook (e.g. [9]). Our ultimate goal is to provide scientists with advice on testing for code mistakes in computational software. This would help fill the gap in the testing literature, where there is currently no research on testing computational codes for code mistakes. Our research towards this goal includes developing a research technique to evaluate different testing techniques. This paper largely describes this research technique and its application to a series of small MATLAB functions in order to evaluate the effectiveness of randomly generated tests versus designed tests. We use Hook's tool, MATMute [16], to generate a test-bed of code mistakes as part of our research technique.
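
To make the research technique concrete, the sketch below shows the kind of random-testing harness we have in mind, under the assumption that the original function and each MATMute-generated mutant are available as ordinary MATLAB function handles (f_orig, some_function, and some_function_m1/m2 are hypothetical placeholders, not part of MATMute's interface):

    % Minimal random-testing sketch over a pool of mutants (illustrative only).
    f_orig  = @some_function;                            % function under test (hypothetical)
    mutants = {@some_function_m1, @some_function_m2};    % mutated variants (hypothetical)
    n_tests = 100;
    gamma_m = zeros(numel(mutants), 1);   % largest relative discrepancy per mutant (the paper's gamma)

    for t = 1:n_tests
        x     = -10 + 20*rand(1, 3);                     % randomly generated test input
        y_ref = f_orig(x);
        for m = 1:numel(mutants)
            y_mut      = mutants{m}(x);
            rel_err    = max(abs(y_mut - y_ref) ./ max(abs(y_ref), eps));
            gamma_m(m) = max(gamma_m(m), rel_err);
        end
    end
    % At a given tolerance gamma_d, mutant m counts as detected if gamma_m(m) > gamma_d.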

Our work confirms Hook's results on the impact of tolerance on testing, finds no useful correlation between our testing experience and typical code structure metrics, and provides evidence that random testing is a viable approach for finding code mistakes in our generated test-bed.

In the next section we discuss the tools we used for our research. Section 3 contains our observations from using these tools. Section 4 discusses the results and concludes with a recommended approach to testing that could be the basis for further research.

Section snippets

Goals for research

We give our working definition of a successful test as one that “produces a discrepancy in the output large enough that it is noticed by the tester”, and we consider the factors that affect this success:

  1. Because of the tolerance problem, the test output must be outside the accepted tolerance level in order for the tester to notice a problem.

  2. The type of code mistake affects the amount of error observed in the output. There are some types of code mistakes that contribute no error or insignificant

Replication of Hook's detection scores

Hook's work includes an analysis of the mutant detection rate observed while varying the tolerance γd. Hook defines detection score as the number of mutants detected by a set of tests for a given γd over the total number of revealable mutants. The detection score is given as a percentage. The lower bound of the detection boundary is set by γd. For a set of tests, a mutant is included in this detection boundary if its γ is greater than γd.
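
In our notation (ours, not verbatim from Hook), let \gamma_m be the largest relative discrepancy produced by mutant m over the set of tests and N_r the total number of revealable mutants; the detection score at tolerance \gamma_d is then

    D(\gamma_d) = \frac{\left|\{\, m : \gamma_m > \gamma_d \,\}\right|}{N_r} \times 100\%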

Hook's results clearly show a significant drop in

Discussion and conclusions

As defined at the outset of this paper, successfully testing for code mistakes is affected by four factors:

  • the allowed tolerance in the expected output of the test;

  • the maximum discrepancy caused by the code mistake when tested;

  • the structure of the source code;

  • the choice of input data used for the test.

All these factors affect the ability of the tester to notice symptoms of a code mistake.

In our investigations, we used a tool and approach that allowed us to generate a pool of source code samples

Acknowledgements

This work is funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) and by the Applied Research Program (ARP) of the Canadian Forces.


References (24)

  • J.H. Andrews et al., Using mutation analysis for assessing and comparing testing coverage criteria, IEEE Transactions on Software Engineering (2006)
  • B. Beizer, Software Testing Techniques (1990)
  • N. Cote, An exploration of a testing strategy to support refactoring, Master's thesis, Royal Military College of...
  • P.F. Dubois, Object Technology for Scientific Computing: Object-oriented Numerical Software in Eiffel and C (1997)
  • R. Gray, Investigating test selection techniques for scientific software, Master's thesis, Royal Military College of...
  • D. Hamlet, When only random testing will do
  • L. Hatton et al., How accurate is scientific software?, IEEE Transactions on Software Engineering (1994)
  • D. Hook, Using code mutation to study code faults in scientific software, Master's thesis, Queen's University,...
  • D. Hook et al., Testing for trustworthiness in scientific software
  • D. Hook et al., Mutation sensitivity testing, IEEE Computing in Science and Engineering (2009)
  • D. Kelly, S. Thorsteinson, D. Hook, Scientific software testing: analysis in four dimensions, IEEE Software, preprint...

Diane Kelly is an associate professor at the Royal Military College of Canada (RMC). She has a BSc in pure mathematics from the University of Toronto and a PhD in software engineering from RMC. Her research interests are informed by her 20 years of industrial experience in the Canadian Nuclear Industry as a scientist. Her goal is to provide computational scientists with reasonable software engineering approaches to maintain and improve the quality of their software.

Lieutenant-Commander Rob Gray is a naval combat systems engineering officer with over 20 years of experience within the Canadian Navy. He received a master of software engineering from the Royal Military College of Canada in May 2010.

Yizhen Shao is a fourth-year student in computer science at the University of Ottawa. His interests are in game design, particularly game mechanics. He is working on a Microsoft Certified Technology Specialist credential.
