Applications of rule-base coverage measures to expert system evaluation

https://doi.org/10.1016/S0950-7051(99)00005-2

Abstract

Often a rule-based system is tested by checking its performance on a number of test cases with known solutions, then modifying the system until it gives the correct results for all, or a sufficiently high proportion, of the test cases. This method cannot guarantee that the rule-base has been adequately or completely covered during the testing process. We introduce an approach to testing rule-based systems that uses coverage measures to guide and evaluate the testing process. In addition, the coverage measures can be used to assist rule-base pruning and identification of class dependencies, and serve as the foundation for a set of test data selection heuristics. We also introduce a complexity metric for rule-bases.

Introduction

Evaluation of a knowledge-based system is a multi-faceted problem, with numerous approaches and techniques. The results generated by the system must be evaluated, along with its features, its usability, how easily it can be enhanced, and whether or not it has a positive impact on the people who use it in place of a non-computer-based approach. The system’s performance must also be evaluated in light of its intended use [1]. If the expert system is meant to function as an intelligent assistant, then it must satisfy the criterion of being a useful adjunct to the human problem solver. If the system is expected to emulate the reasoning of a human expert, then a more rigorous evaluation of the system is needed.

During the last 20 years there has been considerable development and use of knowledge-based systems for medical decision support. In this period there has been heavy emphasis on functional analysis, addressing two primary questions:

  • Does the system give the results we expect on test cases?

  • Does the system improve the effectiveness of those who use it?

The emphasis on functional analysis can lead to seemingly strong statistical statements about the correctness of a system, demonstrating that it gives the correct result, or the same result as a human expert, in a high percentage of test cases. However, functional testing does not guarantee that all parts of the system are actually tested. If a section of the rule-base is not exercised during the functional test then there is no information about that section of the system and whether it is correct or contains errors. Further, many performance problems for rule-bases result from unforeseen rule interactions [2]. A test suite of known cases may never trigger these interactions, though they should be identified and corrected before a system is put into actual use.

The method we present enhances functional analysis of rule-based classification systems with a rule-base coverage assessment, overcoming limitations of common methods for rule-based expert systems evaluation. The underlying premise of this work is that an ideal testing method is one that guarantees that all possible reasoning paths through a rule-base have been exercised. As with procedural software, this is often an unreasonable and/or unattainable goal, possibly due to a lack of test data, to un-executable program paths, or to the size of the rule-base. Further, even if each possible path is exercised, we cannot realistically do so with each distinct set of test values that could cause its traversal. A reasonable goal is for the rule-base testing process to exercise every inference chain or provide information about the failure of the testing process to do so.

Usually verification and validation (V & V) of rule-based systems involves a static structural analysis (verification) method to detect internal inconsistencies, followed by a dynamic, functional validation in which system behavior on a set of test cases is compared with expected results. The weakness of a strictly functional approach to validation is that the test data available may not adequately cover the rule-base, and, at best, limited information about coverage will be obtained. System performance statistics are usually presented as if they apply to the entire rule-base, rather than just to the tested sections. This can lead to false estimates of system performance in actual use. The system performance indicated by the comparison of actual and expected results is relevant only for the tested sections, while performance in the untested sections cannot be predicted.

We must also consider completeness of the test set and coverage of the rule-base by the test data. Completeness of the test set refers to the degree to which the data represents all types of cases that could be presented to the system under intended conditions of use. Coverage of the rule-base refers to how extensively possible combinations of inference relations are exercised during test data evaluation. In the trivial case, with a correct rule-base and a complete test suite, the test data would completely cover the rule-base, all actual results would agree with expected results, and we could predict completely correct performance of the rule-base in actual use. In the more usual situation we may have errors and incompleteness in the rule-base, as well as inadequacies in the test data. If we only judge the system based on a comparison of actual and expected results, the rule-base could perform well on the test data, but actually contain errors that are not identified due to incompleteness of the test data. This could lead to a false prediction of correct performance on all cases, when in fact we cannot make any accurate prediction about performance of the rule-base in those areas for which there is an absence of test data.

Our testing approach, as outlined in Fig. 1, allows clear identification of incompleteness in the test data and potential errors in the rule-base through identification of sections of the rule-base that have not been exercised during functional test. This can indicate weaknesses in the test set and/or sections of the rule-base that may not be necessary. An incomplete test set can be supplemented with additional cases chosen from the available population, guided by a series of heuristics and the coverage analysis information. Alternatively, if there is no test data which covers certain parts of the system, it is possible that those sections should be pruned from the rule-base or modified.

Our approach carries out structural analysis of the rule-base using five rule-base coverage measures (RBCMs) which identify sections not exercised by the test data. This makes it possible to improve completeness of the test suite, thereby increasing the kinds of cases on which the rule-base will be tested and improving coverage of the rule-base.
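The paper's five RBCMs are defined over inference chains in the rule-base graph and are detailed elsewhere [3]. As a minimal illustration of the general idea, the sketch below computes the simplest conceivable measure, rule-level coverage: the fraction of rules fired at least once by the test suite, plus the set of rules never exercised. All names here are hypothetical, not the paper's implementation.

```python
# Minimal sketch (our illustration): rule-level coverage over a test suite.
# Each test case yields a trace, i.e. the list of rule ids that fired.

def rule_coverage(all_rules, traces):
    """Return (fraction of rules fired at least once, set of unexercised rules)."""
    fired = set()
    for trace in traces:          # one trace per test case
        fired.update(trace)
    fired &= set(all_rules)       # ignore any unknown ids in the traces
    unexercised = set(all_rules) - fired
    return len(fired) / len(all_rules), unexercised

rules = ["r1", "r2", "r3", "r4"]
traces = [["r1", "r2"], ["r2", "r3"]]
cov, missed = rule_coverage(rules, traces)
# cov == 0.75, missed == {"r4"}
```

Any rule in the `missed` set is exactly the situation the paper warns about: the functional test says nothing about whether that rule is correct.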

In addition to the coverage analysis, we employ: a rule-base representation that facilitates application of the coverage measures; a set of heuristics for re-sampling the population of available test cases, based on coverage information, as shown in Fig. 2; strategies for rule-base pruning and identification of class dependencies; and a rule-base complexity metric. In another study [3] the utility of these techniques is illustrated extensively, using rule-bases that were prototypes for the AI/RHEUM system [4] and TRUBAC (Testing with RUle-BAse Coverage), a tool which implements the coverage analysis method.
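The re-sampling heuristics themselves are given in Fig. 2; as an illustrative sketch only (not necessarily one of the paper's heuristics), one simple coverage-guided selection rule is to greedily pick, from the pool of available cases, the case that would exercise the most currently uncovered rules:

```python
# Hedged sketch of coverage-guided re-sampling (illustrative; the paper's
# heuristics in Fig. 2 are defined over its own coverage measures).

def pick_next_case(pool, covered):
    """pool: {case_id: set of rules that case exercises}.
    Returns the case adding the most new coverage, and the rules it adds."""
    best = max(pool, key=lambda c: len(pool[c] - covered))
    gain = pool[best] - covered
    return (best, gain) if gain else (None, set())

covered = {"r1", "r2"}
pool = {"case_a": {"r1"}, "case_b": {"r2", "r3", "r4"}}
case, gain = pick_next_case(pool, covered)
# case == "case_b", gain == {"r3", "r4"}
```

Repeating this selection until no case adds coverage either completes the test suite or, when uncovered rules remain, flags sections of the rule-base as candidates for pruning.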

Section snippets

Related work

This work builds on both coverage-based testing methods for procedural software (see [5] for a review of methods and [6], [7] for a data-flow approach to testing) and earlier work on rule-base analysis. Early approaches for rule-base analysis carried out only verification or validation. A number of systems, such as the ONCOCIN rule checker program (RCP) [8], CHECK [9], [10], ESC (expert system checker) [11], and KB-Reducer [12], [13] carry out only verification. Beyond their limitation to

Testing with rule-base coverage measures

The first step in rule-base testing with coverage measures is to build a graph representation of the rule-base. Our method uses a directed acyclic graph (DAG) representation. We assume a generic propositional rule-base language [3] into which other rule-base languages can be translated. During construction of the DAG, pairwise redundant rules, pairwise simple contradictory rules and potential contradictions (ambiguities) are identified. After DAG construction is complete, static analysis
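The DAG construction itself is beyond this snippet, but the pairwise checks it performs can be sketched under one simplifying assumption of ours: each propositional rule is a pair (set of premise literals, conclusion). Two rules with identical premises and identical conclusions are pairwise redundant; identical premises with differing conclusions signal a potential contradiction (ambiguity). This is our illustration, not the paper's implementation.

```python
# Hedged sketch: pairwise redundancy and potential-contradiction checks
# for a propositional rule-base. Rule format (our assumption):
# (frozenset of premise literals, conclusion).
from collections import defaultdict

def analyze(rules):
    by_premises = defaultdict(list)
    for premises, conclusion in rules:
        by_premises[premises].append(conclusion)
    redundant, contradictory = [], []
    for premises, concs in by_premises.items():
        if len(concs) != len(set(concs)):
            redundant.append(premises)       # same premises, duplicated conclusion
        if len(set(concs)) > 1:
            contradictory.append(premises)   # same premises, conflicting conclusions
    return redundant, contradictory

rules = [
    (frozenset({"fever", "rash"}), "measles"),
    (frozenset({"fever", "rash"}), "measles"),  # redundant with the rule above
    (frozenset({"fever"}), "flu"),
    (frozenset({"fever"}), "cold"),             # potential contradiction (ambiguity)
]
red, contra = analyze(rules)
# red == [frozenset({"fever", "rash"})], contra == [frozenset({"fever"})]
```

A full implementation would also catch subsumption (one premise set contained in another) while building the DAG, which the simple exact-match grouping above does not attempt.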

Applications of coverage analysis

In addition to providing information about the testing process itself, the coverage analysis can be used to enhance testing and facilitate other kinds of rule-base analysis, as described below.

Conclusions and future work

This work shows that there are numerous uses of rule-base coverage data in the testing process. Rule-base performance evaluation can be misleading unless care is taken to identify problems with both the test data and the rule-base. Both the test data and the rule-base can be improved by using information about the extent to which the test data has covered the rule-base under test.

This work can be extended in a number of directions. Quantitative performance prediction can be computed based on

References

  • B.J. Cragun et al., A decision-table-based processor for checking completeness and consistency in rule-based expert systems, International Journal of Man-Machine Studies (1987)
  • S.M. Weiss et al., A model-based method for computer-aided medical decision-making, Artificial Intelligence (1978)
  • P. Jackson, Introduction to Expert Systems (1990)
  • R. O'Keefe et al., Expert system verification and validation: a survey and tutorial, Artificial Intelligence Review (1993)
  • V. Barr, Applications of rule-base coverage measures to expert system evaluation, PhD thesis, Rutgers University,...
  • L.C. Kingsland, The evaluation of medical expert systems: experiences with the AI/RHEUM knowledge-based consultant...
  • W.R. Adrion et al., Validation, verification, and testing of computer software, ACM Computing Surveys (1982)
  • P. Frankl, E. Weyuker, A data flow testing tool, Proceedings of IEEE Softfair II, San Francisco, December...
  • S. Rapps et al., Selecting software test data using data flow information, IEEE Transactions on Software Engineering (1985)
  • M. Suwa et al., An approach to verifying completeness and consistency in rule-based expert systems, AI Magazine (1982)
  • T.A. Nguyen, W.A. Perkins, T.J. Laffey, D. Pecora, Checking an expert systems knowledge base for consistency and...
  • T.A. Nguyen et al., Knowledge base verification, AI Magazine (1987)
  • A. Ginsberg, A new approach to checking knowledge bases for inconsistency and redundancy, Proceedings of the Third...
  • A. Ginsberg, Automatic Refinement of Expert System Knowledge Bases (1988)
  • R. Davis, Interactive transfer of expertise
