Representing the behaviour of supervised classification learning algorithms by Bayesian networks

https://doi.org/10.1016/S0167-8655(99)00095-1

Abstract

In this paper, an approach to study the nature of the classification models induced by Machine Learning algorithms is proposed. Instead of the predictive accuracy, the values of the predicted class labels are used to characterize the classification models. Bayesian networks are then induced over these predicted class labels. Using these Bayesian networks, several assertions about the nature of the induced classification models are extracted.

Introduction

The objective of a supervised classification learning algorithm is to induce a general rule that allows us to classify new examples E*={en+1,…,en+m} characterized only by their p descriptive variables. To generate this general rule, we have a set of n samples E={e1,…,en}, each characterized by p descriptive variables X={X1,…,Xp} and by the class labels C={w1,…,wn} to which they belong. The general rule (or classifier) can be seen as a classification hypothesis (or model) induced by the learning algorithm.
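As an illustration of this setting (not taken from the paper, which predates the tools used here), a minimal sketch follows; scikit-learn's decision tree inducer merely stands in for a generic learning algorithm, and the synthetic data and names are hypothetical:

# Minimal sketch of the supervised classification setting described above.
# scikit-learn is used only as a convenient stand-in for a generic inducer.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# n labelled samples E, each described by p descriptive variables, with their class labels.
n, m, p = 100, 30, 4
E = rng.normal(size=(n, p))
C = (E[:, 0] + E[:, 1] > 0).astype(int)   # class label of each sample

# m new examples E*, characterized only by their p descriptive variables.
E_star = rng.normal(size=(m, p))

# The inducer builds a general rule (classification hypothesis) from (E, C)...
hypothesis = DecisionTreeClassifier().fit(E, C)

# ...which is then used to assign class labels to the new, unlabelled examples.
predicted_labels = hypothesis.predict(E_star)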

This problem has been studied by the statistics community (Duda and Hart, 1973) under the term Pattern Recognition. In the Machine Learning literature, many representations for inducing classification hypotheses have been suggested (including decision trees, rule induction, Naive Bayes or k-NN), each assuming that the target function belongs to some restricted space of hypotheses.

To form a hypothesis, an algorithm makes assumptions, which are called biases in Machine Learning. Apart from the data, these biases are the main factors that shape the hypothesis a learning algorithm induces. A question of interest for researchers in Machine Learning is how to characterize the biases of existing algorithms and how to determine, from background knowledge, when a given bias is appropriate. Biases can be divided into two types (Kohavi, 1995a):

  • Restricted hypothesis space bias. This bias assumes that the model belongs to some restricted space of hypotheses, typically defined in terms of its representation. For example, most decision tree algorithms restrict the hypothesis space to the space of finite trees with univariate splits at their nodes, assuming that the classes are separated by boundaries parallel to the coordinate axes (a minimal code sketch of such an axis-parallel hypothesis follows this list).

  • Preference bias. This bias places a preference ordering on hypotheses. Often, the preference ordering is defined by how the search through the space of hypotheses is conducted. Most preference biases attempt to minimize some measure of syntactic complexity, following Occam's Razor principle of preferring simpler hypotheses.
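To make the restricted hypothesis space bias concrete, the sketch below (illustrative only; scikit-learn and the synthetic data are assumptions, not part of the original work) induces a depth-one decision tree on data whose true class boundary is diagonal; the induced hypothesis is nevertheless forced to be a single univariate, axis-parallel split:

# Illustration: the hypothesis space of a depth-1 decision tree is restricted to
# single univariate splits of the form "x_j <= t", i.e. boundaries parallel to a
# coordinate axis, even when the true class boundary is diagonal.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # true boundary: x_0 + x_1 = 0 (diagonal)

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(export_text(stump))   # prints a single axis-parallel split, e.g. "feature_0 <= ..."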

Given the large number of available algorithms, the user is frequently faced with the problem of selecting the ideal algorithm for a specific dataset, that is, the learning algorithm whose biases are best suited to the data. The ambitious objective of constructing and selecting a unique winner algorithm for all datasets has been ruled out by the empirical evidence behind the `No Free Lunch' theorem (Kohavi et al., 1997). The selection of the `best' algorithm, usually based only on the error percentage, has focused the attention of the Machine Learning community mainly on predictive accuracy. A never-ending race can be sensed in Machine Learning, aimed at constructing the algorithm with the highest predictive accuracy for each dataset. In this vein, Clark (1998) pointed out to us the obsession of the Machine Learning community with summary statistics.

Since our aim is to study the nature of the hypotheses induced by learning algorithms, the error percentage, being a measure of the accuracy of the generated model, cannot help us. Instead of the predictive accuracy, the class predictions will be used as the external expression of the hypothesis induced by each learning algorithm. The probability distribution of the class predictions of a set of learning algorithms over a dataset will be studied jointly by means of a Bayesian network, which displays the joint behaviour of the hypotheses induced by these algorithms.
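A sketch of how such a joint sample of class predictions could be assembled is given below; the classifiers, libraries and data are illustrative stand-ins (the paper used its own set of fourteen inducers and its own tools), and only the idea of one prediction column per algorithm, plus the true class, is taken from the text:

# Illustrative sketch: collect the class predictions of several inducers over a
# common test set into one multivariate sample. Each column behaves as a discrete
# random variable, so a Bayesian network can afterwards be induced over them.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# 2/3 : 1/3 train/test proportion, as in the datasets used in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1 / 3, random_state=0)

inducers = {
    "tree": DecisionTreeClassifier(random_state=0),
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(),
}

# One column of predicted labels per algorithm, plus the true class.
prediction_sample = {name: clf.fit(X_tr, y_tr).predict(X_te)
                     for name, clf in inducers.items()}
prediction_sample["true_class"] = y_te

# 'prediction_sample' is the kind of dataset over which a Bayesian network
# structure learner would be run to capture the joint behaviour of the hypotheses.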

We will use the qualitative part of a kind of probabilistic graphical model known as a Bayesian network to represent the joint behaviour of the learning algorithms. Using the semantics of Bayesian networks, and guided by the definition of conditional independence, regularities are found about the classification hypotheses induced by common Machine Learning inducers over a set of medical datasets. The method proposed in this paper does not compare or study algorithms from the point of view of classification accuracy. The nature of the hypotheses induced by a set of algorithms is our main interest; for this purpose, class predictions are used.

The work is organized as follows. Section 2 introduces Bayesian networks, based on the conditional independence concept. Various approaches for inducing Bayesian networks are also reviewed. Section 3 presents the datasets and Machine Learning algorithms used and the methodology chosen for inducing the Bayesian networks. The concepts proposed to extract conclusions from the Bayesian networks about the joint behaviour of the algorithms also appear in this section. Section 4 shows the results obtained in the tested domains and their interpretations. A resumé and future work appear in Section 5.

Section snippets

Bayesian networks

Bayesian networks (BNs) (Pearl, 1988) constitute a probabilistic framework for reasoning under uncertainty. From an informal perspective, BNs are directed acyclic graphs (DAGs) where the nodes are random variables and the arcs specify the independence assumptions that must be held between the random variables. BNs are based upon the concept of conditional independence among variables. This concept makes a factorization of the probability distribution of the n-dimensional random variable (Z1,…,Zn
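Although the snippet above is cut off on this page, the factorization that a Bayesian network encodes is standard and can be stated for reference (in generic notation, not necessarily that of the paper): the joint probability distribution of (Z1,…,Zn) factorizes as

P(Z1,…,Zn) = P(Z1 | pa(Z1)) · P(Z2 | pa(Z2)) ⋯ P(Zn | pa(Zn)),

where pa(Zi) denotes the set of parents of Zi in the DAG. Two variables X and Y are conditionally independent given Z when P(X | Y, Z) = P(X | Z) for every configuration with P(Y, Z) > 0; it is this notion that the presence or absence of arcs in the network encodes.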

Datasets used

Eleven medical databases from the UCI Machine Learning Repository (Murphy and Aha, 1994) are selected. By selecting all the datasets from a single domain, we hope to obtain more homogeneous conclusions. Each database has separate training and testing sets in a 2/3:1/3 proportion. The characteristics of the databases are given in Table 1.

Classifiers

Fourteen well-known learning algorithms with different biases are used in the experiments. The most relevant biases of each algorithm are cited below:

  • ID3

Results from induced Bayesian networks

Assertions on different types of behaviour of the studied algorithms are extracted, based on the number of domains for which a learning algorithm or a set of algorithms presents one of the explained conditional independence variants in the induced Bayesian network structures. It must be noted that the extracted assertions are
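As an illustration of the kind of independence statement being counted, the sketch below applies a generic chi-squared conditional independence test to three columns of predicted labels. This test is an assumption made purely for illustration; the paper itself reads such statements directly off the induced Bayesian network structure rather than testing them this way.

# Illustrative sketch: test whether algorithm A's predictions are independent of
# algorithm B's predictions given algorithm C's predictions, by pooling a
# chi-squared statistic over the strata of C. This generic test is NOT the
# paper's procedure, which reads independencies from the induced network.
import numpy as np
import pandas as pd
from scipy.stats import chi2, chi2_contingency

def conditional_independence_p_value(a, b, c):
    stat, dof = 0.0, 0
    for value in np.unique(c):
        mask = c == value
        table = pd.crosstab(a[mask], b[mask]).to_numpy()
        if table.shape[0] > 1 and table.shape[1] > 1:
            s, _, d, _ = chi2_contingency(table)
            stat += s
            dof += d
    return chi2.sf(stat, dof) if dof > 0 else 1.0

# Example usage with the prediction columns assembled earlier:
# p = conditional_independence_p_value(prediction_sample["tree"],
#                                      prediction_sample["knn"],
#                                      prediction_sample["naive_bayes"])
# A large p-value is consistent with the two algorithms behaving independently
# once the third algorithm's predictions are known.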

Resumé and future work

From a homogeneous set of databases, we have carried out a study of the joint behaviour of the predictions made by a set of Machine Learning algorithms. Bayesian networks, induced from the learning algorithms' class predictions, were used to study the behaviour of a set of well-known algorithms. From the obtained Bayesian networks, guided by the conditional independence concept, relations between the probability distributions of the hypotheses formed by different algorithms were found. Three

Discussion

Brailovsky: The results that you presented are very interesting. Before you can ascribe certain properties to an algorithm you need to check a lot of things, for example the independence, for a given problem, of the training and test sample set. Have you done this?

Inza: You are right! I mentioned that we only extracted assertions or guidelines on different types of behavior. It is true that for each training- and test-set and different proportions the results are different. For that reason we

Acknowledgements

This work was supported by the grant PI 96/12 from the Gobierno Vasco – Departamento de Educación, Universidades e Investigación.

References (34)

  • Aha, D., et al., 1991. Instance-based learning algorithms. Machine Learning.
  • Andersen, S.K., Olesen, K.G., Jensen, F.V., Jensen, F., 1989. HUGIN – a shell for building Bayesian belief universes...
  • Auer, P., Holte, R., Maass, W., 1995. Theory and applications of agnostic PAC-learning with small decision trees. In:...
  • Bouckaert, R.R., 1995. Bayesian belief networks: from construction to inference. Ph.D. Thesis, Department of Computer...
  • Cestnik, B., 1990. Estimating probabilities: a crucial task in machine learning. In: Proceedings of the European...
  • Clark, P., 1998. Personal...
  • Clark, P., et al., 1989. The CN2 induction algorithm. Machine Learning.
  • Cohen, W.W., 1995. Fast effective rule induction. In: Machine Learning, Proceedings of the 12th International...
  • Cooper, G.F., et al., 1992. A Bayesian method for the induction of probabilistic networks from data. Machine Learning.
  • Cost, S., et al., 1993. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning.
  • Dawid, A.P., 1979. Conditional independence in statistical theory. Journal of the Royal Statistical Society, Series B.
  • Duda, R., et al., 1973. Pattern Classification and Scene Analysis.
  • Etxeberria, R., et al., 1997. Analysis of the behaviour of genetic algorithms when learning Bayesian networks structure from data. Pattern Recognition Letters.
  • Heckerman, D., 1995. A tutorial on learning with Bayesian networks. Technical Report,...
  • Heckerman, D., et al., 1995. Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning.
  • Herskovits, E., Cooper, G., 1990. Kutató – an entropy-driven system for construction of probabilistic expert systems...
  • Holte, R.C., 1993. Very simple classification rules perform well on most commonly used databases. Machine Learning.