Representing the behaviour of supervised classification learning algorithms by Bayesian networks

https://doi.org/10.1016/S0167-8655(99)00095-1

Abstract

In this paper, an approach to study the nature of the classification models induced by Machine Learning algorithms is proposed. Instead of the predictive accuracy, the values of the predicted class labels are used to characterize the classification models. Bayesian networks are then induced over these predicted class labels. Using these Bayesian networks, several assertions about the nature of the induced classification models are extracted.

Introduction

The objective of a supervised classification learning algorithm is to induce a general rule that allows us to classify new examples E*={en+1,…,en+m} characterized only by their p descriptive variables. To generate this general rule, we have a set of n samples E={e1,…,en}, each characterized by p descriptive variables X={X1,…,Xp} and by the class labels C={w1,…,wn} to which they belong. The general rule (or classifier) can be seen as a classification hypothesis (or model) induced by the learning algorithm.
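As an illustration of this setting (not taken from the paper, which predates the tools used here), a minimal sketch follows; scikit-learn's decision tree inducer merely stands in for a generic learning algorithm, and the synthetic data and names are hypothetical:

# Minimal sketch of the supervised classification setting described above.
# scikit-learn is used only as a convenient stand-in for a generic inducer.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# n labelled samples E, each described by p descriptive variables, with their class labels.
n, m, p = 100, 30, 4
E = rng.normal(size=(n, p))
C = (E[:, 0] + E[:, 1] > 0).astype(int)   # class label of each sample

# m new examples E*, characterized only by their p descriptive variables.
E_star = rng.normal(size=(m, p))

# The inducer builds a general rule (classification hypothesis) from (E, C)...
hypothesis = DecisionTreeClassifier().fit(E, C)

# ...which is then used to assign class labels to the new, unlabelled examples.
predicted_labels = hypothesis.predict(E_star)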

This problem has been studied by the statistics community (Duda and Hart, 1973) under the term Pattern Recognition. In the Machine Learning literature, many representations for inducing classification hypotheses have been suggested (including decision trees, rule induction, Naive Bayes or k-NN), each assuming that the target function belongs to some restricted space of hypotheses.

To form a hypothesis, an algorithm makes assumptions, which are called biases in Machine Learning. Apart from the data, these biases are the main factors that shape the hypothesis a learning algorithm induces. A question of interest for researchers in Machine Learning is how to characterize the biases of existing algorithms and how to determine, from background knowledge, when a given bias is appropriate. Biases can be divided into two types (Kohavi, 1995a):

  • Restricted hypothesis space bias. This bias assumes that the model belongs to some restricted space of hypotheses, typically defined in terms of its representation. For example, most decision tree algorithms restrict the hypothesis space to the space of finite trees with univariate splits at their nodes, assuming that the classes are separated by boundaries parallel to the coordinate axes (a minimal code sketch of such an axis-parallel hypothesis follows this list).

  • Preference bias. This bias places a preference ordering on hypotheses. Often, the preference ordering is defined by how the search through the space of hypotheses is conducted. Most preference biases attempt to minimize some measure of syntactic complexity, following Occam's Razor principle of preferring simpler hypotheses.
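To make the restricted hypothesis space bias concrete, the sketch below (illustrative only; scikit-learn and the synthetic data are assumptions, not part of the original work) induces a depth-one decision tree on data whose true class boundary is diagonal; the induced hypothesis is nevertheless forced to be a single univariate, axis-parallel split:

# Illustration: the hypothesis space of a depth-1 decision tree is restricted to
# single univariate splits of the form "x_j <= t", i.e. boundaries parallel to a
# coordinate axis, even when the true class boundary is diagonal.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # true boundary: x_0 + x_1 = 0 (diagonal)

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(export_text(stump))   # prints a single axis-parallel split, e.g. "feature_0 <= ..."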

Given the large number of available algorithms, the user is frequently faced with the problem of selecting the ideal algorithm for a specific dataset, that is, the learning algorithm whose biases are best suited to the data. The ambitious objective of constructing and selecting a unique winner algorithm for all datasets has been ruled out by the empirical evidence behind the `No Free Lunch' theorem (Kohavi et al., 1997). The selection of the `best' algorithm, usually based only on the error percentage, has focused the attention of the Machine Learning community mainly on predictive accuracy. A never-ending race can be sensed in Machine Learning, aimed at constructing the algorithm with the highest predictive accuracy for each dataset. In this vein, Clark (1998) pointed out to us the obsession of the Machine Learning community with summary statistics.

Since our aim is to study the nature of the hypotheses induced by learning algorithms, the error percentage, being a measure of the accuracy of the generated model, cannot help us. Instead of the predictive accuracy, the class predictions will be used as the external expression of the hypothesis induced by each learning algorithm. The probability distribution of the class predictions of a set of learning algorithms over a dataset will be studied jointly by means of a Bayesian network, which displays the joint behaviour of the hypotheses induced by these algorithms.
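A sketch of how such a joint sample of class predictions could be assembled is given below; the classifiers, libraries and data are illustrative stand-ins (the paper used its own set of fourteen inducers and its own tools), and only the idea of one prediction column per algorithm, plus the true class, is taken from the text:

# Illustrative sketch: collect the class predictions of several inducers over a
# common test set into one multivariate sample. Each column behaves as a discrete
# random variable, so a Bayesian network can afterwards be induced over them.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# 2/3 : 1/3 train/test proportion, as in the datasets used in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1 / 3, random_state=0)

inducers = {
    "tree": DecisionTreeClassifier(random_state=0),
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(),
}

# One column of predicted labels per algorithm, plus the true class.
prediction_sample = {name: clf.fit(X_tr, y_tr).predict(X_te)
                     for name, clf in inducers.items()}
prediction_sample["true_class"] = y_te

# 'prediction_sample' is the kind of dataset over which a Bayesian network
# structure learner would be run to capture the joint behaviour of the hypotheses.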

We will use the qualitative part of a kind of probabilistic graphical model known as a Bayesian network to represent the joint behaviour of the learning algorithms. Using the semantics of Bayesian networks, and guided by the definition of conditional independence, regularities are found about the classification hypotheses induced by common Machine Learning inducers over a set of medical datasets. The method proposed in this paper does not compare or study algorithms from the point of view of classification accuracy. The nature of the hypotheses induced by a set of algorithms is our main interest; for this purpose, class predictions are used.

The work is organized as follows. Section 2 introduces Bayesian networks, based on the conditional independence concept. Various approaches for inducing Bayesian networks are also reviewed. Section 3 presents the datasets and Machine Learning algorithms used and the methodology chosen for inducing the Bayesian networks. The concepts proposed to extract conclusions from the Bayesian networks about the joint behaviour of the algorithms also appear in this section. Section 4 shows the results obtained in the tested domains and their interpretations. A resumé and future work appear in Section 5.

Section snippets

Bayesian networks

Bayesian networks (BNs) (Pearl, 1988) constitute a probabilistic framework for reasoning under uncertainty. From an informal perspective, BNs are directed acyclic graphs (DAGs) where the nodes are random variables and the arcs specify the independence assumptions that must be held between the random variables. BNs are based upon the concept of conditional independence among variables. This concept makes a factorization of the probability distribution of the n-dimensional random variable (Z1,…,Zn
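Although the snippet above is cut off on this page, the factorization that a Bayesian network encodes is standard and can be stated for reference (in generic notation, not necessarily that of the paper): the joint probability distribution of (Z1,…,Zn) factorizes as

P(Z1,…,Zn) = P(Z1 | pa(Z1)) · P(Z2 | pa(Z2)) ⋯ P(Zn | pa(Zn)),

where pa(Zi) denotes the set of parents of Zi in the DAG. Two variables X and Y are conditionally independent given Z when P(X | Y, Z) = P(X | Z) for every configuration with P(Y, Z) > 0; it is this notion that the presence or absence of arcs in the network encodes.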

Datasets used

Eleven medical databases from the UCI Machine Learning Repository (Murphy and Aha, 1994) are selected. By selecting all the datasets from a single domain, we hope to obtain more homogeneous conclusions. Each database has separate training and testing sets in a 2/3:1/3 proportion. The characteristics of the databases are given in Table 1.

Classifiers

Fourteen well-known learning algorithms with different biases are used in the experiments. The most relevant biases of each algorithm are cited below:

  • ID3

Results from induced Bayesian networks

Assertions on different types of behaviour of the studied algorithms are extracted, based on the number of domains for which a learning algorithm or a set of algorithms presents one of the explained conditional independence variants in the induced Bayesian network structures. It must be noted that the extracted assertions are
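As an illustration of the kind of independence statement being counted, the sketch below applies a generic chi-squared conditional independence test to three columns of predicted labels. This test is an assumption made purely for illustration; the paper itself reads such statements directly off the induced Bayesian network structure rather than testing them this way.

# Illustrative sketch: test whether algorithm A's predictions are independent of
# algorithm B's predictions given algorithm C's predictions, by pooling a
# chi-squared statistic over the strata of C. This generic test is NOT the
# paper's procedure, which reads independencies from the induced network.
import numpy as np
import pandas as pd
from scipy.stats import chi2, chi2_contingency

def conditional_independence_p_value(a, b, c):
    stat, dof = 0.0, 0
    for value in np.unique(c):
        mask = c == value
        table = pd.crosstab(a[mask], b[mask]).to_numpy()
        if table.shape[0] > 1 and table.shape[1] > 1:
            s, _, d, _ = chi2_contingency(table)
            stat += s
            dof += d
    return chi2.sf(stat, dof) if dof > 0 else 1.0

# Example usage with the prediction columns assembled earlier:
# p = conditional_independence_p_value(prediction_sample["tree"],
#                                      prediction_sample["knn"],
#                                      prediction_sample["naive_bayes"])
# A large p-value is consistent with the two algorithms behaving independently
# once the third algorithm's predictions are known.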

Resumé and future work

From a homogeneous set of databases, we have carried out a study of the joint behaviour of the predictions made by a set of Machine Learning algorithms. Bayesian networks, induced from the learning algorithms' class predictions, were used to study the behaviour of a set of well-known algorithms. From the obtained Bayesian networks, guided by the conditional independence concept, relations between the probability distributions of the hypotheses formed by different algorithms were found. Three

Discussion

Brailovsky: The results that you presented are very interesting. Before you can ascribe certain properties to an algorithm you need to check a lot of things, for example the independence, for a given problem, of the training and test sample set. Have you done this?

Inza: You are right! I mentioned that we only extracted assertions or guidelines on different types of behavior. It is true that for each training- and test-set and different proportions the results are different. For that reason we

Acknowledgements

This work was supported by the grant PI 96/12 from the Gobierno Vasco – Departamento de Educación, Universidades e Investigación.

References (34)

  • Aha, D., et al., 1991. Instance-based learning algorithms. Machine Learning.
  • Andersen, S.K., Olesen, K.G., Jensen, F.V., Jensen, F., 1989. HUGIN – a shell for building Bayesian belief universes...
  • Auer, P., Holte, R., Maass, W., 1995. Theory and applications of agnostic PAC-learning with small decision trees. In:...
  • Bouckaert, R.R., 1995. Bayesian belief networks: from construction to inference. Ph.D. Thesis, Department of Computer...
  • Cestnik, B., 1990. Estimating probabilities: a crucial task in machine learning. In: Proceedings of the European...
  • Clark, P., 1998. Personal...
  • Clark, P., et al., 1989. The CN2 induction algorithm. Machine Learning.
  • Cohen, W.W., 1995. Fast effective rule induction. In: Machine Learning, Proceedings of the 12th International...
  • Cooper, G.F., et al., 1992. A Bayesian method for the induction of probabilistic networks from data. Machine Learning.
  • Cost, S., et al., 1993. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning.
  • Dawid, A.P., 1979. Conditional independence in statistical theory. Journal of the Royal Statistical Society, Series B.
  • Duda, R., et al., 1973. Pattern Classification and Scene Analysis.
  • Etxeberria, R., et al., 1997. Analysis of the behaviour of genetic algorithms when learning Bayesian networks structure from data. Pattern Recognition Letters.
  • Heckerman, D., 1995. A tutorial on learning with Bayesian networks. Technical Report,...
  • Heckerman, D., et al., 1995. Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning.
  • Herskovits, E., Cooper, G., 1990. Kutató – an entropy-driven system for construction of probabilistic expert systems...
  • Holte, R.C., 1993. Very simple classification rules perform well on most commonly used databases. Machine Learning.