Learning Bayesian networks for discrete data

https://doi.org/10.1016/j.csda.2008.10.007

Abstract

Bayesian networks have received much attention in the recent literature. In this article, we propose an approach to learning Bayesian networks using the stochastic approximation Monte Carlo (SAMC) algorithm. Our approach has two nice features. First, it possesses a self-adjusting mechanism and thus essentially avoids the local-trap problem suffered by conventional MCMC simulation-based approaches to learning Bayesian networks. Second, it falls into the class of dynamic importance sampling algorithms: the network features can be inferred by dynamically weighted averaging of the samples generated in the learning process, and the resulting estimates can have much lower variation than single model-based estimates. The numerical results indicate that our approach can mix much faster over the space of Bayesian networks than conventional MCMC simulation-based approaches.

Introduction

The use of graphs to represent statistical models has been one focus of research in recent years. In particular, researchers have directed their interest to Bayesian networks and applications of such models to biological data; see, e.g., Friedman et al. (2000) and Ellis and Wong (2008). The Bayesian network, as illustrated by Fig. 1, is a directed acyclic graph (DAG) in which the nodes represent the variables in the domain and the edges correspond to direct probabilistic dependencies between them. As indicated by many applications, the Bayesian network is a powerful knowledge representation and reasoning tool under the conditions of uncertainty that are typical of real-life applications.

Many approaches have been developed for learning Bayesian networks in the literature. These approaches can be roughly grouped into three categories: conditional independence test-based approaches, optimization-based approaches, and MCMC simulation-based approaches.

The approaches in the first category perform a qualitative study of the dependence relationships between the nodes and generate a network that represents most of these relationships. The approaches described in Spirtes et al. (1993), Wermuth and Lauritzen (1983) and de Campos and Huete (2000) belong to this category. The networks constructed by these approaches are usually asymptotically correct, but, as pointed out by Cooper and Herskovits (1992), conditional independence tests with large condition-sets may be unreliable unless the volume of data is enormous. We note that, due to limited research resources, the sample size of biological data is often small, e.g., the gene expression data studied in Friedman et al. (2000) and the real examples studied in this paper.
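To make the notion of a condition-set concrete, the sketch below shows a generic chi-square test of conditional independence for discrete variables; it is not the test used in the cited works, and the function name and significance level are illustrative. Note how the data are split into one stratum per configuration of the conditioning variables, which is why large condition-sets quickly exhaust a small sample.

```python
# A hypothetical chi-square test of X independent of Y given a conditioning
# set, for discrete data (illustrative; not the test of the cited works).
import numpy as np
from scipy.stats import chi2

def ci_test(data, x, y, cond, alpha=0.05):
    """data: (n, p) integer array; x, y: column indices; cond: list of columns."""
    cond_vals = data[:, list(cond)]
    stat, dof = 0.0, 0
    for z in {tuple(row) for row in cond_vals}:      # one stratum per configuration
        mask = np.all(cond_vals == np.array(z), axis=1)
        sub = data[mask]
        table = np.zeros((data[:, x].max() + 1, data[:, y].max() + 1))
        for xi, yi in zip(sub[:, x], sub[:, y]):
            table[xi, yi] += 1
        n = table.sum()
        if n == 0:
            continue
        expected = np.outer(table.sum(1), table.sum(0)) / n
        with np.errstate(divide="ignore", invalid="ignore"):
            stat += np.where(expected > 0, (table - expected) ** 2 / expected, 0.0).sum()
        dof += max((np.count_nonzero(table.sum(1)) - 1) *
                   (np.count_nonzero(table.sum(0)) - 1), 0)
    p_value = 1.0 - chi2.cdf(stat, max(dof, 1))
    return p_value > alpha   # True is read as "accept X independent of Y given cond"
```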

The approaches in the second category attempt to find a network that optimizes a selected scoring function, which evaluates the fitness of each feasible network to the data. The scoring functions can be formulated based on different principles, such as entropy (Herskovits and Cooper, 1990), the minimum description length (Lam and Bacchus, 1994), and Bayesian scores (Cooper and Herskovits, 1992, Heckerman et al., 1995). The optimization procedures employed are usually heuristic, such as tabu search (Bouckaert, 1995) and evolutionary computation (de Campos and Huete, 2000, Neil and Korb, 1999). Unfortunately, the task of finding a network structure that optimizes the scoring function is known to be an NP-hard problem (Chickering, 1996). Hence, the optimization process often stops at a locally optimal structure.
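As an illustration of a Bayesian scoring function of the kind cited above, the following sketch computes a Cooper–Herskovits-style log marginal likelihood of a single node given a candidate parent set, assuming uniform Dirichlet priors; the variable names and the uniform-prior choice are our own simplifications rather than the paper's implementation.

```python
# A hypothetical family score: the log marginal likelihood of node i given a
# candidate parent set, under uniform Dirichlet(1) priors (illustrative).
from itertools import product
from math import lgamma
import numpy as np

def family_score(data, i, parents, levels):
    """data: (n, p) integer array; levels[j]: number of categories of column j."""
    r_i = levels[i]
    score = 0.0
    for config in product(*[range(levels[j]) for j in parents]):
        mask = np.ones(len(data), dtype=bool)
        for j, v in zip(parents, config):
            mask &= data[:, j] == v
        counts = np.bincount(data[mask, i], minlength=r_i)
        n_ij = int(counts.sum())
        # Dirichlet-multinomial marginal likelihood for this parent configuration
        score += lgamma(r_i) - lgamma(r_i + n_ij)
        score += sum(lgamma(1 + int(c)) for c in counts)
    return score
```

Under the usual parameter-independence assumption, the score of a full network is the sum of such family scores over all nodes, and it is this quantity that a heuristic search such as tabu search tries to maximize.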

The approaches in the third category work by simulating a Markov chain over the space of feasible network structures, with the stationary distribution being the posterior distribution of the network. Works belonging to this category include Madigan and Raftery (1994), Madigan and York (1995), and Giudici and Green (1999), among others. In these works, the simulation is done using the Metropolis–Hastings (MH) algorithm, and the network features are inferred by averaging over a large number of networks simulated from the posterior distribution. Averaging over different networks can significantly reduce the variation in estimation suffered by single network-based inference procedures. Although these approaches seem attractive, they only work well for problems with a very small number of variables. This is because the energy landscape of the Bayesian network can be quite rugged, with a multitude of local energy minima separated by high energy barriers, especially when the network size is large. Here, the energy function refers to the negative log-posterior distribution function of the Bayesian network. As is well known, the MH algorithm is prone to getting trapped in a local energy minimum indefinitely when simulating from a system with a rugged energy landscape. To alleviate this difficulty, Friedman and Koller (2003) introduced a two-stage algorithm: first use the MH algorithm to sample a temporal order of the nodes, and then sample a network structure compatible with that order. As discussed in Friedman and Koller (2003), for any Bayesian network there exists a temporal order of the nodes such that for any two nodes X and Y, if there is an edge from X to Y, then X precedes Y in the order. For example, for the network shown in Fig. 1, a temporal order compatible with the network is ACDFBGE. The two-stage algorithm improves mixing over the space of network structures; however, the structures it samples do not follow the correct posterior distribution, because the temporal orders do not induce a partition of the space of network structures: a network may be compatible with more than one order. For example, the network shown in Fig. 1 is compatible with both the orders ACDFBGE and ADCFBGE.
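For concreteness, the sketch below shows one step of a structure-MCMC sampler of the kind described above: a single-edge change (add, delete, or reverse) is proposed and accepted with the Metropolis–Hastings probability. It is a minimal illustration, not the samplers of the cited works; `log_post` is an assumed user-supplied log-posterior, and the proposal-asymmetry correction is omitted for brevity.

```python
# A hypothetical single step of structure MCMC on an adjacency matrix
# (adj[i, j] = 1 means an edge i -> j): propose one edge change, reject
# cyclic graphs, and accept with the MH probability.
import numpy as np

def is_acyclic(adj):
    """Kahn's algorithm: the graph is a DAG iff every node can be removed."""
    adj = adj.copy()
    indeg = adj.sum(axis=0)
    stack = list(np.where(indeg == 0)[0])
    seen = 0
    while stack:
        u = stack.pop()
        seen += 1
        for v in np.where(adj[u] == 1)[0]:
            adj[u, v] = 0
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    return seen == len(indeg)

def mh_step(adj, log_post, rng):
    i, j = rng.choice(len(adj), size=2, replace=False)
    proposal = adj.copy()
    if proposal[i, j]:                      # delete, or reverse, an existing edge
        proposal[i, j] = 0
        if rng.random() < 0.5:
            proposal[j, i] = 1
    else:                                   # add a new edge
        proposal[i, j] = 1
    if not is_acyclic(proposal):
        return adj                          # cyclic proposals are rejected
    log_ratio = log_post(proposal) - log_post(adj)
    return proposal if np.log(rng.random()) < log_ratio else adj
```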

In this article, we propose to learn Bayesian networks using the stochastic approximation Monte Carlo (SAMC) algorithm (Liang et al., 2007). A remarkable feature of the SAMC algorithm is that it possesses a self-adjusting mechanism and is thus less likely to be trapped by local energy minima. This is very important for learning Bayesian networks. In addition, SAMC belongs to the class of dynamic weighting algorithms (Wong and Liang, 1997, Liu et al., 2001, Liang, 2002), and the samples generated in the learning process can be used to infer the network features via a dynamically weighted estimator. Like Bayesian model averaging estimators, the dynamically weighted estimator can have much lower variation than single model-based estimators.
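The following sketch indicates how such a dynamically weighted estimator might be formed once the sampler has produced networks together with importance weights; the function is illustrative, and the weights are assumed to come from the learning process described in Section 3.

```python
# A hypothetical dynamically weighted (importance-sampling) estimator: a
# network feature h (e.g. the indicator of a particular edge) is averaged over
# the sampled networks with their weights, rather than read off a single
# best network.
import numpy as np

def weighted_feature_estimate(samples, weights, feature):
    """samples: list of networks; weights: importance weights from the sampler."""
    w = np.asarray(weights, dtype=float)
    h = np.array([feature(g) for g in samples], dtype=float)
    return np.sum(w * h) / np.sum(w)
```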

The remainder of this article is organized as follows. In Section 2, we give the formulation of Bayesian networks. In Section 3, we first give a brief review of the SAMC algorithm and then describe its implementation for Bayesian networks. In Section 4, we present numerical results on a simulated example and two real biological data examples. In Section 5, we conclude the paper with a brief discussion.

Section snippets

Bayesian networks

A Bayesian network model can be defined as a pair B=(G,ρ), where G=(V,E) is a directed acyclic graph that represents the structure of the network, V denotes the set of nodes, E denotes the set of edges, and ρ is a vector of conditional probabilities as described below. For a node V ∈ V, a parent of V is a node from which there is a directed link to V. The set of parents of V is denoted by pa(V). In this article, we study only the discrete case where V is a categorical variable taking values in a
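A minimal sketch of the pair B = (G, ρ) for discrete nodes is given below: the DAG is stored as a parent list, ρ as conditional probability tables, and the joint probability of a configuration factorizes over the nodes as the product of P(V | pa(V)). The two-node example and its probabilities are purely illustrative.

```python
# A hypothetical representation of B = (G, rho) for discrete variables and the
# corresponding factorized joint probability (illustrative, not the paper's code).
import numpy as np

def joint_probability(parents, cpts, assignment):
    """parents[v]: tuple of parent names of node v.
    cpts[v]: array indexed by (parent values..., value of v).
    assignment: dict mapping node name -> observed category."""
    prob = 1.0
    for v, pa in parents.items():
        idx = tuple(assignment[u] for u in pa) + (assignment[v],)
        prob *= cpts[v][idx]
    return prob

# e.g. a two-node network A -> C with binary variables
parents = {"A": (), "C": ("A",)}
cpts = {"A": np.array([0.3, 0.7]),
        "C": np.array([[0.8, 0.2], [0.4, 0.6]])}
print(joint_probability(parents, cpts, {"A": 1, "C": 0}))   # 0.7 * 0.4
```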

A review of the SAMC algorithm

Suppose that we are working with the following Boltzmann distribution, f(x) = (1/Z) exp{−U(x)/τ}, x ∈ X, where Z is the normalizing constant, τ is the temperature, X is the sample space, and U(x) is called the energy function in the terminology of physics. In the context of Bayesian networks, U(x) corresponds to −log P(G|D), the negative logarithm of the posterior distribution (6), and the sample space X is finite. Furthermore, we suppose that the sample space has been partitioned according to the energy function
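The sketch below outlines one SAMC iteration for a generic discrete problem, following the description above: a move is proposed and accepted under the weight-adjusted density proportional to exp{−U(x)/τ − θ_J(x)}, and the log-weight of the visited subregion is then updated with a decreasing gain. The helper functions `energy`, `region`, and `propose`, as well as the gain sequence, are illustrative assumptions, not the paper's implementation.

```python
# A hypothetical single SAMC iteration.  `energy` returns U(x), `region` maps
# x to its subregion index J(x), and `propose` generates a candidate state.
import numpy as np

def samc_step(x, theta, t, energy, region, propose, rng,
              tau=1.0, t0=1000, pi=None):
    m = len(theta)
    pi = np.full(m, 1.0 / m) if pi is None else pi      # desired sampling frequencies
    y = propose(x, rng)
    # MH ratio under the weight-adjusted density exp{-U(x)/tau - theta_J(x)}
    log_r = (-(energy(y) - energy(x)) / tau
             - (theta[region(y)] - theta[region(x)]))
    if np.log(rng.random()) < log_r:
        x = y
    gamma = t0 / max(t0, t)                             # decreasing gain, an illustrative choice
    e = np.zeros(m)
    e[region(x)] = 1.0
    theta = theta + gamma * (e - pi)                    # self-adjusting weight update
    return x, theta
```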

An illustrative example

Consider the Bayesian network shown in Fig. 1 again. Suppose that a dataset consisting of 500 independent observations has been generated from the network according to the following distributions: V_A ~ Bernoulli(0.7), V_D ~ Bernoulli(0.5), V_C | V_A ~ P_1, V_F | V_C, V_D ~ P_2, V_B | V_A, V_F ~ P_2, V_G | V_B, V_C ~ P_2, and V_E | V_G ~ P_1, where P_1 and P_2 are defined in Table 1 and Table 2, respectively.
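A sketch of how such a dataset could be generated is given below. Since Tables 1 and 2 are not reproduced in this excerpt, the entries of P_1 and P_2 are placeholders; only the parent structure and the sampling order follow the text.

```python
# A hypothetical generator for the 500 observations described above.  P1 and
# P2 are placeholders for Tables 1 and 2, which are not reproduced here.
import numpy as np

rng = np.random.default_rng(0)
n = 500
P1 = {0: 0.2, 1: 0.8}                      # placeholder: P(child = 1 | one parent)
P2 = {(0, 0): 0.1, (0, 1): 0.5,            # placeholder: P(child = 1 | two parents)
      (1, 0): 0.6, (1, 1): 0.9}

VA = rng.binomial(1, 0.7, n)
VD = rng.binomial(1, 0.5, n)
VC = rng.binomial(1, [P1[a] for a in VA])
VF = rng.binomial(1, [P2[c, d] for c, d in zip(VC, VD)])
VB = rng.binomial(1, [P2[a, f] for a, f in zip(VA, VF)])
VG = rng.binomial(1, [P2[b, c] for b, c in zip(VB, VC)])
VE = rng.binomial(1, [P1[g] for g in VG])
data = np.column_stack([VA, VB, VC, VD, VE, VF, VG])   # 500 x 7 dataset
```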

SAMC was first applied to this example. We partitioned the sample space into 501 subregions with an equal energy bandwidth, E_1 = {x : U(x
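Assuming the partition takes the equal-bandwidth form described above, a subregion index for a network with energy u might be computed as in the following sketch; the lower bound and bandwidth are illustrative, as the paper's actual cut-offs are not shown in this excerpt.

```python
# A minimal sketch of mapping an energy value to one of m equal-width
# subregions; u_min and bandwidth are assumed, illustrative quantities.
def region_index(u, u_min, bandwidth, m=501):
    """Return the index of the energy band containing u, clipped to [0, m-1]."""
    return min(max(int((u - u_min) // bandwidth), 0), m - 1)
```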

Discussion

In this article, we have applied the SAMC algorithm to the learning of Bayesian networks. The numerical results indicate that SAMC can mix much faster over the space of Bayesian networks than the MH algorithm. All the examples we studied involve discrete data. Our approach can also be applied to continuous data by including a pre-discretization step, but the resulting networks may depend on the discretization scheme. In general, discretization with a small number of categories can lead to
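As an example of such a pre-discretization step, the sketch below performs equal-frequency (quantile) binning of a continuous column into a small number of categories; the bin count and binning rule are illustrative choices, not a scheme prescribed by the paper.

```python
# A hypothetical pre-discretization step: equal-frequency (quantile) binning
# of a continuous column into a small number of categories.
import numpy as np

def discretize(column, n_bins=3):
    """Map a continuous column to integer categories using quantile cut-points."""
    edges = np.quantile(column, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(edges, column, side="right")
```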

Acknowledgment

Liang’s research was supported in part by grant DMS-0607755 from the National Science Foundation and award KUS-C1-016-04 from King Abdullah University of Science and Technology (KAUST). The authors thank Professor S.P. Azen, the associate editor, and the referee for their comments, which have led to significant improvement of this paper.

References (35)

  • L.M. de Campos et al. A new approach for learning belief networks using independence criteria. Int. J. Approx. Reason. (2000)
  • L.A. Kurgan et al. Knowledge discovery approach to automated cardiac SPECT diagnosis. Artif. Intell. Med. (2001)
  • C. Andrieu et al. Stability of stochastic approximation under verifiable conditions. SIAM J. Control Optim. (2005)
  • Bouckaert, R.R., 1995. Bayesian belief networks: From construction to inference. Ph.D. Thesis, University of...
  • H.F. Chen. Stochastic Approximation and its Applications (2002)
  • D.M. Chickering. Learning Bayesian networks is NP-complete
  • K.J. Cios et al. CLIP3: Cover learning using integer programming. Kybernetes (1997)
  • G.F. Cooper et al. A Bayesian method for the induction of probabilistic networks from data. Mach. Learn. (1992)
  • G.F. Cooper et al. Causal discovery from a mixture of experimental and observational data
  • B. Ellis et al. Learning causal Bayesian network structures from experimental data. J. Amer. Statist. Assoc. (2008)
  • Fayyad, U., Irani, K., 1993. Multi-interval discretization of continuous-valued attributes for classification learning....
  • N. Friedman et al. Using Bayesian networks to analyze expression data. J. Comput. Biol. (2000)
  • N. Friedman et al. Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Mach. Learn. (2003)
  • J. Geweke. Bayesian inference in econometric models using Monte Carlo integration. Econometrica (1989)
  • P. Giudici et al. Decomposable graphical Gaussian model determination. Biometrika (1999)
  • D. Heckerman et al. Learning Bayesian networks: The combination of knowledge and statistical data. Mach. Learn. (1995)
  • Herskovits, E., Cooper, G.F., 1990. Kutató: An entropy-driven system for the construction of probabilistic expert...