Learning Bayesian networks for discrete data

https://doi.org/10.1016/j.csda.2008.10.007

Abstract

Bayesian networks have received much attention in the recent literature. In this article, we propose an approach to learning Bayesian networks using the stochastic approximation Monte Carlo (SAMC) algorithm. Our approach has two nice features. First, it possesses a self-adjusting mechanism and thus essentially avoids the local-trap problem suffered by conventional MCMC simulation-based approaches to learning Bayesian networks. Second, it falls into the class of dynamic importance sampling algorithms: the network features can be inferred by dynamically weighted averaging of the samples generated in the learning process, and the resulting estimates can have much lower variation than single model-based estimates. The numerical results indicate that our approach can mix much faster over the space of Bayesian networks than conventional MCMC simulation-based approaches.

Introduction

The use of graphs to represent statistical models has been one focus of research in recent years. In particular, researchers have directed their interest to Bayesian networks and applications of such models to biological data; see, e.g., Friedman et al. (2000) and Ellis and Wong (2008). The Bayesian network, as illustrated by Fig. 1, is a directed acyclic graph (DAG) in which the nodes represent the variables in the domain and the edges correspond to direct probabilistic dependencies between them. As indicated by many applications, the Bayesian network is a powerful knowledge representation and reasoning tool under the conditions of uncertainty that are typical of real-life applications.

Many approaches have been developed for learning Bayesian networks in the literature. These approaches can be roughly grouped into three categories: conditional independence test-based approaches, optimization-based approaches, and MCMC simulation-based approaches.

The approaches in the first category perform a qualitative study of the dependence relationships between the nodes and generate a network that represents most of these relationships. The approaches described in Spirtes et al. (1993), Wermuth and Lauritzen (1983) and de Campos and Huete (2000) belong to this category. The networks constructed by these approaches are usually asymptotically correct, but, as pointed out by Cooper and Herskovits (1992), conditional independence tests with large condition-sets may be unreliable unless the volume of data is enormous. We note that, due to limited research resources, the sample size of biological data is often small, e.g., the gene expression data studied in Friedman et al. (2000) and the real examples studied in this paper.
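To make the notion of a condition-set concrete, the sketch below shows a generic chi-square test of conditional independence for discrete variables; it is not the test used in the cited works, and the function name and significance level are illustrative. Note how the data are split into one stratum per configuration of the conditioning variables, which is why large condition-sets quickly exhaust a small sample.

```python
# A hypothetical chi-square test of X independent of Y given a conditioning
# set, for discrete data (illustrative; not the test of the cited works).
import numpy as np
from scipy.stats import chi2

def ci_test(data, x, y, cond, alpha=0.05):
    """data: (n, p) integer array; x, y: column indices; cond: list of columns."""
    cond_vals = data[:, list(cond)]
    stat, dof = 0.0, 0
    for z in {tuple(row) for row in cond_vals}:      # one stratum per configuration
        mask = np.all(cond_vals == np.array(z), axis=1)
        sub = data[mask]
        table = np.zeros((data[:, x].max() + 1, data[:, y].max() + 1))
        for xi, yi in zip(sub[:, x], sub[:, y]):
            table[xi, yi] += 1
        n = table.sum()
        if n == 0:
            continue
        expected = np.outer(table.sum(1), table.sum(0)) / n
        with np.errstate(divide="ignore", invalid="ignore"):
            stat += np.where(expected > 0, (table - expected) ** 2 / expected, 0.0).sum()
        dof += max((np.count_nonzero(table.sum(1)) - 1) *
                   (np.count_nonzero(table.sum(0)) - 1), 0)
    p_value = 1.0 - chi2.cdf(stat, max(dof, 1))
    return p_value > alpha   # True is read as "accept X independent of Y given cond"
```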

The approaches in the second category attempt to find a network that optimizes a selected scoring function, which evaluates the fitness of each feasible network to the data. The scoring functions can be formulated based on different principles, such as entropy (Herskovits and Cooper, 1990), the minimum description length (Lam and Bacchus, 1994), and Bayesian scores (Cooper and Herskovits, 1992, Heckerman et al., 1995). The optimization procedures employed are usually heuristic, such as tabu search (Bouckaert, 1995) and evolutionary computation (de Campos and Huete, 2000, Neil and Korb, 1999). Unfortunately, the task of finding a network structure that optimizes the scoring function is known to be an NP-hard problem (Chickering, 1996). Hence, the optimization process often stops at a locally optimal structure.
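As an illustration of a Bayesian scoring function of the kind cited above, the following sketch computes a Cooper–Herskovits-style log marginal likelihood of a single node given a candidate parent set, assuming uniform Dirichlet priors; the variable names and the uniform-prior choice are our own simplifications rather than the paper's implementation.

```python
# A hypothetical family score: the log marginal likelihood of node i given a
# candidate parent set, under uniform Dirichlet(1) priors (illustrative).
from itertools import product
from math import lgamma
import numpy as np

def family_score(data, i, parents, levels):
    """data: (n, p) integer array; levels[j]: number of categories of column j."""
    r_i = levels[i]
    score = 0.0
    for config in product(*[range(levels[j]) for j in parents]):
        mask = np.ones(len(data), dtype=bool)
        for j, v in zip(parents, config):
            mask &= data[:, j] == v
        counts = np.bincount(data[mask, i], minlength=r_i)
        n_ij = int(counts.sum())
        # Dirichlet-multinomial marginal likelihood for this parent configuration
        score += lgamma(r_i) - lgamma(r_i + n_ij)
        score += sum(lgamma(1 + int(c)) for c in counts)
    return score
```

Under the usual parameter-independence assumption, the score of a full network is the sum of such family scores over all nodes, and it is this quantity that a heuristic search such as tabu search tries to maximize.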

The approaches in the third category work by simulating a Markov chain over the space of feasible network structures, with the stationary distribution being the posterior distribution of the network. Works belonging to this category include Madigan and Raftery (1994), Madigan and York (1995), and Giudici and Green (1999), among others. In these works, the simulation is done using the Metropolis–Hastings (MH) algorithm, and the network features are inferred by averaging over a large number of networks simulated from the posterior distribution. Averaging over different networks can significantly reduce the variation in estimation suffered by single network-based inference procedures. Although these approaches seem attractive, they only work well for problems with a very small number of variables. This is because the energy landscape of the Bayesian network can be quite rugged, with a multitude of local energy minima separated by high energy barriers, especially when the network size is large. Here, the energy function refers to the negative log-posterior distribution function of the Bayesian network. As is well known, the MH algorithm is prone to getting trapped in a local energy minimum indefinitely when simulating from a system with a rugged energy landscape. To alleviate this difficulty, Friedman and Koller (2003) introduced a two-stage algorithm: first use the MH algorithm to sample a temporal order of the nodes, and then sample a network structure compatible with that order. As discussed in Friedman and Koller (2003), for any Bayesian network there exists a temporal order of the nodes such that for any two nodes X and Y, if there is an edge from X to Y, then X precedes Y in the order. For example, for the network shown in Fig. 1, a temporal order compatible with the network is ACDFBGE. The two-stage algorithm improves mixing over the space of network structures; however, the structures it samples do not follow the correct posterior distribution, because the temporal orders do not induce a partition of the space of network structures: a network may be compatible with more than one order. For example, the network shown in Fig. 1 is compatible with both the orders ACDFBGE and ADCFBGE.
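For concreteness, the sketch below shows one step of a structure-MCMC sampler of the kind described above: a single-edge change (add, delete, or reverse) is proposed and accepted with the Metropolis–Hastings probability. It is a minimal illustration, not the samplers of the cited works; `log_post` is an assumed user-supplied log-posterior, and the proposal-asymmetry correction is omitted for brevity.

```python
# A hypothetical single step of structure MCMC on an adjacency matrix
# (adj[i, j] = 1 means an edge i -> j): propose one edge change, reject
# cyclic graphs, and accept with the MH probability.
import numpy as np

def is_acyclic(adj):
    """Kahn's algorithm: the graph is a DAG iff every node can be removed."""
    adj = adj.copy()
    indeg = adj.sum(axis=0)
    stack = list(np.where(indeg == 0)[0])
    seen = 0
    while stack:
        u = stack.pop()
        seen += 1
        for v in np.where(adj[u] == 1)[0]:
            adj[u, v] = 0
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    return seen == len(indeg)

def mh_step(adj, log_post, rng):
    i, j = rng.choice(len(adj), size=2, replace=False)
    proposal = adj.copy()
    if proposal[i, j]:                      # delete, or reverse, an existing edge
        proposal[i, j] = 0
        if rng.random() < 0.5:
            proposal[j, i] = 1
    else:                                   # add a new edge
        proposal[i, j] = 1
    if not is_acyclic(proposal):
        return adj                          # cyclic proposals are rejected
    log_ratio = log_post(proposal) - log_post(adj)
    return proposal if np.log(rng.random()) < log_ratio else adj
```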

In this article, we propose to learn Bayesian networks using the stochastic approximation Monte Carlo (SAMC) algorithm (Liang et al., 2007). A remarkable feature of the SAMC algorithm is that it possesses a self-adjusting mechanism and is thus less likely to be trapped by local energy minima. This is very important for learning Bayesian networks. In addition, SAMC belongs to the class of dynamic weighting algorithms (Wong and Liang, 1997, Liu et al., 2001, Liang, 2002), and the samples generated in the learning process can be used to infer the network features via a dynamically weighted estimator. Like Bayesian model averaging estimators, the dynamically weighted estimator can have much lower variation than single model-based estimators.
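The following sketch indicates how such a dynamically weighted estimator might be formed once the sampler has produced networks together with importance weights; the function is illustrative, and the weights are assumed to come from the learning process described in Section 3.

```python
# A hypothetical dynamically weighted (importance-sampling) estimator: a
# network feature h (e.g. the indicator of a particular edge) is averaged over
# the sampled networks with their weights, rather than read off a single
# best network.
import numpy as np

def weighted_feature_estimate(samples, weights, feature):
    """samples: list of networks; weights: importance weights from the sampler."""
    w = np.asarray(weights, dtype=float)
    h = np.array([feature(g) for g in samples], dtype=float)
    return np.sum(w * h) / np.sum(w)
```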

The remainder of this article is organized as follows. In Section 2, we give the formulation of Bayesian networks. In Section 3, we first give a brief review of the SAMC algorithm and then describe its implementation for Bayesian networks. In Section 4, we present numerical results on a simulated example and two real biological data examples. In Section 5, we conclude the paper with a brief discussion.

Section snippets

Bayesian networks

A Bayesian network model can be defined as a pair B=(G,ρ), where G=(V,E) is a directed acyclic graph that represents the structure of the network, V denotes the set of nodes, E denotes the set of edges, and ρ is a vector of conditional probabilities as described below. For a node V ∈ V, a parent of V is a node from which there is a directed link to V. The set of parents of V is denoted by pa(V). In this article, we study only the discrete case where V is a categorical variable taking values in a
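A minimal sketch of the pair B = (G, ρ) for discrete nodes is given below: the DAG is stored as a parent list, ρ as conditional probability tables, and the joint probability of a configuration factorizes over the nodes as the product of P(V | pa(V)). The two-node example and its probabilities are purely illustrative.

```python
# A hypothetical representation of B = (G, rho) for discrete variables and the
# corresponding factorized joint probability (illustrative, not the paper's code).
import numpy as np

def joint_probability(parents, cpts, assignment):
    """parents[v]: tuple of parent names of node v.
    cpts[v]: array indexed by (parent values..., value of v).
    assignment: dict mapping node name -> observed category."""
    prob = 1.0
    for v, pa in parents.items():
        idx = tuple(assignment[u] for u in pa) + (assignment[v],)
        prob *= cpts[v][idx]
    return prob

# e.g. a two-node network A -> C with binary variables
parents = {"A": (), "C": ("A",)}
cpts = {"A": np.array([0.3, 0.7]),
        "C": np.array([[0.8, 0.2], [0.4, 0.6]])}
print(joint_probability(parents, cpts, {"A": 1, "C": 0}))   # 0.7 * 0.4
```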

A review of the SAMC algorithm

Suppose that we are working with the following Boltzmann distribution, f(x) = (1/Z) exp{−U(x)/τ}, x ∈ X, where Z is the normalizing constant, τ is the temperature, X is the sample space, and U(x) is called the energy function in the terminology of physics. In the context of Bayesian networks, U(x) corresponds to −log P(G|D), the negative logarithm of the posterior distribution (6), and the sample space X is finite. Furthermore, we suppose that the sample space has been partitioned according to the energy function
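The sketch below outlines one SAMC iteration for a generic discrete problem, following the description above: a move is proposed and accepted under the weight-adjusted density proportional to exp{−U(x)/τ − θ_J(x)}, and the log-weight of the visited subregion is then updated with a decreasing gain. The helper functions `energy`, `region`, and `propose`, as well as the gain sequence, are illustrative assumptions, not the paper's implementation.

```python
# A hypothetical single SAMC iteration.  `energy` returns U(x), `region` maps
# x to its subregion index J(x), and `propose` generates a candidate state.
import numpy as np

def samc_step(x, theta, t, energy, region, propose, rng,
              tau=1.0, t0=1000, pi=None):
    m = len(theta)
    pi = np.full(m, 1.0 / m) if pi is None else pi      # desired sampling frequencies
    y = propose(x, rng)
    # MH ratio under the weight-adjusted density exp{-U(x)/tau - theta_J(x)}
    log_r = (-(energy(y) - energy(x)) / tau
             - (theta[region(y)] - theta[region(x)]))
    if np.log(rng.random()) < log_r:
        x = y
    gamma = t0 / max(t0, t)                             # decreasing gain, an illustrative choice
    e = np.zeros(m)
    e[region(x)] = 1.0
    theta = theta + gamma * (e - pi)                    # self-adjusting weight update
    return x, theta
```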

An illustrative example

Consider the Bayesian network shown in Fig. 1 again. Suppose that a dataset consisting of 500 independent observations has been generated from the network according to the following distributions: V_A ~ Bernoulli(0.7), V_D ~ Bernoulli(0.5), V_C | V_A ~ P_1, V_F | V_C, V_D ~ P_2, V_B | V_A, V_F ~ P_2, V_G | V_B, V_C ~ P_2, and V_E | V_G ~ P_1, where P_1 and P_2 are defined in Table 1 and Table 2, respectively.
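A sketch of how such a dataset could be generated is given below. Since Tables 1 and 2 are not reproduced in this excerpt, the entries of P_1 and P_2 are placeholders; only the parent structure and the sampling order follow the text.

```python
# A hypothetical generator for the 500 observations described above.  P1 and
# P2 are placeholders for Tables 1 and 2, which are not reproduced here.
import numpy as np

rng = np.random.default_rng(0)
n = 500
P1 = {0: 0.2, 1: 0.8}                      # placeholder: P(child = 1 | one parent)
P2 = {(0, 0): 0.1, (0, 1): 0.5,            # placeholder: P(child = 1 | two parents)
      (1, 0): 0.6, (1, 1): 0.9}

VA = rng.binomial(1, 0.7, n)
VD = rng.binomial(1, 0.5, n)
VC = rng.binomial(1, [P1[a] for a in VA])
VF = rng.binomial(1, [P2[c, d] for c, d in zip(VC, VD)])
VB = rng.binomial(1, [P2[a, f] for a, f in zip(VA, VF)])
VG = rng.binomial(1, [P2[b, c] for b, c in zip(VB, VC)])
VE = rng.binomial(1, [P1[g] for g in VG])
data = np.column_stack([VA, VB, VC, VD, VE, VF, VG])   # 500 x 7 dataset
```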

SAMC was first applied to this example. We partitioned the sample space into 501 subregions with an equal energy bandwidth, E_1 = {x : U(x
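Assuming the partition takes the equal-bandwidth form described above, a subregion index for a network with energy u might be computed as in the following sketch; the lower bound and bandwidth are illustrative, as the paper's actual cut-offs are not shown in this excerpt.

```python
# A minimal sketch of mapping an energy value to one of m equal-width
# subregions; u_min and bandwidth are assumed, illustrative quantities.
def region_index(u, u_min, bandwidth, m=501):
    """Return the index of the energy band containing u, clipped to [0, m-1]."""
    return min(max(int((u - u_min) // bandwidth), 0), m - 1)
```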

Discussion

In this article, we have applied the SAMC algorithm to the learning of Bayesian networks. The numerical results indicate that SAMC can mix much faster over the space of Bayesian networks than the MH algorithm. All the examples we studied involve discrete data. Our approach can also be applied to continuous data by including a pre-discretization step, but the resulting networks may depend on the discretization scheme. In general, discretization with a small number of categories can lead to
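As an example of such a pre-discretization step, the sketch below performs equal-frequency (quantile) binning of a continuous column into a small number of categories; the bin count and binning rule are illustrative choices, not a scheme prescribed by the paper.

```python
# A hypothetical pre-discretization step: equal-frequency (quantile) binning
# of a continuous column into a small number of categories.
import numpy as np

def discretize(column, n_bins=3):
    """Map a continuous column to integer categories using quantile cut-points."""
    edges = np.quantile(column, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(edges, column, side="right")
```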

Acknowledgment

Liang’s research was supported in part by grant DMS-0607755 from the National Science Foundation and award KUS-C1-016-04 from King Abdullah University of Science and Technology (KAUST). The authors thank Professor S.P. Azen, the associate editor, and the referee for their comments, which have led to significant improvement of this paper.

References (35)

  • L.M. de Campos et al. A new approach for learning belief networks using independence criteria. Int. J. Approx. Reason. (2000)
  • L.A. Kurgan et al. Knowledge discovery approach to automated cardiac SPECT diagnosis. Artif. Intell. Med. (2001)
  • C. Andrieu et al. Stability of stochastic approximation under verifiable conditions. SIAM J. Control Optim. (2005)
  • Bouckaert, R.R., 1995. Bayesian belief networks: From construction to inference. Ph.D. Thesis, University of...
  • H.F. Chen. Stochastic Approximation and its Applications (2002)
  • D.M. Chickering. Learning Bayesian networks is NP-complete
  • K.J. Cios et al. CLIP3: Cover learning using integer programming. Kybernetes (1997)
  • G.F. Cooper et al. A Bayesian method for the induction of probabilistic networks from data. Mach. Learn. (1992)
  • G.F. Cooper et al. Causal discovery from a mixture of experimental and observational data
  • B. Ellis et al. Learning causal Bayesian network structures from experimental data. J. Amer. Statist. Assoc. (2008)
  • Fayyad, U., Irani, K., 1993. Multi-interval discretization of continuous-valued attributes for classification learning....
  • N. Friedman et al. Using Bayesian networks to analyze expression data. J. Comput. Biol. (2000)
  • N. Friedman et al. Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Mach. Learn. (2003)
  • J. Geweke. Bayesian inference in econometric models using Monte Carlo integration. Econometrica (1989)
  • P. Giudici et al. Decomposable graphical Gaussian model determination. Biometrika (1999)
  • D. Heckerman et al. Learning Bayesian networks: The combination of knowledge and statistical data. Mach. Learn. (1995)
  • Herskovits, E., Cooper, G.F., 1990. Kutató: An entropy-driven system for the construction of probabilistic expert...