Neural Networks

Volume 8, Issue 9, 1995, Pages 1379-1408

Invited article
Information geometry of the EM and em algorithms for neural networks
Shun-ichi Amari

https://doi.org/10.1016/0893-6080(95)00003-8

Abstract

To realize an input-output relation given by noise-contaminated examples, it is effective to use a stochastic model of neural networks. When the model network includes hidden units whose activation values are neither specified nor observed, it is useful to estimate the hidden variables from the observed or specified input-output data on the basis of the stochastic model. Two algorithms, the EM and em algorithms, have so far been proposed for this purpose. The EM algorithm is an iterative statistical technique based on the conditional expectation, and the em algorithm is a geometrical one derived from information geometry. The em algorithm iteratively minimizes the Kullback-Leibler divergence in the manifold of neural networks. These two algorithms are equivalent in most cases. The present paper gives a unified information-geometrical framework for studying stochastic models of neural networks, focusing on the EM and em algorithms, and proves a condition that guarantees their equivalence. Examples include: (1) the stochastic multilayer perceptron, (2) mixtures of experts, and (3) the normal mixture model.
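
In schematic form, the alternating Kullback-Leibler minimization described above can be written as a pair of projections between the model manifold M (the parametrized network distributions p_theta) and the data manifold D (the distributions whose visible marginals agree with the observed data). The notation below is a standard rendering of this idea, not a formula quoted from the paper:

$$ q^{(t+1)} \;=\; \arg\min_{q \in D} \; \mathrm{KL}\big[\, q \,\big\|\, p_{\theta^{(t)}} \big] \qquad \text{(e-step: e-projection onto } D\text{)} $$

$$ \theta^{(t+1)} \;=\; \arg\min_{\theta} \; \mathrm{KL}\big[\, q^{(t+1)} \,\big\|\, p_{\theta} \big] \qquad \text{(m-step: m-projection onto } M\text{)} $$

where $\mathrm{KL}[q\,\|\,p] = \int q \log (q/p)$. The EM algorithm obtains its E-step by taking the conditional expectation of the hidden variables under $p_{\theta^{(t)}}$, which in many models coincides with the e-projection above.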
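
As a concrete illustration of example (3), the following is a minimal Python sketch of the textbook EM recursion for a two-component normal mixture, in which the discarded component labels play the role of the hidden variables. The component count, initial values, and synthetic data are illustrative assumptions made here, not details taken from the paper.

    # Minimal sketch: textbook EM for a two-component normal mixture.
    # Data, component count, and initial values are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic 1-D observations drawn from two Gaussians; the component
    # labels are discarded and act as the unobserved hidden variables.
    x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 200)])

    # Initial guesses for the mixing weights, means, and variances.
    w = np.array([0.5, 0.5])
    mu = np.array([-1.0, 1.0])
    var = np.array([1.0, 1.0])

    for _ in range(100):
        # E-step: posterior responsibility of each component for each point
        # (the conditional expectation of the hidden indicator variables).
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)

        # M-step: re-estimate the parameters from the expected sufficient
        # statistics, maximizing the expected complete-data log-likelihood.
        n_k = r.sum(axis=0)
        w = n_k / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n_k
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k

    print("weights:", w, "means:", mu, "variances:", var)

Each pass alternates the conditional-expectation step over the hidden labels with a re-estimation of the mixture parameters, mirroring the EM/em alternation described in the abstract.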

References (53)

  • S. Amari, The EM algorithm and information geometry in neural network learning, Neural Computation (1995)
  • S. Amari et al., Differential geometry in statistical inferences
  • S. Amari et al., Statistical inference under multiterminal rate restrictions—a differential geometrical approach, IEEE Transactions on Information Theory (1989)
  • S. Amari et al., Information geometry of estimating functions in semiparametric statistical models
  • S. Amari et al., Information geometry of Boltzmann machines, IEEE Transactions on Neural Networks (1992)
  • P. Baldi et al., Smooth on-line learning algorithm for hidden Markov models, Neural Computation (1994)
  • O.E. Barndorff-Nielsen, Information and exponential families in statistical theory (1978)
  • O.E. Barndorff-Nielsen, Parametric statistical model and likelihood
  • O.E. Barndorff-Nielsen et al., The role of differential geometry in statistical theory, International Statistical Review (1986)
  • O.E. Barndorff-Nielsen et al., Approximating exponential models, Annals of the Institute of Statistical Mathematics (1989)
  • L.E. Baum et al., A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, Annals of Mathematical Statistics (1970)
  • J. Besag et al., Spatial statistics and Bayesian computation, Journal of the Royal Statistical Society (1993)
  • W. Byrne, Alternating minimization and Boltzmann machine learning, IEEE Transactions on Neural Networks (1992)
  • B. Cheng et al., Neural networks—a review from statistical perspectives—comments and rejoinders, Statistical Science (1994)
  • N.N. Chentsov, Statistical decision rules and optimal inference (1972)
  • D.R. Cox et al., Theoretical statistics (1974)
Cited by (257)

  • Natural Reweighted Wake–Sleep, Neural Networks (2022)

    Citation excerpt: In their work the authors show conditions for the theoretical convergence of a modified version of the Wake–Sleep algorithm, identified as a variant of the geometric em algorithm. The convergence of the em algorithm and its relation to the Expectation–Maximization (EM) optimization process is known in the literature and has in particular been studied by Amari (1995) and Fujiwara and Amari (1995). Notice that the algorithm by Ikeda et al. uses the exact FIM, while in the present work we are employing an estimate of the gradients and of the FIM based on the minibatch.
