Neurocomputing

Volume 415, 20 November 2020, Pages 358-367

Towards the representational power of restricted Boltzmann machines

https://doi.org/10.1016/j.neucom.2020.07.090

Highlights

  • The number of hidden units of RBMs required to compute some distributions is studied.

  • We show how RBMs compute distributions that depend on scalar projections of inputs.

  • We study how to represent distributions as forms that depend on scalar projections.

  • A new proof for the universal approximation properties of RBMs is presented.

  • We show the equivalence of RBMs and 2-layer neural networks in expressive efficiency.

Abstract

The restricted Boltzmann machine (RBM), a graphical model for binary random variables, has proven to be a powerful tool in machine learning. However, theoretical foundations for understanding the approximation ability of RBMs are lacking. In this paper, we study the representational power of RBMs, focusing on the number of hidden units sufficient for an RBM to compute certain classes of distributions of interest with a fixed number of inputs. First, we constructively show how RBMs can approximate, to arbitrary accuracy, any distribution that depends on the scalar projection of the inputs onto a given vector. Then, for any given distribution, we explore how it can be represented in a form that depends on the scalar projections of the inputs onto some vectors, and study the properties of these vectors, from which a new proof of the universal approximation theorem for RBMs is deduced. Finally, we investigate the representational efficiency of RBMs by characterizing all the distributions that can be efficiently computed by RBMs. More specifically, it is shown that a distribution can be computed by a polynomial-size RBM with polynomially bounded parameters if and only if its mass can be computed by a two-layer feedforward network with threshold/ReLU activation functions whose size and parameters are polynomially bounded.

Introduction

The restricted Boltzmann machine (RBM) [1], [2], [3] is a parameterized generative model that simulates a probability distribution of binary data. In recent years, RBMs have been stacked to build multilayer learning architectures, such as the deep belief network (DBN) [4], which is widely regarded as one of the first effective deep learning systems, and the deep Boltzmann machine (DBM) [5], [6]. With the development of efficient learning strategies, the RBM, DBN and DBM have been successfully applied to dimensionality reduction [7], classification [8], [9], [10], feature extraction [11], [12], time-series modeling [13] and many other application domains. In short, the RBM has recently received a lot of attention in deep learning, both on its own and as the building block of DBNs and DBMs. However, the mathematical study of the representational power of RBMs remains a difficult problem. Research on this topic not only has theoretical significance but also provides guidance for applications in deep learning.

An RBM with n visible units and m hidden units (with hidden state vector $h \in \{0,1\}^m$), as shown in Fig. 1, generates a distribution over the visible (or input) units with state $x \in \{0,1\}^n$:

$$p(x) = \frac{1}{Z_{W,b,a}} \sum_{h \in \{0,1\}^m} \exp\left(x^\top W h + a^\top x + b^\top h\right), \qquad (1)$$

where $W \in \mathbb{R}^{n \times m}$ is the weight matrix between the visible units and the hidden units, $b \in \mathbb{R}^m$ and $a \in \mathbb{R}^n$ are the biases, and $Z_{W,b,a}$ is the corresponding normalization constant. The so-called representational power of RBMs refers to which probability distributions of the visible units can be expressed by an RBM and how they are expressed. It was shown in [14] that, due to the intractability of the RBM's partition function, even with given parameters there is still no polynomial-time algorithm to estimate the marginal probabilities of the visible units of an RBM.
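To make the definition concrete, the following minimal sketch (in Python, with toy sizes and random parameters chosen purely for illustration, not taken from the paper) computes the distribution of Eq. (1) by brute force, summing over all hidden states and normalizing over all visible states. This is only feasible for very small n and m, which is precisely the intractability issue mentioned above.

```python
# Brute-force evaluation of the RBM distribution in Eq. (1).
# Toy sizes and random parameters; only tractable for tiny n and m.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3                      # visible and hidden units (illustrative)
W = rng.normal(size=(n, m))      # weights between visible and hidden units
a = rng.normal(size=n)           # visible biases
b = rng.normal(size=m)           # hidden biases

def unnormalized_p(x):
    """Sum exp(x^T W h + a^T x + b^T h) over all hidden states h in {0,1}^m."""
    return sum(
        np.exp(x @ W @ h + a @ x + b @ h)
        for h in map(np.array, itertools.product([0, 1], repeat=m))
    )

xs = [np.array(x) for x in itertools.product([0, 1], repeat=n)]
Z = sum(unnormalized_p(x) for x in xs)          # partition function Z_{W,b,a}
p = {tuple(x): unnormalized_p(x) / Z for x in xs}
assert abs(sum(p.values()) - 1.0) < 1e-10       # p is a distribution on {0,1}^n
```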

As shown by Roux and Bengio [15], any distribution on $\{0,1\}^n$ can be closely approximated by an RBM of exponential size, namely with $2^n + 1$ hidden units; this bound was improved to $2^{n-1} - 1$ by Montúfar and Ay [16]. Roux and Bengio constructively built an RBM with as many hidden units as the number of input vectors whose probability is strictly positive [15]. It works by instantiating, for each value of $x \in \{0,1\}^n$ that has support, a single hidden unit which turns on only for that particular value (with overwhelming probability), so that the corresponding probability mass can be set individually by manipulating that unit's bias parameter. This construction was then improved in [16] by using a single hidden unit to assign probability mass to a pair of visible vectors differing in only one entry. Next, Montúfar and Morton generalized this improvement to discrete RBMs and proved the universal approximation property of discrete RBMs [17]. Additionally, in [18], the upper bound on the size of an RBM that is a universal approximator was improved to $[2(\log n + 1)/(n+1)]\,2^{n} - 1$; however, this still requires $m$ to grow exponentially fast in $n$. In fact, it is shown in [19] that the dimension of the set of distributions realized by RBMs equals $\min\{2^n - 1,\, mn + m + n\}$, which implies the existence of distributions that cannot be computed by RBMs whose size is sub-exponential in the number of visible units $n$. However, large models, or models whose parameters are large in magnitude, are impractical and tend to have poor generalization ability [20]. Thus, instead of approximating all distributions, investigating how many hidden units and how large parameters (in magnitude) are required to approximate certain classes of interesting distributions with high accuracy deserves particular attention in practical applications.
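The following hedged sketch illustrates the single-hidden-unit idea behind this kind of construction. The particular parameterization below ($w = c(2x^* - 1)$ with bias $-c(\|x^*\|_1 - 1/2)$) is an illustrative choice, not necessarily the exact one used in [15]: for a large scale $c$, the unit's softplus contribution to the log unnormalized mass is essentially zero on every input except the chosen vector $x^*$, so the mass at $x^*$ can be steered independently of all other inputs.

```python
# Illustrative (not the paper's exact) parameterization of a hidden unit
# whose contribution log(1 + exp(w^T x + b)) is ~0 everywhere except at x*.
import itertools
import numpy as np

n = 4
x_star = np.array([1, 0, 1, 1])            # the input vector this unit "selects"
c = 30.0                                   # large scale -> sharper selection
w = c * (2 * x_star - 1)
b = -c * (x_star.sum() - 0.5)

def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return np.log1p(np.exp(-abs(t))) + max(t, 0.0)

for x in map(np.array, itertools.product([0, 1], repeat=n)):
    contrib = softplus(w @ x + b)
    if not np.array_equal(x, x_star):
        assert contrib < 1e-6              # negligible effect on other vectors
print("contribution at x*:", softplus(w @ x_star + b))   # roughly c/2 = 15
```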

By generalizing the construction used in [15], [16], Montúfar et al. let each hidden unit turn on not just for a single x but for a “cubical set” of possible x’s, and showed that any k-component mixture of product distributions with support on arbitrary but disjoint faces of the n-cube can be approximated arbitrarily well by an RBM with k hidden units [21]. Unfortunately, as discussed in [20], families of these specialized distributions (for k bounded by a polynomial in n) constitute only a very limited subset of all the distributions that have some kind of meaningful structure.

Martens et al. [20] investigated the representational power of RBMs. They first developed two easier-to-analyze approximations of the unnormalized log marginal probability of RBMs, called softplus RBM networks and hardplus RBM networks, which are two-layer feedforward neural networks, and established simulation results relating softplus RBM networks to hardplus RBM networks. Using these tools, they showed that any distribution that depends on the number of 1’s in the inputs can be approximated by an RBM of size $n^2$. This result was improved in our previous work [22], which reduced the required size from $n^2$ to $2n + 1$. In that work, the representational power of softplus RBM networks was also proven to be equal to that of depth-2 threshold circuits in efficiently computing Boolean functions.

However, the research of [20], [22] on the representational ability of RBMs only involved distributions that depend on the number of 1’s in the inputs and distributions whose mass represents a Boolean function. In this paper, by generalizing the methods used in our previous work [22], we consider more general distributions: distributions that depend on the scalar projection of the inputs onto some vectors, and distributions that can be efficiently computed by RBMs. Here, a distribution is said to be computed efficiently by a network if it can be computed by a network whose size and parameter magnitudes are bounded by a polynomial in $n$. We study whether certain classes of distributions of interest can be efficiently computed by RBMs and how such distributions can be characterized, and we give an equivalent characterization of the representational efficiency of RBMs in terms of two-layer threshold/ReLU feedforward networks. The main contributions of this paper are as follows:

  • For any distribution that depends on the scalar projection of the inputs onto a given vector, we constructively show how RBMs can approximate it to arbitrary accuracy (Section 3.1); a toy numerical illustration of such distributions is given after this list;

  • Given a distribution, we explore how it can be represented in a form that depends on the scalar projections of the inputs onto some vectors, and then study the properties of these vectors (Section 3.2.1), from which a new proof of the universal approximation property of RBMs is deduced (Section 3.2.2).

  • It is shown that a distribution can be computed by an RBM whose size and parameters are bounded by polynomials in the number of input units if and only if its mass can be computed by a polynomial-size two-layer threshold/ReLU feedforward network with polynomially bounded parameters (Section 3.2.3).
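To make the first notion concrete, the toy sketch below builds a distribution on $\{0,1\}^n$ whose mass at $x$ is a function $g$ of the scalar projection $w^\top x$ alone. The specific $g$ and $w$ are arbitrary illustrative choices; the special case $w = (1, \ldots, 1)$ recovers the distributions that depend on the number of 1’s in the input studied in [20], [22].

```python
# A distribution that depends only on the scalar projection w^T x:
# its mass at x is g(w^T x) up to normalization. g and w are illustrative.
import itertools
import numpy as np

n = 4
w = np.array([0.5, -1.0, 2.0, 1.0])
g = lambda t: np.exp(-(t - 1.0) ** 2)      # any fixed function of the projection

xs = [np.array(x) for x in itertools.product([0, 1], repeat=n)]
mass = np.array([g(w @ x) for x in xs])
p = mass / mass.sum()

# Inputs with the same projection w^T x necessarily receive the same probability.
proj = [round(float(w @ x), 12) for x in xs]
for t in set(proj):
    probs = {p[i] for i, s in enumerate(proj) if s == t}
    assert len(probs) == 1
```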

As a preliminary, we briefly introduce related work on RBMs in Section 2. The main results of the paper are presented in Section 3, and all detailed proofs are given in Section 4. In Section 5 we offer a discussion and an outlook.


RBM networks

According to (1), the probability distribution of the visible units of an RBM can be deduced as

$$p(x) = \frac{1}{Z_{W,b,a}} \exp\Big(a^\top x + \sum_{j=1}^{m} \log\big(1 + e^{w_j^\top x + b_j}\big)\Big), \qquad x \in \{0,1\}^n, \qquad (2)$$

where $W = (w_1, \ldots, w_m) \in \mathbb{R}^{n \times m}$, $b = (b_j) \in \mathbb{R}^m$ and $a = (a_i) \in \mathbb{R}^n$. Due to the intractability of the partition term in the definition (2), a special two-layer feedforward neural network was developed as a tool for studying $p(x)$ [20]. The network defined by

$$S_{W,b,c}(x) \triangleq \sum_{j=1}^{m} \mathrm{soft}(w_j^\top x + b_j) + c, \qquad x \in \{0,1\}^n,$$

is called a softplus RBM network, where $\mathrm{soft}(x) \triangleq \log(1 + e^x)$ is the softplus function.
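As a sanity check on the identity underlying (2), the short sketch below (toy sizes and random parameters, purely illustrative) verifies numerically that summing the RBM's unnormalized probability in (1) over the hidden states yields the softplus form $a^\top x + \sum_j \mathrm{soft}(w_j^\top x + b_j)$ for the log unnormalized marginal.

```python
# Numerical check: sum over h of exp(x^T W h + a^T x + b^T h) equals
# exp(a^T x + sum_j log(1 + exp(w_j^T x + b_j))), i.e. the form in Eq. (2).
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 3
W = rng.normal(size=(n, m))
a = rng.normal(size=n)
b = rng.normal(size=m)

def log_unnormalized_marginal(x):
    # direct sum over hidden states h in {0,1}^m, as in Eq. (1)
    return np.log(sum(
        np.exp(x @ W @ h + a @ x + b @ h)
        for h in map(np.array, itertools.product([0, 1], repeat=m))
    ))

def softplus_form(x):
    # closed form: a^T x + sum_j log(1 + exp(w_j^T x + b_j))
    return a @ x + np.sum(np.log1p(np.exp(W.T @ x + b)))

for x in map(np.array, itertools.product([0, 1], repeat=n)):
    assert np.isclose(log_unnormalized_marginal(x), softplus_form(x))
```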

Main results

In this section, we present the main results of this paper. First, in Section 3.1, we explore the representational ability of RBM networks to compute functions that depend on the scalar projection of the inputs onto a given vector. Then, in Section 3.2.1, we investigate whether a given function can be represented in a form that depends on the scalar projections of the inputs onto some vectors and, if so, what these vectors look like. On this basis, the universal approximation

Proof of Theorem 3.1

We prove Theorem 3.1 in the following steps:

  • (1)

    if $f$ is convex, then it can be computed by a hardplus RBM network of size $N_w - 1$. This result will be given in Lemma 4.2;

  • (2)

    otherwise, we subtract from $g$ a concave function $y$ with a sufficiently small second-order difference quotient such that $h \triangleq g - y$ is convex. The function $y$ can be chosen as a quadratic function of the following form:

$$y(x) \triangleq \beta X_w^2 - 2\beta X_0 X_w, \quad \text{with } \beta < 0,$$

which can be computed by a hardplus RBM network of size $n + 1$, as shown in Lemma 4.3. Therefore, f(x …

Conclusions

Even though RBMs are promising for practical applications in deep learning, the theoretical foundations for understanding the expressive power of RBMs have only begun to be studied. This paper further explores the representational power of RBMs. In particular, our focus lies on the sufficient number of hidden units of RBMs required to compute some classes of distributions of interest, with a fixed number of inputs. Specifically, we focus on the distributions that depend on the scalar projection

CRediT authorship contribution statement

Linyan Gu: Conceptualization, Methodology, Writing - original draft. Feng Zhou: Writing - review & editing, Funding acquisition. Lihua Yang: Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was partially supported by Shenzhen Key Laboratory for Exascale Engineering and Scientific Computing under Grant No. ZDSYS201703031711426, the China Postdoctoral Science Foundation funded project under Grant No. 2019M66318, the National Natural Science Foundation of China under Grant Nos. 11901113 and 11771458, the Guangzhou Science and Technology Plan Project under Grant No. 201904010225, Guangdong Province Key Laboratory of Computational Science at Sun Yat-sen University under


References (33)

  • N. Srivastava et al., Multimodal learning with deep Boltzmann machines, J. Mach. Learn. Res. (2014)
  • A. Mohamed et al., Acoustic modeling using deep belief networks, Trans. Audio Speech Lang. Proc. (2012)
  • H. Larochelle, Y. Bengio, Classification using discriminative restricted Boltzmann machines, in: Proceedings of the...
  • P. Romeu et al., Time-series forecasting of indoor temperature using pre-trained deep neural networks
  • P.M. Long, R.A. Servedio, Restricted Boltzmann machines are hard to approximately evaluate or simulate, in:...
  • N.L. Roux et al., Representational power of restricted Boltzmann machines and deep belief networks, Neural Comput. (2008)

Linyan Gu received the B.S. and Ph.D. degrees in the School of Mathematics from Sun Yat-sen University, China, in 2013 and 2019, respectively. As a visiting student, she studied in the Department of Computer Science, University of Colorado Boulder, in 2017. She is currently working as a postdoctoral scholar in Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China. Her current research interests include the theory and applications of deep learning and neural networks.

Feng Zhou received the B.S. degree in Information Computing Science from Minnan Normal University, China, in 2010 and the Ph.D. degree in Computational Mathematics from Sun Yat-sen University, China, in 2015. As a visiting student, he studied in the School of Mathematics, Georgia Institute of Technology from Sep. 2013 to Sep. 2014. He is now an Associate Professor at the School of Information, Guangdong University of Finance and Economics. His research interests include signal analysis, machine learning and ensemble learning.

Lihua Yang received the B.S., M.S. and Ph.D. degrees in mathematics from Hunan Normal University, Beijing Normal University, and Sun Yat-sen University in 1984, 1987, and 1995, respectively. He is now a professor at the School of Mathematics, Sun Yat-sen University, China. From 1996 to 1998, he worked as a postdoctoral fellow in the Institute of Mathematics, Academia Sinica, China. He has been actively involved in many international conferences as an organizing committee and/or technical program committee member. His current research interests include signal processing and machine learning. He has published more than 100 papers, 2 books and 3 translations. He has supervised 20 Ph.D. candidates. He is a senior member of the IEEE.

