Neural Networks

Volume 94, October 2017, Pages 96-102

Limitations of shallow nets approximation

https://doi.org/10.1016/j.neunet.2017.06.016

Abstract

In this paper, we analyze the approximation abilities of shallow networks in reproducing kernel Hilbert spaces (RKHSs). We prove that there is a probability measure under which the achievable lower bound for approximation by shallow nets is attained, with high probability, by all functions in balls of a reproducing kernel Hilbert space; this differs from the classical minimax approximation error estimates. Together with existing approximation results for deep nets, this result reveals the limitations of shallow nets and provides a theoretical explanation of why deep nets perform better than shallow nets.

Introduction

Recent years have witnessed a tremendous growth of interest in deep nets, i.e., neural networks with more than one hidden layer. Applications include image classification (Krizhevsky, Sutskever, & Hinton, 2012), speech recognition (Lee, Pham, Largman, & Ng, 2010), manifold learning (Basri & Jacobs, 2016), and so on. All these applications demonstrate the advantage of deep nets over shallow nets, i.e., neural networks with one hidden layer. We refer the reader to Bengio (2009), Chui and Mhaskar (2016), Hinton, Osindero, and Teh (2006), LeCun (2014), Schmidhuber (2015) and references therein for more applications and details of deep nets.

The comparison of the performance of deep nets with that of shallow nets is a classical topic in approximation theory. Leaving aside the computational burden, deep nets offer roughly two advantages in approximation. The first, called expressivity (Raghu, Poole, Kleinberg, Ganguli, & Sohl-Dickstein, 2016), is that there are various functions expressible by deep nets that cannot be approximated by any shallow net with a comparable number of neurons. A typical example is that deep nets can provide localized approximation while shallow nets cannot (Chui, Li, & Mhaskar, 1994). The other advantage, proposed in Chui, Li, and Mhaskar (1996), is that deep nets can break through certain lower bounds on the approximation ability of shallow nets. In particular, utilizing the Kolmogorov superposition theorem, Maiorov and Pinkus (1999) proved that there exists a deep net with two hidden layers and finitely many neurons possessing the universal approximation property. In a nutshell, the first advantage shows that deep nets can approximate more functions than shallow nets, while the second implies that deep nets possess better approximation capability even for functions expressible by shallow nets.

Most of the recent studies on deep nets focus on expressivity (Delalleau & Bengio, 2011; Eldan & Shamir, 2015; Kürková & Sanguineti, 2013; Mhaskar et al., 2016; Montúfar et al., 2013; Raghu et al., 2016; Telgarsky, 2016). All these results present theoretical explanations for the excellent performance of deep nets in some difficult learning tasks. However, compared with the avid research activity on expressivity, the second advantage of deep nets has attracted much less attention. The main reason is the lack of comprehensive studies on the limitations of shallow nets, which makes it difficult to quantify the difference in approximation ability between deep and shallow nets. More precisely, the existing results concerning lower bounds for shallow nets approximation (Chui et al., 1996; Lin et al., 2011; Maiorov, 1999, 2003, 2005) were built in the minimax sense, by constructing some bad functions in a class of functions that attain the worst approximation rates. If the measure of the set of these bad functions is small, then the minimax lower bound can hardly reflect the limitations of shallow nets. In other words, the massiveness of the set of bad functions plays a crucial role in analyzing the limitations of shallow nets.

In this paper, we aim at deriving limitations of shallow nets by analyzing the massiveness of the bad functions in a reproducing kernel Hilbert space (RKHS). Motivated by Maiorov, Meir, and Ratsaby (1999), we utilize the Kolmogorov extension theorem to construct a probability measure under which, with high probability, all functions in the unit ball of the RKHS are bad, in the sense that the shallow-net approximation error for all these functions exceeds a specified value. Using classical results on polynomial approximation in RKHSs (Petrushev, 1999), we prove that this specified lower bound is achievable. With this, we derive the limitations of shallow nets in approximating functions in RKHSs, which, together with recent results on deep nets approximation (Ismailov, 2014), presents a theoretical explanation for the success of deep learning.

The rest of the paper is organized as follows. In Section 2, we give the main result of the paper, where optimal approximation rates of shallow nets are deduced in the probabilistic sense. In Section 3, we compare our result with related work. In Section 4, we present the construction of the probability measure by means of the Kolmogorov extension theorem. In Section 5, we prove the main result of this paper.

Section snippets

Main results

Let $d\ge 2$ and let
$$S_{\sigma,n}:=\Big\{\sum_{j=1}^{n}c_j\,\sigma(w_j\cdot x+\theta_j):\ c_j,\theta_j\in\mathbb{R},\ w_j\in\mathbb{R}^d\Big\}$$
be the set of shallow nets with activation function $\sigma$. In this paper, we focus on deriving lower bounds for a wider class of networks than $S_{\sigma,n}$. Define the manifold of ridge functions
$$\mathcal{N}_n:=\Big\{\sum_{i=1}^{n}a_i\,\phi_i(\xi_i\cdot x):\ \xi_i\in S^{d-1},\ \phi_i\in L^2([-1,1])\Big\},$$
where $S^{d-1}$ is the unit sphere in $\mathbb{R}^d$. It is easy to see that $S_{\sigma,n}\subseteq\mathcal{N}_n$, provided $\sigma\in L^2_{\mathrm{Loc}}(\mathbb{R})$, where $f\in L^2_{\mathrm{Loc}}(\mathbb{R})$ means that $f$ is square integrable on every closed and bounded set $A\subset\mathbb{R}$.
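To make the inclusion $S_{\sigma,n}\subseteq\mathcal{N}_n$ concrete, the following minimal sketch (ours, in Python/NumPy; the function names and the choice of a logistic activation are illustrative assumptions, not taken from the paper) evaluates a shallow net and re-expresses each term $c_j\,\sigma(w_j\cdot x+\theta_j)$ as a ridge function $\phi_j(\xi_j\cdot x)$ with $\xi_j=w_j/\|w_j\|$ and $\phi_j(t)=c_j\,\sigma(\|w_j\|t+\theta_j)$, assuming $w_j\neq 0$.

```python
import numpy as np

def sigma(t):
    """Logistic sigmoid; any activation in L^2_Loc(R) would do."""
    return 1.0 / (1.0 + np.exp(-t))

def shallow_net(x, c, W, theta):
    """Evaluate sum_j c_j * sigma(w_j . x + theta_j) at a single point x."""
    return sum(c_j * sigma(w_j @ x + t_j) for c_j, w_j, t_j in zip(c, W, theta))

def ridge_form(x, c, W, theta):
    """The same net written as sum_j phi_j(xi_j . x), where xi_j = w_j/||w_j||
    lies on S^{d-1} and phi_j(t) = c_j * sigma(||w_j|| t + theta_j)."""
    out = 0.0
    for c_j, w_j, t_j in zip(c, W, theta):
        r = np.linalg.norm(w_j)                  # assumes w_j != 0
        xi = w_j / r                             # direction on the unit sphere
        out += c_j * sigma(r * (xi @ x) + t_j)   # univariate profile applied to xi . x
    return out

rng = np.random.default_rng(0)
d, n = 3, 5
x = rng.standard_normal(d)
c, W, theta = rng.standard_normal(n), rng.standard_normal((n, d)), rng.standard_normal(n)
print(np.isclose(shallow_net(x, c, W, theta), ridge_form(x, c, W, theta)))  # True
```

Since $S_{\sigma,n}\subseteq\mathcal{N}_n$, any lower bound for approximation by $\mathcal{N}_n$ immediately yields a lower bound for approximation by shallow nets.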

Comparisons and related work

Approximation ability analysis for shallow nets is a long-standing and classical topic in approximation theory and neural networks. Approximation error estimates for various shallow nets have been established in Anastassiou (2011), Barron and Klusowski (2016), Costarelli (2015), Costarelli and Vinti (2016a), Costarelli and Vinti (2016b), Costarelli and Vinti (2016c), Gripenberg (2003), Hahm and Hong (2016), Iliev, Kyurkchiev, and Markov (2015), Ismailov (2014), Maiorov (1999) and Pinkus (1999) and

Construction of the probability measure

Let $u\in\mathbb{N}$ and
$$A_u(\Phi)=\{x\in\mathbb{R}^{\infty}:\ (x_1,\dots,x_u)\in\Phi\},\qquad \Phi\in\mathcal{B}(\mathbb{R}^u),$$
where $\mathcal{B}(\mathbb{R}^u)$ denotes the Borel $\sigma$-algebra of $\mathbb{R}^u$ (Shiryayev, 1984, p. 143). The following Kolmogorov extension theorem can be found in Shiryayev (1984, Theorem 4).

Lemma 4.1

Let $P_1,P_2,\dots$ be a sequence of probability measures on the measurable spaces $(\mathbb{R},\mathcal{B}(\mathbb{R})),(\mathbb{R}^2,\mathcal{B}(\mathbb{R}^2)),\dots$, possessing the consistency property
$$P_{u+1}(\Phi\times\mathbb{R})=P_u(\Phi),\qquad \text{for } u=1,2,\dots\ \text{and}\ \Phi\in\mathcal{B}(\mathbb{R}^u).$$
Then there is a unique probability measure $P$ on $(\mathbb{R}^{\infty},\mathcal{B}(\mathbb{R}^{\infty}))$ such that $P(A_u(\Phi))=P_u(\Phi)$ for all $u=1,2,\dots$ and $\Phi\in\mathcal{B}(\mathbb{R}^u)$.
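As a concrete and deliberately simple illustration of the consistency property in Lemma 4.1, one can take $P_u$ to be the $u$-fold product of a fixed probability measure on $\mathbb{R}$, e.g. the standard Gaussian; then $P_{u+1}(\Phi\times\mathbb{R})=P_u(\Phi)$ holds because the extra coordinate integrates to one. The Monte Carlo sketch below is ours; the product-Gaussian choice is only an illustration and is not the measure constructed in this section.

```python
import numpy as np

rng = np.random.default_rng(1)

def prob_box_product_gaussian(u, lo, hi, n_samples=200_000):
    """Monte Carlo estimate of P_u(Phi) for Phi = [lo_1,hi_1] x ... x [lo_u,hi_u],
    where P_u is the u-fold product of the standard Gaussian on R."""
    samples = rng.standard_normal((n_samples, u))
    inside = np.all((samples >= lo) & (samples <= hi), axis=1)
    return inside.mean()

# A box Phi in R^2 and its cylinder Phi x R in R^3.
lo, hi = np.array([-1.0, 0.0]), np.array([1.0, 2.0])
p_u = prob_box_product_gaussian(2, lo, hi)
# Appending (-inf, inf) for the extra coordinate realizes Phi x R.
p_u_plus_1 = prob_box_product_gaussian(3, np.append(lo, -np.inf), np.append(hi, np.inf))
print(p_u, p_u_plus_1)  # the two estimates agree up to Monte Carlo error
```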

Proofs

It can be found in Maiorov (1999) (or Petrushev, 1999) that there exists a constant $\bar c$ depending only on $d$ such that
$$\mathcal{P}_s^d\subseteq\mathcal{N}_n,\qquad\text{provided } n=\bar c\,s^{d-1},$$
where $\mathcal{P}_s^d$ denotes the set of algebraic polynomials of degree at most $s$ in $d$ variables. Hence, to prove Theorem 2.1, it suffices to prove
$$\mathrm{dist}(f,\mathcal{P}_s^d,\mathcal{H})^2\le\beta_s\alpha_s^{-1}R$$
and the lower bound of (2.7). Since $\beta_k\alpha_k^{-1}$ is decreasing with respect to $k$, for arbitrary $f\in\mathcal{H}_K^R$ we get
$$\mathrm{dist}(f,\mathcal{P}_s^d,\mathcal{H})^2=\sum_{k=s+1}^{\infty}\sum_{j\in T_k}\sum_{i=1}^{D_j^{d-1}}|\hat f_{k,j,i}|^2\,\alpha_k^{-1}\le\beta_s\alpha_s^{-1}\sum_{k=s+1}^{\infty}\sum_{j\in T_k}\sum_{i=1}^{D_j^{d-1}}|\hat f_{k,j,i}|^2\,\beta_k^{-1}\le\beta_s\alpha_s^{-1}R,$$
which verifies (5.2). Noticing (4.3), it then suffices to prove the lower bound of
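The tail estimate above admits a quick numerical sanity check. The sketch below is ours and rests on the reading used in the reconstruction: $f$ lies in the ball $\mathcal{H}_K^R=\{f:\sum_k\sum_{j,i}|\hat f_{k,j,i}|^2\beta_k^{-1}\le R\}$, the squared $\mathcal{H}$-distance to $\mathcal{P}_s^d$ collects the coefficients with $k>s$ weighted by $\alpha_k^{-1}$, and $\beta_k\alpha_k^{-1}$ is nonincreasing in $k$; the particular spectra $\alpha_k=k^2$, $\beta_k=k^{-2}$ and the single coefficient per index $k$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative spectra with beta_k * alpha_k^{-1} nonincreasing in k (assumption).
K, s, R = 200, 10, 1.0
k = np.arange(1, K + 1)
alpha = k ** 2.0          # weights entering the H-distance (illustrative)
beta = k ** -2.0          # kernel eigenvalues defining H_K (illustrative)

# Random coefficients scaled so that sum_k |f_hat_k|^2 / beta_k = R,
# i.e. f sits on the boundary of the ball H_K^R (one coefficient per k for simplicity).
f_hat = rng.standard_normal(K)
f_hat *= np.sqrt(R / np.sum(f_hat ** 2 / beta))

tail = np.sum(f_hat[s:] ** 2 / alpha[s:])    # dist(f, P_s^d, H)^2 under the stated reading
bound = (beta[s - 1] / alpha[s - 1]) * R     # beta_s * alpha_s^{-1} * R (1-based index s)
print(tail <= bound + 1e-12)                  # True
```

Under these assumptions the printed value is always True, mirroring the chain of inequalities that verifies (5.2).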

References (45)

  • Anastassiou, G. (2011). Intelligent systems: Approximation by artificial neural networks.
  • Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society.
  • Barron, A., & Klusowski, J. (2016). Uniform approximation by neural networks activated by first and second order ridge...
  • Basri, R., & Jacobs, D. (2016). Efficient representation of low-dimensional manifolds using deep networks, arXiv...
  • Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning.
  • Chui, C. K., Li, X., & Mhaskar, H. N. (1994). Neural networks for localized approximation. Mathematics of Computation.
  • Chui, C. K., Li, X., & Mhaskar, H. N. (1996). Limitations of the approximation capabilities of neural networks with one hidden layer. Advances in Computational Mathematics.
  • Chui, C. K., & Mhaskar, H. N. (2016). Deep nets for local manifold learning, arXiv preprint...
  • Costarelli, D., & Vinti, G. (2016). Approximation by max-product neural network operators of Kantorovich type. Results in Mathematics.
  • Delalleau, O., & Bengio, Y. (2011). Shallow vs. deep sum–product networks.
  • Eldan, R., & Shamir, O. (2015). The power of depth for feedforward neural networks, arXiv preprint...
  • Guliyev, N., et al. (2016). A single hidden layer feedforward network with only one neuron in the hidden layer can approximate any univariate function. Neural Computation.

The research was supported by the National Natural Science Foundation of China (Grant Nos. 61502342, 11401462).
