Limitations of shallow nets approximation
Introduction
Recent years have witnessed a tremendous growth of interest in deep nets, i.e., neural networks with more than one hidden layer. Applications include image classification (Krizhevsky, Sutskever, & Hinton, 2012), speech recognition (Lee, Pham, Largman, & Ng, 2010), manifold learning (Basri & Jacobs, 2016), and so on. All these applications demonstrate the excellent power of deep nets over shallow nets, i.e., neural networks with one hidden layer. We refer the reader to Bengio (2009), Chui and Mhaskar (2016), Hinton, Osindero, and Teh (2006), LeCun (2014), Schmidhuber (2015), and the references therein for more applications and details of deep nets.
The comparison of performance between deep nets and shallow nets is a classical topic in approximation theory. Setting aside the computational burden, deep nets offer roughly two advantages in approximation. The first, called expressivity (Raghu, Poole, Kleinberg, Ganguli, & Sohl-Dickstein, 2016), shows that various functions are expressible by deep nets but cannot be approximated by any shallow net with a comparable number of neurons. A typical example is that deep nets can provide localized approximation while shallow nets cannot (Chui, Li, & Mhaskar, 1994). The second, proposed in Chui, Li, and Mhaskar (1996), is that deep nets can break through certain lower bounds on approximation by shallow nets. In particular, utilizing the Kolmogorov superposition theorem, Maiorov and Pinkus (1999) proved that there exists a deep net with two hidden layers and finitely many neurons possessing the universal approximation property. In a nutshell, the first advantage shows that deep nets can approximate more functions than shallow nets, while the second implies that deep nets possess better approximation capability even for functions expressible by shallow nets.
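The localized-approximation phenomenon can be made concrete. The construction of Chui, Li, and Mhaskar (1994) uses general sigmoidal activations; the following ReLU bump is our own minimal illustration (the function names are ours, not the paper's): each 1D "tent" is itself a shallow ReLU net, and composing tents through one more ReLU layer yields a function supported only near the origin, something no fixed-size shallow net achieves exactly.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def hat(t):
    # One-dimensional "tent": a shallow ReLU net, equal to 1 at t = 0
    # and vanishing for |t| >= 1.
    return relu(t + 1) - 2 * relu(t) + relu(t - 1)

def bump(x, y):
    # Second hidden layer: combining two tents through one more ReLU
    # yields a bump supported only on the diamond |x| + |y| < 1.
    return relu(hat(x) + hat(y) - 1)

print(bump(0.0, 0.0))  # peak value 1.0 at the origin
print(bump(2.0, 2.0))  # exactly 0.0 away from the origin
```

The point of the sketch is structural: localization here comes from the *composition* of two ReLU layers, not from any single ridge function.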
Most of the recent studies on deep nets focus on expressivity (Delalleau & Bengio, 2011; Eldan & Shamir, 2015; Kürková & Sanguineti, 2013; Mhaskar et al., 2016; Montúfar et al., 2013; Raghu et al., 2016; Telgarsky, 2016). All these results present theoretical explanations for the excellent performance of deep nets in some difficult learning tasks. However, compared with the avid research on expressivity, the second advantage of deep nets has attracted much less attention. The main reason is the lack of comprehensive studies on the limitations of shallow nets, which makes it difficult to quantify the difference in approximation ability between deep and shallow nets. To be specific, the existing results (Chui et al., 1996; Lin et al., 2011; Maiorov, 1999, 2003, 2005) concerning lower bounds for shallow-net approximation were established in the minimax sense, by constructing some bad functions in a class of functions that attain the worst approximation rates. If the measure of the set of these bad functions is small, then the minimax lower bound can hardly reflect the limitations of shallow nets. In other words, the massiveness of the set of bad functions plays a crucial role in analyzing the limitations of shallow nets.
In this paper, we aim to derive limitations of shallow nets by analyzing the massiveness of the bad functions in a reproducing kernel Hilbert space (RKHS). Motivated by Maiorov, Meir, and Ratsaby (1999), we utilize the Kolmogorov extension of measure theorem to construct a probability measure under which all functions in the unit ball of an RKHS are bad, in the sense that the approximation rate of shallow nets for all these functions exceeds a specified value with high probability. Using classical results on polynomial approximation in RKHS (Petrushev, 1999), we prove that the aforementioned lower bound is achievable. With this, we derive the limitations of shallow nets in approximating functions in an RKHS, which, together with recent results on deep-net approximation (Ismailov, 2014), presents a theoretical explanation for the success of deep learning.
The rest of the paper is organized as follows. In Section 2, we give the main result of the paper, where optimal approximation rates of shallow nets are deduced in the probabilistic sense. In Section 3, we compare our result with related work. In Section 4, we present the construction of the probability measure by means of the Kolmogorov extension of measure theorem. In Section 5, we prove the main result of this paper.
Main results
Let and be the set of shallow nets with activation function . In this paper, we focus on deriving a lower bound for a wider range of shallow nets than . Define by a manifold of ridge functions, where is the unit sphere in . It is easy to see that , provided , where denotes that, for an arbitrary closed set in , is square integrable.
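Although the formulas in this snippet did not survive extraction, the object in question is the familiar one-hidden-layer model \(f(x)=\sum_{i=1}^{n} c_i\,\sigma(w_i\cdot x + b_i)\), a linear combination of ridge functions. As a generic numerical sketch (the activation, the target function, and the fitting procedure below are our illustrative choices, not the paper's), one can evaluate such a net and fit its outer coefficients by least squares:

```python
import numpy as np

def shallow_net(x, W, b, c, sigma=np.tanh):
    """Evaluate a one-hidden-layer net: f(x) = sum_i c_i * sigma(w_i . x + b_i)."""
    return sigma(x @ W.T + b) @ c

def fit_outer_weights(X, y, W, b, sigma=np.tanh):
    """Fit only the outer coefficients c by least squares (ridge directions fixed)."""
    H = sigma(X @ W.T + b)                 # hidden-layer feature matrix, shape (m, n)
    c, *_ = np.linalg.lstsq(H, y, rcond=None)
    return c

rng = np.random.default_rng(0)
d, n = 2, 50                               # input dimension, number of hidden neurons
X = rng.uniform(-1, 1, size=(500, d))      # training samples on [-1, 1]^2
y = np.sin(np.pi * X[:, 0]) * X[:, 1]      # smooth target function (illustrative)

W = rng.normal(size=(n, d))                # random ridge directions w_i
b = rng.normal(size=n)                     # random biases b_i
c = fit_outer_weights(X, y, W, b)
err = np.max(np.abs(shallow_net(X, W, b, c) - y))
print(f"max training error with {n} neurons: {err:.4f}")
```

The paper's lower bounds concern exactly how fast such errors can decay as the number of neurons \(n\) grows, uniformly over a ball of target functions.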
Comparisons and related work
Approximation ability analysis for shallow nets is a long-standing and classical topic in approximation theory and neural networks. Approximation error estimates for various shallow nets have been established in Anastassiou (2011), Barron and Klusowski (2016), Costarelli (2015), Costarelli and Vinti (2016a, 2016b, 2016c), Gripenberg (2003), Hahm and Hong (2016), Iliev, Kyurkchiev, and Markov (2015), Ismailov (2014), Maiorov (1999), and Pinkus (1999) and
Construction of the probability measure
Let and , where denotes the Borel σ-algebra of (Shiryayev, 1984, p. 143). The following Kolmogorov extension of measure theorem can be found in Shiryayev (1984, Theorem 4).
Lemma 4.1. Let , be a sequence of probability measures on the measure spaces , possessing the consistency property: Then there is a unique probability measure on such that
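The formulas in Lemma 4.1 did not survive extraction. For the reader's convenience, the standard statement of the Kolmogorov extension theorem, in the form given in Shiryayev (1984), reads as follows; the notation below is ours and may differ from the paper's:

```latex
\textbf{Lemma 4.1.} Let $\{\mu_n\}_{n\ge 1}$ be a sequence of probability
measures on the measurable spaces
$(\mathbb{R}^n,\mathscr{B}(\mathbb{R}^n))$, possessing the consistency
property
\[
  \mu_{n+1}\bigl(B \times \mathbb{R}\bigr) = \mu_n(B),
  \qquad B \in \mathscr{B}(\mathbb{R}^n),\; n \ge 1 .
\]
Then there is a unique probability measure $\mu$ on
$(\mathbb{R}^\infty,\mathscr{B}(\mathbb{R}^\infty))$ such that
\[
  \mu\bigl(\{x : (x_1,\dots,x_n) \in B\}\bigr) = \mu_n(B),
  \qquad B \in \mathscr{B}(\mathbb{R}^n),\; n \ge 1 .
\]
```

Consistency says each measure's projection onto the first $n$ coordinates recovers the previous one; the theorem then glues the whole family into a single measure on sequence space, which is what the construction in Section 4 uses.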
Proofs
It can be found in Maiorov (1999) (see also Petrushev, 1999) that there exists a constant depending only on such that , provided . Hence, to prove Theorem 2.1, it suffices to prove and the lower bound of (2.7). Since is decreasing with respect to , for arbitrary we get , which verifies (5.2). Noticing (4.3), it then suffices to prove the lower bound of
References (45)
Neural network operators: constructive interpolation of multivariate functions. Neural Networks (2015).
Max-product neural network and quasi-interpolation operators activated by sigmoidal functions. Journal of Approximation Theory (2016).
Pointwise and uniform approximation by multivariate neural network operators of the max-product type. Neural Networks (2016).
Approximation by neural network with a bounded number of nodes at each level. Journal of Approximation Theory (2003).
On the approximation by neural networks with bounded number of neurons in hidden layers. Journal of Mathematical Analysis and Applications (2014).
Essential rate for approximation by spherical neural networks. Neural Networks (2011).
On best approximation by ridge functions. Journal of Approximation Theory (1999).
On best approximation of classes by radial functions. Journal of Approximation Theory (2003).
Lower bounds for approximation by MLP neural networks. Neurocomputing (1999).
Deep learning in neural networks: an overview. Neural Networks (2015).
Intelligent systems: Approximation by artificial neural networks.
Theory of reproducing kernels. Transactions of the American Mathematical Society.
Learning deep architectures for AI. Foundations and Trends in Machine Learning.
Neural networks for localized approximation. Mathematics of Computation.
Limitations of the approximation capabilities of neural networks with one hidden layer. Advances in Computational Mathematics.
Approximation by max-product neural network operators of Kantorovich type. Results in Mathematics.
Shallow vs. deep sum–product networks.
A single hidden layer feedforward network with only one neuron in the hidden layer can approximate any univariate function. Neural Computation.