Limitations of shallow nets approximation
Introduction
Recent years have witnessed a tremendous growth of interest in deep nets, i.e., neural networks with more than one hidden layer. Applications include image classification (Krizhevsky, Sutskever, & Hinton, 2012), speech recognition (Lee, Pham, Largman, & Ng, 2010), manifold learning (Basri & Jacobs, 2016), and so on. All these applications demonstrate the excellent power of deep nets over shallow nets, i.e., neural networks with one hidden layer. We refer the reader to Bengio (2009), Chui and Mhaskar (2016), Hinton, Osindero, and Teh (2006), LeCun (2014), Schmidhuber (2015), and the references therein for more applications and details of deep nets.
The comparison of performance between deep nets and shallow nets is a classical topic in approximation theory. Setting aside the computational burden, deep nets offer roughly two advantages in approximation. The first, called expressivity (Raghu, Poole, Kleinberg, Ganguli, & Sohl-Dickstein, 2016), shows that various functions are expressible by deep nets but cannot be approximated by any shallow net with a comparable number of neurons. A typical example is that deep nets can provide localized approximation while shallow nets cannot (Chui, Li, & Mhaskar, 1994). The second, proposed in Chui, Li, and Mhaskar (1996), is that deep nets can break through certain lower bounds on approximation by shallow nets. In particular, utilizing the Kolmogorov superposition theorem, Maiorov and Pinkus (1999) proved that there exists a deep net with two hidden layers and finitely many neurons possessing the universal approximation property. In a nutshell, the first advantage shows that deep nets can approximate more functions than shallow nets, while the second implies that deep nets possess better approximation capability even for functions expressible by shallow nets.
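The localized-approximation phenomenon can be made concrete. The construction of Chui, Li, and Mhaskar (1994) uses general sigmoidal activations; the following ReLU bump is our own minimal illustration (the function names are ours, not the paper's): each 1D "tent" is itself a shallow ReLU net, and composing tents through one more ReLU layer yields a function supported only near the origin, something no fixed-size shallow net achieves exactly.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def hat(t):
    # One-dimensional "tent": a shallow ReLU net, equal to 1 at t = 0
    # and vanishing for |t| >= 1.
    return relu(t + 1) - 2 * relu(t) + relu(t - 1)

def bump(x, y):
    # Second hidden layer: combining two tents through one more ReLU
    # yields a bump supported only on the diamond |x| + |y| < 1.
    return relu(hat(x) + hat(y) - 1)

print(bump(0.0, 0.0))  # peak value 1.0 at the origin
print(bump(2.0, 2.0))  # exactly 0.0 away from the origin
```

The point of the sketch is structural: localization here comes from the *composition* of two ReLU layers, not from any single ridge function.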
Most of the recent studies on deep nets focus on expressivity (Delalleau & Bengio, 2011; Eldan & Shamir, 2015; Kürková & Sanguineti, 2013; Mhaskar et al., 2016; Montúfar et al., 2013; Raghu et al., 2016; Telgarsky, 2016). All these results present theoretical explanations for the excellent performance of deep nets in some difficult learning tasks. However, compared with the avid research on expressivity, the second advantage of deep nets has attracted much less attention. The main reason is the lack of comprehensive studies on the limitations of shallow nets, which makes it difficult to quantify the difference in approximation ability between deep and shallow nets. To be specific, the existing results (Chui et al., 1996; Lin et al., 2011; Maiorov, 1999, 2003, 2005) concerning lower bounds for shallow-net approximation were established in the minimax sense, by constructing some bad functions in a class of functions that attain the worst approximation rates. If the measure of the set of these bad functions is small, then the minimax lower bound can hardly reflect the limitations of shallow nets. In other words, the massiveness of the set of bad functions plays a crucial role in analyzing the limitations of shallow nets.
In this paper, we aim to derive limitations of shallow nets by analyzing the massiveness of the bad functions in a reproducing kernel Hilbert space (RKHS). Motivated by Maiorov, Meir, and Ratsaby (1999), we utilize the Kolmogorov extension of measure theorem to construct a probability measure under which all functions in the unit ball of an RKHS are bad, in the sense that the approximation rate of shallow nets for all these functions exceeds a specified value with high probability. Using classical results on polynomial approximation in RKHS (Petrushev, 1999), we prove that the aforementioned lower bound is achievable. With this, we derive the limitations of shallow nets in approximating functions in an RKHS, which, together with recent results on deep-net approximation (Ismailov, 2014), presents a theoretical explanation for the success of deep learning.
The rest of the paper is organized as follows. In Section 2, we give the main result of the paper, where optimal approximation rates of shallow nets are deduced in the probabilistic sense. In Section 3, we compare our result with related work. In Section 4, we present the construction of the probability measure by means of the Kolmogorov extension of measure theorem. In Section 5, we prove the main result of this paper.
Main results
Let and be the set of shallow nets with activation function . In this paper, we focus on deriving a lower bound for a wider range of shallow nets than . Define by a manifold of ridge functions, where is the unit sphere in . It is easy to see that , provided , where denotes that, for an arbitrary closed set in , is square integrable.
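Although the formulas in this snippet did not survive extraction, the object in question is the familiar one-hidden-layer model \(f(x)=\sum_{i=1}^{n} c_i\,\sigma(w_i\cdot x + b_i)\), a linear combination of ridge functions. As a generic numerical sketch (the activation, the target function, and the fitting procedure below are our illustrative choices, not the paper's), one can evaluate such a net and fit its outer coefficients by least squares:

```python
import numpy as np

def shallow_net(x, W, b, c, sigma=np.tanh):
    """Evaluate a one-hidden-layer net: f(x) = sum_i c_i * sigma(w_i . x + b_i)."""
    return sigma(x @ W.T + b) @ c

def fit_outer_weights(X, y, W, b, sigma=np.tanh):
    """Fit only the outer coefficients c by least squares (ridge directions fixed)."""
    H = sigma(X @ W.T + b)                 # hidden-layer feature matrix, shape (m, n)
    c, *_ = np.linalg.lstsq(H, y, rcond=None)
    return c

rng = np.random.default_rng(0)
d, n = 2, 50                               # input dimension, number of hidden neurons
X = rng.uniform(-1, 1, size=(500, d))      # training samples on [-1, 1]^2
y = np.sin(np.pi * X[:, 0]) * X[:, 1]      # smooth target function (illustrative)

W = rng.normal(size=(n, d))                # random ridge directions w_i
b = rng.normal(size=n)                     # random biases b_i
c = fit_outer_weights(X, y, W, b)
err = np.max(np.abs(shallow_net(X, W, b, c) - y))
print(f"max training error with {n} neurons: {err:.4f}")
```

The paper's lower bounds concern exactly how fast such errors can decay as the number of neurons \(n\) grows, uniformly over a ball of target functions.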
Comparisons and related work
Approximation ability analysis for shallow nets is a long-standing and classical topic in approximation theory and neural networks. Approximation error estimates for various shallow nets have been established in Anastassiou (2011), Barron and Klusowski (2016), Costarelli (2015), Costarelli and Vinti (2016a, 2016b, 2016c), Gripenberg (2003), Hahm and Hong (2016), Iliev, Kyurkchiev, and Markov (2015), Ismailov (2014), Maiorov (1999), and Pinkus (1999) and
Construction of the probability measure
Let and , where denotes the Borel σ-algebra of (Shiryayev, 1984, p. 143). The following Kolmogorov extension of measure theorem can be found in Shiryayev (1984, Theorem 4).
Lemma 4.1. Let , be a sequence of probability measures on the measure spaces , possessing the consistency property: Then there is a unique probability measure on such that
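The formulas in Lemma 4.1 did not survive extraction. For the reader's convenience, the standard statement of the Kolmogorov extension theorem, in the form given in Shiryayev (1984), reads as follows; the notation below is ours and may differ from the paper's:

```latex
\textbf{Lemma 4.1.} Let $\{\mu_n\}_{n\ge 1}$ be a sequence of probability
measures on the measurable spaces
$(\mathbb{R}^n,\mathscr{B}(\mathbb{R}^n))$, possessing the consistency
property
\[
  \mu_{n+1}\bigl(B \times \mathbb{R}\bigr) = \mu_n(B),
  \qquad B \in \mathscr{B}(\mathbb{R}^n),\; n \ge 1 .
\]
Then there is a unique probability measure $\mu$ on
$(\mathbb{R}^\infty,\mathscr{B}(\mathbb{R}^\infty))$ such that
\[
  \mu\bigl(\{x : (x_1,\dots,x_n) \in B\}\bigr) = \mu_n(B),
  \qquad B \in \mathscr{B}(\mathbb{R}^n),\; n \ge 1 .
\]
```

Consistency says each measure's projection onto the first $n$ coordinates recovers the previous one; the theorem then glues the whole family into a single measure on sequence space, which is what the construction in Section 4 uses.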
Proofs
It can be found in Maiorov (1999) (see also Petrushev, 1999) that there exists a constant depending only on such that , provided . Hence, to prove Theorem 2.1, it suffices to prove and the lower bound of (2.7). Since is decreasing with respect to , for arbitrary we get , which verifies (5.2). Noticing (4.3), it then suffices to prove the lower bound of
References (45)
Neural network operators: constructive interpolation of multivariate functions. Neural Networks (2015).
Max-product neural network and quasi-interpolation operators activated by sigmoidal functions. Journal of Approximation Theory (2016).
Pointwise and uniform approximation by multivariate neural network operators of the max-product type. Neural Networks (2016).
Approximation by neural network with a bounded number of nodes at each level. Journal of Approximation Theory (2003).
On the approximation by neural networks with bounded number of neurons in hidden layers. Journal of Mathematical Analysis and Applications (2014).
Essential rate for approximation by spherical neural networks. Neural Networks (2011).
On best approximation by ridge functions. Journal of Approximation Theory (1999).
On best approximation of classes by radial functions. Journal of Approximation Theory (2003).
Lower bounds for approximation by MLP neural networks. Neurocomputing (1999).
Deep learning in neural networks: an overview. Neural Networks (2015).
Intelligent systems: Approximation by artificial neural networks.
Theory of reproducing kernels. Transactions of the American Mathematical Society.
Learning deep architectures for AI. Foundations and Trends in Machine Learning.
Neural networks for localized approximation. Mathematics of Computation.
Limitations of the approximation capabilities of neural networks with one hidden layer. Advances in Computational Mathematics.
Approximation by max-product neural network operators of Kantorovich type. Results in Mathematics.
Shallow vs. deep sum–product networks.
A single hidden layer feedforward network with only one neuron in the hidden layer can approximate any univariate function. Neural Computation.