
Global Convergence of Sobolev Training for Overparameterized Neural Networks

Conference paper in: Machine Learning, Optimization, and Data Science (LOD 2020)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12565)


Abstract

The Sobolev loss is used when training a network to approximate the values and derivatives of a target function at a prescribed set of input points. Recent works have demonstrated its successful application to various tasks such as distillation and synthetic gradient prediction. In this work we prove that an overparameterized two-layer ReLU neural network trained on the Sobolev loss with gradient flow from random initialization can fit any given function values and any given directional derivatives, under a separation condition on the input data.
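To fix ideas, the following minimal NumPy sketch shows a Sobolev objective of the kind considered here for a width-\(m\) two-layer ReLU network \(f(W,x) = \tfrac{1}{\sqrt{m}} \sum _{r=1}^m a_r \sigma (w_r^T x)\), penalizing both value errors and errors in \(k\) prescribed directional derivatives per input point. The parameterization is consistent with the notation of the appendices, but the code and all variable names are illustrative rather than the authors' implementation.

```python
import numpy as np

def sobolev_loss(W, a, X, V, y, Z):
    """Sobolev loss for f(W, x) = (1/sqrt(m)) * sum_r a_r * relu(w_r^T x).

    W: (m, d) hidden weights      a: (m,) fixed output weights in {-1, +1}
    X: (n, d) unit-norm inputs    V: (n, d, k) directions v_{i,j}
    y: (n,) target values         Z: (n, k) target directional derivatives
    """
    m = W.shape[0]
    pre = X @ W.T                        # (n, m) pre-activations w_r^T x_i
    act = np.maximum(pre, 0.0)           # ReLU values
    der = (pre > 0).astype(float)        # ReLU derivative sigma'(w_r^T x_i)
    f = act @ a / np.sqrt(m)             # (n,) network outputs
    # directional derivatives D_{v_{i,j}} f(x_i) = (1/sqrt(m)) sum_r a_r sigma'(w_r^T x_i) w_r^T v_{i,j}
    WV = np.einsum('md,ndk->nmk', W, V)
    F = np.einsum('nm,m,nmk->nk', der, a, WV) / np.sqrt(m)
    return 0.5 * np.sum((f - y) ** 2) + 0.5 * np.sum((F - Z) ** 2)
```

Training with gradient flow from random initialization, as analyzed in the paper, corresponds to following the negative gradient of this loss with respect to \(W\).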


Notes

  1. Notice that the introduction of the constants \(\alpha \) and \(\beta \) does not change the expressivity of the network.

References

  1. Allen-Zhu, Z., Li, Y., Song, Z.: A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962 (2018)

  2. Arora, S., Du, S.S., Hu, W., Li, Z., Wang, R.: Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584 (2019)

  3. Bietti, A., Mairal, J.: On the inductive bias of neural tangent kernels. In: Advances in Neural Information Processing Systems. pp. 12873–12884 (2019)


  4. Chizat, L., Oyallon, E., Bach, F.: On lazy training in differentiable programming. arXiv preprint arXiv:1812.07956 (2018)

  5. Czarnecki, W.M., Osindero, S., Jaderberg, M., Swirszcz, G., Pascanu, R.: Sobolev training for neural networks. In: Advances in Neural Information Processing Systems, pp. 4278–4287 (2017)


  6. Du, S.S., Zhai, X., Poczos, B., Singh, A.: Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054 (2018)

  7. Gühring, I., Kutyniok, G., Petersen, P.: Error bounds for approximations with deep ReLU neural networks in \(W^{s, p}\) norms. arXiv preprint arXiv:1902.07896 (2019)

  8. Günther, M., Klotz, L.: Schur’s theorem for a block Hadamard product. Linear Algebra Appl. 437(3), 948–956 (2012)


  9. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4(2), 251–257 (1991)


  10. Jacot, A., Gabriel, F., Hongler, C.: Neural tangent kernel: convergence and generalization in neural networks. In: Advances in Neural Information Processing Systems, pp. 8571–8580 (2018)


  11. Laub, A.J.: Matrix Analysis for Scientists and Engineers, vol. 91. SIAM (2005)


  12. Oymak, S., Soltanolkotabi, M.: Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674 (2019)

  13. Simard, P., Victorri, B., LeCun, Y., Denker, J.: Tangent prop-a formalism for specifying selected invariances in an adaptive network. In: Advances in Neural Information Processing Systems, pp. 895–903 (1992)


  14. Srinivas, S., Fleuret, F.: Knowledge transfer with Jacobian matching. arXiv preprint arXiv:1803.00443 (2018)

  15. Tropp, J.A., et al.: An introduction to matrix concentration inequalities. Found. Trends® Mach. Learn. 8(1–2), 1–230 (2015)


  16. Vlassis, N., Ma, R., Sun, W.: Geometric deep learning for computational mechanics part i: anisotropic hyperelasticity. arXiv preprint arXiv:2001.04292 (2020)

  17. Weinan, E., Ma, C., Wu, L.: A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics. Sci. China Math. 63, 1235–1258 (2020). https://doi.org/10.1007/s11425-019-1628-5


  18. Zagoruyko, S., Komodakis, N.: Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928 (2016)

  19. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)

  20. Zou, D., Cao, Y., Zhou, D., Gu, Q.: Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888 (2018)

  21. Zou, D., Gu, Q.: An improved analysis of training over-parameterized deep neural networks. In: Advances in Neural Information Processing Systems, pp. 2053–2062 (2019)



Acknowledgements

PH is supported in part by NSF CAREER Grant DMS-1848087.

Author information

Correspondence to Jorio Cocola.


Appendices

A Supplementary proofs for Sect. 3.1

In this section we provide the remaining proofs of the results in Sect. 3.1. We begin by recalling the following matrix Chernoff inequality (see, for example, [15, Theorem 5.1.1]).

Theorem 3

(Matrix Chernoff). Consider a finite sequence \(X_k\) of \(p \times p\) independent, random, Hermitian matrices with \(0 \preceq X_k \preceq L I\). Let \(X = \textstyle {\sum }_k X_k\); then for all \(\epsilon \in [0,1)\)

$$\begin{aligned} \mathbb {P}\Big [\lambda _{\text {min}}(X) \le \epsilon \,\lambda _{\text {min}}\big (\mathbb {E}[X]\big ) \Big ] \le p\, e^{-(1-\epsilon )^2 \lambda _{\text {min}}(\mathbb {E}[X]) /(2L)} \end{aligned}$$
(16)
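As an illustrative numerical sanity check of (16), one can sample sums of independent rank-one PSD matrices built from unit vectors, so that \(L = 1\) and \(\lambda _{\text {min}}(\mathbb {E}[X]) = K/p\), and compare the empirical failure frequency with the bound. This sketch is not part of the paper and all names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
p, K, eps, trials = 4, 200, 0.5, 2000     # dimension, summands, epsilon, Monte Carlo trials
L = 1.0                                   # each X_k = u_k u_k^T with ||u_k|| = 1, so 0 <= X_k <= L * I

def min_eig_of_sum():
    U = rng.standard_normal((K, p))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # unit vectors u_k
    return np.linalg.eigvalsh(U.T @ U)[0]           # lambda_min(sum_k u_k u_k^T)

lam_star = K / p                                    # lambda_min(E[X]) = K/p for uniform unit vectors
empirical = np.mean([min_eig_of_sum() <= eps * lam_star for _ in range(trials)])
bound = p * np.exp(-(1 - eps) ** 2 * lam_star / (2 * L))
print(f"empirical failure rate {empirical:.4f} <= Chernoff bound {bound:.4f}")
```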

In order to lower bound the smallest eigenvalue of H(0) we use Lemma 1 together with the previous concentration result.

Proof

(Lemma 2). We first note that \(\mathbb {E}[H(0)] = \mathbb {E}[\sum _r H_r(0)] = H^{\infty }\), and moreover each \(H_r(0)\) is symmetric positive semidefinite with \(\lambda _{\text {max}}(H_r(0)) \le n (k+1)/m\) by Lemma 1. Applying the concentration bound (16) together with the assumption \(m \ge \frac{32}{{{\lambda }_*}}\, n(k+1) \ln ( n(k +1)/\delta )\) then yields the claim.
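For completeness, here is a sketch of that instantiation, assuming Lemma 2 concludes that \(\lambda _{\text {min}}(H(0)) \ge \tfrac{3}{4}\lambda _*\) with probability at least \(1-\delta \) (i.e. \(\epsilon = 3/4\) in (16), the choice for which the constant 32 appears). Taking \(p = n(k+1)\), \(L = n(k+1)/m\) and \(\lambda _{\text {min}}(\mathbb {E}[H(0)]) = \lambda _{\text {min}}(H^\infty ) = \lambda _*\) gives

$$\begin{aligned} \mathbb {P}\Big [\lambda _{\text {min}}(H(0)) \le \tfrac{3}{4}\lambda _* \Big ] \le n(k+1)\, e^{-\frac{1}{16}\,\lambda _*\, \frac{m}{2 n(k+1)}} = n(k+1)\, e^{-\lambda _* m/(32\, n(k+1))} \le \delta \end{aligned}$$

whenever \(m \ge \frac{32}{\lambda _*}\, n(k+1) \ln (n(k+1)/\delta )\).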

We next upper bound the errors at initialization.

Proof

(Lemma 3). Note that for any \(x_i\), due to the assumption on the independence of the weights at initialization and the normalization of the data:

$$ \mathbb {E}[(f(W,x_i))^2] = \sum _{r=1}^m \frac{1}{m} \mathbb {E}[\sigma (w_r^T x_i)^2] \le 1 $$

and similarly for the directional derivatives

$$ \mathbb {E}[ \Vert \bar{F}(W,x_i)\Vert _2^2 ] = \mathbb {E}_{g \sim \mathcal {N}(0,I)} [\Vert \sigma '(g^T x_i ) V_i^T g\Vert _2^2 ] \le \sum _{j=1}^k \mathbb {E}[ (v_{i,j}^T g)^2 ] \le k. $$

We conclude the proof by using Jensen’s and Markov’s inequalities.
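To make that last step concrete, one possible instantiation (the exact constants in Lemma 3 may differ) applies Markov's inequality to the two nonnegative sums whose expectations were just bounded by \(n\) and \(nk\):

$$ \mathbb {P}\Big [ \textstyle \sum _{i=1}^n f(W,x_i)^2 \ge \tfrac{2n}{\delta } \Big ] \le \tfrac{\delta }{2}, \qquad \mathbb {P}\Big [ \textstyle \sum _{i=1}^n \Vert \bar{F}(W,x_i)\Vert _2^2 \ge \tfrac{2nk}{\delta } \Big ] \le \tfrac{\delta }{2}, $$

so that, by a union bound, both bounds hold simultaneously with probability at least \(1-\delta \).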

B Proof of Proposition 1

Consider the \(d\times (k+1)\) matrices \(\mathbf {X}_i = [x_i, V_i]\), and for \(w \in \mathbb {R}^d\) define

$$ \hat{\psi }_w(x_i) = \sigma '(w^T x_i)\, \mathbf {X}_i $$

and the \(d \times (k+1)n\) matrix

$$ \widehat{\varOmega }(w) = \big [ \hat{\psi }_w(x_1), \dots , \hat{\psi }_w(x_n) \big ], $$

which corresponds to a column permutation of \(\varOmega (w)\). Next observe that the matrix \(\widehat{H}^\infty = \mathbb {E}_{w \sim \mathcal {N}(0,I_d)}[\widehat{\varOmega }(w)^T \widehat{\varOmega }(w)]\) is similar to \(H^\infty \) (via the corresponding permutation matrix) and therefore has the same eigenvalues. In this section we lower bound \({\lambda }_*\) by analyzing \(\widehat{H}^\infty \).
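The following NumPy sketch (illustrative only; the random data, the choice of directions, and all names are placeholders) assembles \(\widehat{\varOmega }(w)\) from the blocks \(\hat{\psi }_w(x_i)\) and Monte-Carlo estimates \(\widehat{H}^\infty \), whose smallest eigenvalue is the quantity \(\lambda _*\) analyzed below.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 5, 10, 2                                   # points, ambient dimension, directions per point

# unit-norm inputs x_i and (illustrative) unit-norm direction matrices V_i
X = rng.standard_normal((n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
V = rng.standard_normal((n, d, k)); V /= np.linalg.norm(V, axis=1, keepdims=True)

def omega_hat(w):
    """d x (k+1)n matrix [psi_hat_w(x_1), ..., psi_hat_w(x_n)],
    where psi_hat_w(x_i) = sigma'(w^T x_i) * [x_i, V_i] and sigma'(t) = 1{t > 0}."""
    blocks = [float(X[i] @ w > 0) * np.column_stack([X[i], V[i]]) for i in range(n)]
    return np.hstack(blocks)

# Monte-Carlo estimate of H_hat_infty = E_{w ~ N(0, I_d)}[Omega_hat(w)^T Omega_hat(w)]
samples = 20000
H_hat = np.zeros((n * (k + 1), n * (k + 1)))
for _ in range(samples):
    Om = omega_hat(rng.standard_normal(d))
    H_hat += Om.T @ Om
H_hat /= samples
print("estimated lambda_* =", np.linalg.eigvalsh(H_hat)[0])
```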

We begin by recalling some facts about the spectral properties of products of matrices.

Definition 2

([8]). Let \(\mathbf {A} = [A_{\alpha \beta }]_{\alpha = 1, \dots , n}^{\beta = 1, \dots , n}\) and \(\mathbf {B} = [B_{\alpha \beta }]_{\alpha = 1, \dots , n}^{\beta = 1, \dots , n}\) be \(np \times np\) matrices whose blocks \(A_{\alpha \beta }\) and \(B_{\alpha \beta }\) are \(p \times p\). Then the block Hadamard product \(\mathbf {A} \square \mathbf {B}\) is defined as the \(np \times np\) matrix

$$ \mathbf {A} \square \mathbf {B} := [A_{\alpha \beta } B_{\alpha \beta } ]_{\alpha = 1, \dots , n}^{\beta = 1, \dots , n} $$

where \(A_{\alpha \beta } B_{\alpha \beta }\) denotes the usual matrix product between \(A_{\alpha \beta }\) and \(B_{\alpha \beta }\).
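For illustration, a direct NumPy implementation of this definition (our naming, not from [8]); for \(p = 1\) it reduces to the usual entrywise Hadamard product.

```python
import numpy as np

def block_hadamard(A, B, p):
    """Block Hadamard product A □ B of two (n*p) x (n*p) matrices viewed as n x n grids
    of p x p blocks: the (alpha, beta) block of the result is A_{alpha beta} @ B_{alpha beta}."""
    size = A.shape[0]
    assert A.shape == B.shape == (size, size) and size % p == 0
    n = size // p
    out = np.zeros_like(A)
    for alpha in range(n):
        for beta in range(n):
            ra = slice(alpha * p, (alpha + 1) * p)
            rb = slice(beta * p, (beta + 1) * p)
            out[ra, rb] = A[ra, rb] @ B[ra, rb]
    return out
```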

The following result, which generalizes Schur's product theorem, concerns the eigenvalues of the block Hadamard product of two block matrices.

Proposition 2

([8]). Let \(\mathbf {A} = [A_{\alpha \beta }]_{\alpha = 1, \dots , n}^{\beta = 1, \dots , n}\) and \(\mathbf {B} = [B_{\alpha \beta }]_{\alpha = 1, \dots , n}^{\beta = 1, \dots , n}\) be \(np \times np\) positive semidefinite matrices, and assume that every \(p\times p\) block of \(\mathbf {A}\) commutes with every \(p\times p\) block of \(\mathbf {B}\). Then:

$$ \lambda _{\text {min}}(\mathbf {B} \square \mathbf {A})= \lambda _{\text {min}}(\mathbf {A} \square \mathbf {B}) \ge \lambda _{\text {min}}(\mathbf {A}) \cdot \min _{\alpha }\lambda _{\text {min}}(B_{\alpha \alpha }) $$

We finally recall the following fact on the eigenvalues of the Kronecker product of two matrices.

Proposition 3

([11]). Let \(A\) be a square matrix with eigenvalues \(\{\lambda _i\}\) and \(B\) a square matrix with eigenvalues \(\{\mu _j\}\). Then the Kronecker product \(A \otimes B\) has eigenvalues \(\{\lambda _i \mu _j \}\).
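A quick numerical illustration of this fact for symmetric matrices (not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3)); A = A + A.T        # symmetric, hence real eigenvalues
B = rng.standard_normal((2, 2)); B = B + B.T
lam = np.linalg.eigvalsh(A)                         # eigenvalues {lambda_i}
mu = np.linalg.eigvalsh(B)                          # eigenvalues {mu_j}
kron_eigs = np.sort(np.linalg.eigvalsh(np.kron(A, B)))
pairwise = np.sort(np.array([l * m for l in lam for m in mu]))
assert np.allclose(kron_eigs, pairwise)             # spectrum of A ⊗ B = pairwise products
```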

We next define the following random kernel matrix.

Definition 3

Let \(w \sim \mathcal {N}(0,I)\) and define the random \(n \times n\) matrix \(\mathcal {M}(w)\) with entries \([\mathcal {M}(w)]_{ij}= \sigma '(w^T x_i)\, \sigma '(w^T x_j)\).

The next result from [12] establishes positive definiteness of this matrix in expectation, under the separation condition (7).

Lemma 5

([12]). Let \(x_1, \dots , x_n \in \mathbb {R}^d\) have unit Euclidean norm and assume that (7) is satisfied for all \(i = 1, \dots , n\). Then the following holds:

$$ \mathbb {E}_{w \sim \mathcal {N}(0,I)} [\mathcal {M}(w)] \succeq \frac{\delta _1 }{100\, n^2}\, I. $$

Finally, let \(\mathbf {X} = [\mathbf {X}_1, \dots , \mathbf {X}_n]\) be the \(d \times (k+1)n\) block matrix with \(d\times (k+1)\) blocks \(\mathbf {X}_i\). Thanks to assumption (8), the following result on the Gram matrices \(\mathbf {X}_i^T\mathbf {X}_i\) holds.

Lemma 6

Assume that condition (8) is satisfied. Then for any \(i = 1, \dots , n\) we have \(\lambda _{\text {min}} (\mathbf {X}_i^T\mathbf {X}_i) \ge 1 - k \delta _2 > 0\).

Proof

The claim follows by observing that by Gershgorin’s Disk Theorem:

$$ | \lambda _{\text {min}} (\mathbf {X}_i^T\mathbf {X}_i) - 1| \le \sum _{ 1 \le j \le k} |x_i^T v_{i,j}|\le k \delta _2. $$

Finally observe that we can write

$$ \widehat{H}^\infty = \mathbb {E}_{w \sim \mathcal {N}(0,I)}\big [ (\mathbf {X}^T \mathbf {X})\square (\mathcal {M}(w) \otimes I ) \big ], $$

so that Proposition 2, Proposition 3, Lemma 5 and Lemma 6 together yield the claim of Proposition 1.
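Explicitly, using the linearity of the expectation and assuming Proposition 1 asserts a lower bound of this form, the resulting chain of inequalities reads

$$\begin{aligned} \lambda _* = \lambda _{\text {min}}(\widehat{H}^\infty ) = \lambda _{\text {min}}\Big ( (\mathbf {X}^T\mathbf {X}) \square \big (\mathbb {E}[\mathcal {M}(w)] \otimes I\big ) \Big ) \ge \lambda _{\text {min}}\big (\mathbb {E}[\mathcal {M}(w)] \otimes I\big ) \cdot \min _{i} \lambda _{\text {min}}(\mathbf {X}_i^T\mathbf {X}_i) \ge \frac{\delta _1}{100\, n^2}\,(1 - k\,\delta _2), \end{aligned}$$

where the first inequality uses Proposition 2 (the blocks of \(\mathbb {E}[\mathcal {M}(w)] \otimes I\) are scalar multiples of the identity and hence commute with every block of \(\mathbf {X}^T\mathbf {X}\)), and the second uses Proposition 3 together with Lemmas 5 and 6.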


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Cocola, J., Hand, P. (2020). Global Convergence of Sobolev Training for Overparameterized Neural Networks. In: Nicosia, G., et al. (eds.) Machine Learning, Optimization, and Data Science. LOD 2020. Lecture Notes in Computer Science, vol. 12565. Springer, Cham. https://doi.org/10.1007/978-3-030-64583-0_51


  • DOI: https://doi.org/10.1007/978-3-030-64583-0_51


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-64582-3

  • Online ISBN: 978-3-030-64583-0

