Abstract
A Sobolev loss is used when training a network to approximate the values and derivatives of a target function at a prescribed set of input points. Recent works have demonstrated its successful application to tasks such as distillation and synthetic gradient prediction. In this work we prove that an overparameterized two-layer ReLU neural network trained on the Sobolev loss with gradient flow from random initialization can fit any given function values and any given directional derivatives, under a separation condition on the input data.
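For concreteness, the following is a minimal, illustrative JAX sketch of Sobolev training of a two-layer ReLU network: the loss penalizes both value errors and directional-derivative errors at the training points, with directional derivatives obtained via a Jacobian-vector product (jax.jvp). The width, step size, toy target, and the choice of training only the first layer are assumptions made here for illustration, not the setting analyzed in the paper.

```python
import jax
import jax.numpy as jnp

def init_params(key, d, m):
    # Two-layer ReLU network f(x) = (1/sqrt(m)) * a^T relu(W x),
    # Gaussian first layer, random +-1 second layer.
    kw, ka = jax.random.split(key)
    W = jax.random.normal(kw, (m, d))
    a = jnp.sign(jax.random.normal(ka, (m,)))
    return W, a

def net(params, x):
    W, a = params
    return jnp.dot(a, jax.nn.relu(W @ x)) / jnp.sqrt(W.shape[0])

def sobolev_loss(params, xs, vs, ys, gs):
    # Fit values ys at points xs and directional derivatives gs
    # along the (unit) directions vs.
    def value_and_dir_der(x, v):
        f, df = jax.jvp(lambda z: net(params, z), (x,), (v,))
        return f, df
    fs, dfs = jax.vmap(value_and_dir_der)(xs, vs)
    return 0.5 * jnp.sum((fs - ys) ** 2) + 0.5 * jnp.sum((dfs - gs) ** 2)

# Illustrative data: n unit-norm points in R^d, one unit direction each.
d, m, n, lr = 5, 2048, 10, 1e-1
xs = jax.random.normal(jax.random.PRNGKey(0), (n, d))
xs /= jnp.linalg.norm(xs, axis=1, keepdims=True)
vs = jax.random.normal(jax.random.PRNGKey(1), (n, d))
vs /= jnp.linalg.norm(vs, axis=1, keepdims=True)
ys = jnp.sin(xs.sum(axis=1))                   # toy target values
gs = jnp.cos(xs.sum(axis=1)) * vs.sum(axis=1)  # matching directional derivatives

params = init_params(jax.random.PRNGKey(2), d, m)
grad_fn = jax.jit(jax.grad(sobolev_loss))
for step in range(200):  # plain gradient descent as a discrete proxy for gradient flow
    gW, _ = grad_fn(params, xs, vs, ys, gs)
    params = (params[0] - lr * gW, params[1])  # update the first layer only (assumed setup)
```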
Notes
1. Notice that the introduction of the constants \(\alpha \) and \(\beta \) does not change the expressivity of the network.
References
Allen-Zhu, Z., Li, Y., Song, Z.: A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962 (2018)
Arora, S., Du, S.S., Hu, W., Li, Z., Wang, R.: Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584 (2019)
Bietti, A., Mairal, J.: On the inductive bias of neural tangent kernels. In: Advances in Neural Information Processing Systems, pp. 12873–12884 (2019)
Chizat, L., Oyallon, E., Bach, F.: On lazy training in differentiable programming. arXiv preprint arXiv:1812.07956 (2018)
Czarnecki, W.M., Osindero, S., Jaderberg, M., Swirszcz, G., Pascanu, R.: Sobolev training for neural networks. In: Advances in Neural Information Processing Systems, pp. 4278–4287 (2017)
Du, S.S., Zhai, X., Poczos, B., Singh, A.: Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054 (2018)
Gühring, I., Kutyniok, G., Petersen, P.: Error bounds for approximations with deep ReLU neural networks in \(W^{s,p}\) norms. arXiv preprint arXiv:1902.07896 (2019)
Günther, M., Klotz, L.: Schur’s theorem for a block Hadamard product. Linear Algebra Appl. 437(3), 948–956 (2012)
Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4(2), 251–257 (1991)
Jacot, A., Gabriel, F., Hongler, C.: Neural tangent Kernel: convergence and generalization in neural networks. In: Advances in Neural Information Processing Systems, pp. 8571–8580 (2018)
Laub, A.J.: Matrix Analysis for Scientists and Engineers, vol. 91. SIAM (2005)
Oymak, S., Soltanolkotabi, M.: Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674 (2019)
Simard, P., Victorri, B., LeCun, Y., Denker, J.: Tangent prop - a formalism for specifying selected invariances in an adaptive network. In: Advances in Neural Information Processing Systems, pp. 895–903 (1992)
Srinivas, S., Fleuret, F.: Knowledge transfer with Jacobian matching. arXiv preprint arXiv:1803.00443 (2018)
Tropp, J.A.: An introduction to matrix concentration inequalities. Found. Trends Mach. Learn. 8(1–2), 1–230 (2015)
Vlassis, N., Ma, R., Sun, W.: Geometric deep learning for computational mechanics part i: anisotropic hyperelasticity. arXiv preprint arXiv:2001.04292 (2020)
Weinan, E., Ma, C., Wu, L.: A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics. Sci. China Math. 63, 1235–1258 (2020). https://doi.org/10.1007/s11425-019-1628-5
Zagoruyko, S., Komodakis, N.: Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928 (2016)
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
Zou, D., Cao, Y., Zhou, D., Gu, Q.: Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888 (2018)
Zou, D., Gu, Q.: An improved analysis of training over-parameterized deep neural networks. In: Advances in Neural Information Processing Systems, pp. 2053–2062 (2019)
Acknowledgements
PH is supported in part by NSF CAREER Grant DMS-1848087.
Appendices
A Supplementary proofs for Sect. 3.1
In this section we provide the remaining proofs of the results in Sect. 3.1. We begin by recalling the following matrix Chernoff inequality (see, for example, [15, Theorem 5.1.1]).
Theorem 3
(Matrix Chernoff). Consider a finite sequence \(X_k\) of \(p \times p\) independent, random, Hermitian matrices with \(0 \preceq X_k \preceq L I\). Let \(X = \textstyle {\sum }_k X_k\). Then for all \(\epsilon \in [0,1)\)
\[
\mathbb {P}\big (\lambda _{\text {min}}(X) \le (1-\epsilon )\, \lambda _{\text {min}}(\mathbb {E}[X])\big ) \le p \left[ \frac{e^{-\epsilon }}{(1-\epsilon )^{1-\epsilon }}\right] ^{\lambda _{\text {min}}(\mathbb {E}[X])/L}. \qquad (16)
\]
In order to lower bound the smallest eigenvalue of H(0) we use Lemma 1 together with the previous concentration result.
Proof
(Lemma 2). We first note that \(\mathbb {E}[H(0)] = \mathbb {E}[\sum _r H_r(0)] = H^{\infty }\), and moreover each \(H_r(0)\) is symmetric positive semidefinite with \(\lambda _{\text {max}}(H_r(0)) \le n (k+1)/m\) by Lemma 1. Applying the concentration bound (16) under the assumption \(m \ge \frac{32}{{{\lambda }_*}}\, n(k+1) \ln ( n(k +1)/\delta )\) then gives the claim.
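To see how the stated width requirement can arise, one may, for instance, apply (16) with \(L = n(k+1)/m\), \(\lambda _{\text {min}}(\mathbb {E}[H(0)]) = {\lambda }_*\) and \(\epsilon = 1/4\) (these constants are one possible choice for illustration, not necessarily the ones used in the original argument):
\[
\mathbb {P}\Big (\lambda _{\text {min}}(H(0)) \le \tfrac{3}{4}\, {\lambda }_*\Big ) \le n(k+1) \left[ \frac{e^{-1/4}}{(3/4)^{3/4}}\right] ^{{\lambda }_* m/(n(k+1))} \le n(k+1)\, e^{-{\lambda }_* m/(32\, n(k+1))} \le \delta ,
\]
where the second inequality uses \(e^{-\epsilon }/(1-\epsilon )^{1-\epsilon } \le e^{-\epsilon ^2/2}\) for \(\epsilon \in [0,1)\), and the last one is equivalent to the stated lower bound on \(m\).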
We next upper bound the errors at initialization.
Proof
(Lemma 3). Note that for any \(x_i\), due to the assumption on the independence of the weights at initialization and the normalization of the data:
and similarly for the directional derivatives
We conclude the proof by using Jensen’s and Markov’s inequalities.
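Schematically, writing \(e(0)\) for the vector collecting the value and directional-derivative errors at initialization and \(C\) for the second-moment bound obtained above (both are placeholder notation for this sketch), Jensen’s and Markov’s inequalities give, for any \(t > 0\),
\[
\mathbb {P}\big (\Vert e(0)\Vert _2 \ge t\big ) \le \frac{\mathbb {E}\, \Vert e(0)\Vert _2}{t} \le \frac{\sqrt{\mathbb {E}\, \Vert e(0)\Vert _2^2}}{t} \le \frac{\sqrt{C}}{t},
\]
so that choosing \(t = \sqrt{C}/\delta \) makes the failure probability at most \(\delta \).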
B Proof of Proposition 1
Consider the \(d\times (k+1)\) matrices \(\mathbf {X}_i = [x_i, V_i]\), and for \(w \in \mathbb {R}^d\) define \(\widehat{\varOmega }_i(w) = \sigma '(w^T x_i)\, \mathbf {X}_i\), \(i = 1, \dots , n\), and the \(d \times (k+1)n\) matrix:
\[
\widehat{\varOmega }(w) = \big [\widehat{\varOmega }_1(w), \dots , \widehat{\varOmega }_n(w)\big ] = \big [\sigma '(w^T x_1)\, \mathbf {X}_1, \dots , \sigma '(w^T x_n)\, \mathbf {X}_n\big ],
\]
which corresponds to a column permutation of \(\varOmega (w)\). Next observe that the matrix \(\widehat{H}^\infty = \mathbb {E}_{w \sim \mathcal {N}(0,I_d)}[\widehat{\varOmega }(w)^T \widehat{\varOmega }(w)]\) is similar to \(H^\infty \) and therefore has the same eigenvalues. In this section we lower bound \({\lambda }_*\) by analyzing \(\widehat{H}^\infty \).
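This can be made explicit: if \(P\) denotes the \((k+1)n \times (k+1)n\) permutation matrix realizing the column reordering, so that \(\widehat{\varOmega }(w) = \varOmega (w)\, P\), then
\[
\widehat{H}^\infty = \mathbb {E}_{w}\big [P^T \varOmega (w)^T \varOmega (w)\, P\big ] = P^T\, \mathbb {E}_{w}\big [\varOmega (w)^T \varOmega (w)\big ]\, P = P^T H^{\infty } P,
\]
which is an orthogonal change of basis and therefore preserves the spectrum.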
We begin by recalling some facts about the spectral properties of products of matrices.
Definition 2
([8]). Let \(\mathbf {A} = [A_{\alpha \beta }]_{\alpha = 1, \dots , n}^{\beta = 1, \dots , n}\) and \(\mathbf {B} = [B_{\alpha \beta }]_{\alpha = 1, \dots , n}^{\beta = 1, \dots , n}\) be \(n p \times np\) matrices in which each block \(A_{\alpha \beta }\), \(B_{\alpha \beta }\) is of size \(p \times p\). Then we define the block Hadamard product \(\mathbf {A} \square \mathbf {B}\) as the \(n p \times np\) matrix with blocks:
\[
[\mathbf {A} \square \mathbf {B}]_{\alpha \beta } = A_{\alpha \beta } B_{\alpha \beta },
\]
where \(A_{\alpha \beta } B_{\alpha \beta }\) denotes the usual matrix product between \(A_{\alpha \beta }\) and \(B_{\alpha \beta }\).
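For instance, for \(n = 2\) the definition reads
\[
\mathbf {A} \square \mathbf {B} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \square \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} A_{11} B_{11} & A_{12} B_{12} \\ A_{21} B_{21} & A_{22} B_{22} \end{pmatrix},
\]
so the blocks are multiplied as matrices rather than entrywise; for \(p = 1\) this reduces to the usual Hadamard product.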
Generalizing the Schur product theorem, one has the following result regarding the eigenvalues of the block Hadamard product of two block matrices.
Proposition 2
([8]). Let \(\mathbf {A} = [A_{\alpha \beta }]_{\alpha = 1, \dots , n}^{\beta = 1, \dots , n}\) and \(\mathbf {B} = [B_{\alpha \beta }]_{\alpha = 1, \dots , n}^{\beta = 1, \dots , n}\) be \(n p \times np\) positive semidefinite matrices. Assume that every \(p\times p\) block of \(\mathbf {A}\) commutes with every \(p\times p\) block of \(\mathbf {B}\). Then \(\mathbf {A} \square \mathbf {B}\) is positive semidefinite and:
\[
\lambda _{\text {min}}(\mathbf {A} \square \mathbf {B}) \ge \lambda _{\text {min}}(\mathbf {A}) \, \min _{\alpha = 1, \dots , n} \lambda _{\text {min}}(B_{\alpha \alpha }).
\]
We finally recall the following fact on the eigenvalues of the Kronecker product of two matrices.
Proposition 3
([11]). Let \(A \in \mathbb {R}^{p \times p}\) with eigenvalues \(\{\lambda _i\}_{i=1}^{p}\) and \(B \in \mathbb {R}^{q \times q}\) with eigenvalues \(\{\mu _j\}_{j=1}^{q}\). Then the Kronecker product \(A \otimes B\) between A and B has eigenvalues \(\{\lambda _i \mu _j \}_{i,j}\).
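In particular, as used in the factorization at the end of this appendix,
\[
\lambda _{\text {min}}\big (A \otimes I_{k+1}\big ) = \lambda _{\text {min}}(A)
\]
for any symmetric matrix \(A\), since each eigenvalue of \(A\) appears with multiplicity \(k+1\) in \(A \otimes I_{k+1}\).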
We next define the following random kernel matrix.
Definition 3
Let \(w \sim \mathcal {N}(0,I_d)\) and define the random \(n \times n\) matrix \(\mathcal {M}(w)\) with entries \([\mathcal {M}(w)]_{ij}= \sigma '(w^T x_i)\, \sigma '(w^T x_j)\).
The next result from [12] establishes positive definiteness of this matrix in expectation, under the separation condition (7).
Lemma 5
([12]). Let \(x_1, \dots , x_n\) in \(\mathbb {R}^d\) with unit Euclidean norm and assume that (7) is satisfied for all \(i = 1, \dots , n\). Then the following holds:
Finally, let \(\mathbf {X} = [\mathbf {X}_1, \dots , \mathbf {X}_n]\) be the \(d \times (k+1)n\) block matrix with \(d\times (k+1)\) blocks \(\mathbf {X}_i\). Thanks to the assumption (8), the following result on the Gram matrices \(\mathbf {X}_i^T\mathbf {X}_i\) holds.
Lemma 6
Assume that the condition (8) is satisfied, then for any \(i = 1, \dots , n\) we have \(\lambda _{\text {min}} (\mathbf {X}_i^T\mathbf {X}_i) \ge 1 - k \delta _2 > 0\).
Proof
The claim follows by observing that the columns of \(\mathbf {X}_i\) have unit norm and, by (8), the off-diagonal entries of \(\mathbf {X}_i^T\mathbf {X}_i\) are bounded in absolute value by \(\delta _2\), so that Gershgorin’s Disk Theorem gives:
\[
\lambda _{\text {min}}(\mathbf {X}_i^T\mathbf {X}_i) \ge \min _{\ell } \Big ( [\mathbf {X}_i^T\mathbf {X}_i]_{\ell \ell } - \sum _{j \ne \ell } \big |[\mathbf {X}_i^T\mathbf {X}_i]_{\ell j}\big | \Big ) \ge 1 - k\, \delta _2.
\]
Finally observe that we can write:
\[
\widehat{H}^{\infty } = \mathbb {E}_{w \sim \mathcal {N}(0,I_d)}\big [\widehat{\varOmega }(w)^T \widehat{\varOmega }(w)\big ] = \big (\mathbb {E}_{w}[\mathcal {M}(w)] \otimes I_{k+1}\big ) \,\square \, \big (\mathbf {X}^T\mathbf {X}\big ),
\]
since the \((i,j)\) block of \(\widehat{\varOmega }(w)^T \widehat{\varOmega }(w)\) is \(\sigma '(w^T x_i)\sigma '(w^T x_j)\, \mathbf {X}_i^T\mathbf {X}_j\), and every block of \(\mathbb {E}_{w}[\mathcal {M}(w)] \otimes I_{k+1}\) is a multiple of the identity and hence commutes with every block of \(\mathbf {X}^T\mathbf {X}\). Proposition 2, Proposition 3, Lemma 5 and Lemma 6 then allow us to derive the claim of Proposition 1.
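Schematically, combining these ingredients (the precise quantitative constant supplied by Lemma 5 is omitted here), the chain of inequalities reads:
\[
{\lambda }_* = \lambda _{\text {min}}(H^{\infty }) = \lambda _{\text {min}}(\widehat{H}^{\infty }) \ge \lambda _{\text {min}}\big (\mathbb {E}_{w}[\mathcal {M}(w)] \otimes I_{k+1}\big ) \, \min _{i} \lambda _{\text {min}}(\mathbf {X}_i^T\mathbf {X}_i) \ge \lambda _{\text {min}}\big (\mathbb {E}_{w}[\mathcal {M}(w)]\big )\, (1 - k\, \delta _2) > 0.
\]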
Cite this paper
Cocola, J., Hand, P. (2020). Global Convergence of Sobolev Training for Overparameterized Neural Networks. In: Nicosia, G., et al. (eds.) Machine Learning, Optimization, and Data Science. LOD 2020. Lecture Notes in Computer Science, vol. 12565. Springer, Cham. https://doi.org/10.1007/978-3-030-64583-0_51