Global Convergence of Sobolev Training for Overparameterized Neural Networks

  • Conference paper
Machine Learning, Optimization, and Data Science (LOD 2020)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12565)


Sobolev loss is used when training a network to approximate the values and derivatives of a target function at a prescribed set of input points. Recent works have demonstrated its successful application in tasks such as distillation and synthetic gradient prediction. In this work we prove that an overparameterized two-layer ReLU neural network trained on the Sobolev loss with gradient flow from random initialization can fit any given function values and any given directional derivatives, under a separation condition on the input data.


  1. Notice that the introduction of the constants \(\alpha \) and \(\beta \) does not change the expressivity of the network.


  1. Allen-Zhu, Z., Li, Y., Song, Z.: A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962 (2018)

  2. Arora, S., Du, S.S., Hu, W., Li, Z., Wang, R.: Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584 (2019)

  3. Bietti, A., Mairal, J.: On the inductive bias of neural tangent kernels. In: Advances in Neural Information Processing Systems. pp. 12873–12884 (2019)

  4. Chizat, L., Oyallon, E., Bach, F.: On lazy training in differentiable programming. arXiv preprint arXiv:1812.07956 (2018)

  5. Czarnecki, W.M., Osindero, S., Jaderberg, M., Swirszcz, G., Pascanu, R.: Sobolev training for neural networks. In: Advances in Neural Information Processing Systems, pp. 4278–4287 (2017)

  6. Du, S.S., Zhai, X., Poczos, B., Singh, A.: Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054 (2018)

  7. Gühring, I., Kutyniok, G., Petersen, P.: Error bounds for approximations with deep ReLU neural networks in \(W^{s, p}\) norms. arXiv preprint arXiv:1902.07896 (2019)

  8. Günther, M., Klotz, L.: Schur’s theorem for a block Hadamard product. Linear Algebra Appl. 437(3), 948–956 (2012)

  9. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4(2), 251–257 (1991)

  10. Jacot, A., Gabriel, F., Hongler, C.: Neural tangent kernel: convergence and generalization in neural networks. In: Advances in Neural Information Processing Systems, pp. 8571–8580 (2018)

  11. Laub, A.J.: Matrix Analysis for Scientists and Engineers, vol. 91. SIAM (2005)

  12. Oymak, S., Soltanolkotabi, M.: Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674 (2019)

  13. Simard, P., Victorri, B., LeCun, Y., Denker, J.: Tangent prop - a formalism for specifying selected invariances in an adaptive network. In: Advances in Neural Information Processing Systems, pp. 895–903 (1992)

  14. Srinivas, S., Fleuret, F.: Knowledge transfer with Jacobian matching. arXiv preprint arXiv:1803.00443 (2018)

  15. Tropp, J.A., et al.: An introduction to matrix concentration inequalities. Found. Trends® Mach. Learn. 8(1–2), 1–230 (2015)

  16. Vlassis, N., Ma, R., Sun, W.: Geometric deep learning for computational mechanics part I: anisotropic hyperelasticity. arXiv preprint arXiv:2001.04292 (2020)

  17. Weinan, E., Ma, C., Wu, L.: A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gradient descent dynamics. Sci. China Math. 63, 1235–1258 (2020)

  18. Zagoruyko, S., Komodakis, N.: Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928 (2016)

  19. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)

  20. Zou, D., Cao, Y., Zhou, D., Gu, Q.: Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888 (2018)

  21. Zou, D., Gu, Q.: An improved analysis of training over-parameterized deep neural networks. In: Advances in Neural Information Processing Systems, pp. 2053–2062 (2019)


PH is supported in part by NSF CAREER Grant DMS-1848087.

Author information

Corresponding author

Correspondence to Jorio Cocola.


A Supplementary proofs for Sect. 3.1

In this section we provide the remaining proofs of the results in Sect. 3.1. We begin by recalling the following matrix Chernoff inequality (see for example [15, Theorem 5.1.1]).

Theorem 3

(Matrix Chernoff). Consider a finite sequence \(X_k\) of independent, random, Hermitian \(p \times p\) matrices with \(0 \preceq X_k \preceq L I\), and let \(X = \textstyle {\sum }_k X_k\). Then for all \(\epsilon \in [0,1)\):

$$\begin{aligned} \mathbb {P}\Big [\lambda _{\text {min}}(X) \le \epsilon \lambda _{\text {min}}\big (\mathbb {E}[X]\big ) \Big ] \le p e^{-(1-\epsilon )^2 \lambda _{\text {min}}(\mathbb {E}[X]) /2L} \end{aligned}$$

In order to lower bound the smallest eigenvalue of H(0) we use Lemma 1 together with the previous concentration result.


Proof (Lemma 2). We first note that \(\mathbb {E}[H(0)] = \mathbb {E}[\sum _r H_r(0)] = H^{\infty }\), and moreover each \(H_r(0)\) is symmetric positive semidefinite with \(\lambda _{\text {max}}(H_r(0)) \le n (k+1)/m\) by Lemma 1. Applying the concentration bound (16) with the assumption \(m \ge \frac{32}{{{\lambda }_*}}\, n(k+1) \ln ( n(k +1)/\delta )\) then gives the claim.
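
To see where the constant 32 comes from, one can instantiate the matrix Chernoff inequality of Theorem 3 with \(p = n(k+1)\), \(L = n(k+1)/m\) and \(\lambda _{\text {min}}(\mathbb {E}[X]) = {\lambda }_*\). The choice \(\epsilon = 3/4\) below is our assumption (the statement of Lemma 2 is not reproduced in this appendix); it is the value for which the exponent matches the stated requirement on \(m\):

$$\begin{aligned} \mathbb {P}\Big [\lambda _{\text {min}}(H(0)) \le \tfrac{3}{4} {\lambda }_* \Big ] \le n(k+1)\, e^{-(1/4)^2 {\lambda }_* m / (2 n (k+1))} = n(k+1)\, e^{-{\lambda }_* m / (32\, n (k+1))} \le \delta , \end{aligned}$$

where the last inequality is equivalent to \(m \ge \frac{32}{{\lambda }_*}\, n(k+1) \ln (n(k+1)/\delta )\).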

We next upper bound the errors at initialization.


Proof (Lemma 3). Note that for any \(x_i\), due to the assumption on the independence of the weights at initialization and the normalization of the data:

$$ \mathbb {E}[(f(W,x_i))^2] = \sum _{r=1}^m \frac{1}{m} \mathbb {E}[\sigma (w_r^T x_i)^2] \le 1 $$

and similarly for the directional derivatives

$$ \mathbb {E}[ \Vert \bar{F}(W,x_i)\Vert _2^2 ] = \mathbb {E}_{g \sim \mathcal {N}(0,I)} [\Vert \sigma '(g^T x_i ) V_i^T g\Vert _2^2 ] \le \sum _{j=1}^k \mathbb {E}[ (v_{i,j}^T g)^2 ] \le k. $$

We conclude the proof by using Jensen’s and Markov’s inequalities.

B Proof of Proposition 1

Consider the \(d\times (k+1)\) matrices \(\mathbf {X}_i = [x_i, V_i]\), and for \(w \in \mathbb {R}^d\) define

$$ \hat{\psi }_w(x_i) = \sigma '(w^T x_i) \mathbf {X}_i $$

and the \(d \times (k+1)n\) matrix

$$ \widehat{\varOmega }(w) = [\hat{\psi }_w(x_1), \dots , \hat{\psi }_w(x_n)], $$

which corresponds to a column permutation of \(\varOmega (w)\). Next observe that the matrix \(\widehat{H}^\infty = \mathbb {E}_{w \sim \mathcal {N}(0,I_d)}[\widehat{\varOmega }(w)^T \widehat{\varOmega }(w)]\) is similar to \(H^\infty \) and therefore has the same eigenvalues. In this section we lower bound \({\lambda }_*\) by analyzing \(\widehat{H}^\infty \).

We begin by recalling some facts about the spectral properties of products of matrices.

Definition 2

([8]). Let \(\mathbf {A} = [A_{\alpha \beta }]_{\alpha = 1, \dots , n}^{\beta = 1, \dots , n}\) and \(\mathbf {B} = [B_{\alpha \beta }]_{\alpha = 1, \dots , n}^{\beta = 1, \dots , n}\) be \(n p \times np\) matrices in which each block \(A_{\alpha \beta }\), \(B_{\alpha \beta }\) is a \(p \times p\) matrix. Then we define the block Hadamard product \(\mathbf {A} \square \mathbf {B}\) as the \(n p \times np\) matrix:

$$ \mathbf {A} \square \mathbf {B} := [A_{\alpha \beta } B_{\alpha \beta } ]_{\alpha = 1, \dots , n}^{\beta = 1, \dots , n} $$

where \(A_{\alpha \beta } B_{\alpha \beta }\) denotes the usual matrix product between \(A_{\alpha \beta }\) and \(B_{\alpha \beta }\).
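
As an illustration of Definition 2, the block Hadamard product can be sketched in NumPy as follows. The function name `block_hadamard` and the block-size argument `p` are our own choices for this sketch and do not come from [8].

```python
import numpy as np

def block_hadamard(A, B, p):
    """Block Hadamard product: block (a, b) of the result is A_ab @ B_ab."""
    if A.shape != B.shape or A.shape[0] != A.shape[1] or A.shape[0] % p != 0:
        raise ValueError("A and B must be square, of equal shape, and divisible into p x p blocks")
    n = A.shape[0] // p
    # View each matrix as an n x n grid of p x p blocks, indexed [a, i, b, j],
    # where (a, b) selects the block and (i, j) the entry inside the block.
    A4 = A.reshape(n, p, n, p)
    B4 = B.reshape(n, p, n, p)
    # Ordinary matrix product of matching blocks: sum over the inner index j.
    C4 = np.einsum("aibj,ajbk->aibk", A4, B4)
    return C4.reshape(n * p, n * p)
```

With \(p = 1\) this reduces to the entrywise Hadamard product, and with a single block (\(n = 1\)) it reduces to the ordinary matrix product.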

The following generalization of Schur’s Lemma bounds the smallest eigenvalue of the block Hadamard product of two block matrices.

Proposition 2

([8]). Let \(\mathbf {A} = [A_{\alpha \beta }]_{\alpha = 1, \dots , n}^{\beta = 1, \dots , n}\) and \(\mathbf {B} = [B_{\alpha \beta }]_{\alpha = 1, \dots , n}^{\beta = 1, \dots , n}\) be \(n p \times np\) positive semidefinite matrices. Assume that every \(p\times p\) block of \(\mathbf {A}\) commutes with every \(p\times p\) block of \(\mathbf {B}\). Then:

$$ \lambda _{\text {min}}(\mathbf {B} \square \mathbf {A})= \lambda _{\text {min}}(\mathbf {A} \square \mathbf {B}) \ge \lambda _{\text {min}}(\mathbf {A}) \cdot \min _{\alpha }\lambda _{\text {min}}(B_{\alpha \alpha }). $$
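
A quick numerical sanity check of Proposition 2 is sketched below. It assumes (our choice, matching the way the proposition is used at the end of this section) that \(\mathbf {A} = K \otimes I_p\), so that every block of \(\mathbf {A}\) is a multiple of the identity and commutes with any \(p \times p\) block. The helper names and the random test matrices are illustrative only.

```python
import numpy as np

def block_hadamard(A, B, p):
    """Block (a, b) of the result is the matrix product A_ab @ B_ab."""
    n = A.shape[0] // p
    A4, B4 = A.reshape(n, p, n, p), B.reshape(n, p, n, p)
    return np.einsum("aibj,ajbk->aibk", A4, B4).reshape(n * p, n * p)

def check_block_schur(n=4, p=3, seed=0):
    """Return (lambda_min(A box B), lambda_min(A) * min_a lambda_min(B_aa))."""
    rng = np.random.default_rng(seed)
    # K is symmetric positive definite; A = K kron I_p has blocks K_ab * I_p.
    G = rng.standard_normal((n, n))
    K = G @ G.T + 0.1 * np.eye(n)
    A = np.kron(K, np.eye(p))
    # B is a generic positive semidefinite Gram matrix.
    Y = rng.standard_normal((2 * n * p, n * p))
    B = Y.T @ Y
    lhs = np.linalg.eigvalsh(block_hadamard(A, B, p)).min()
    diag_min = min(
        np.linalg.eigvalsh(B[a * p:(a + 1) * p, a * p:(a + 1) * p]).min()
        for a in range(n)
    )
    rhs = np.linalg.eigvalsh(A).min() * diag_min
    return lhs, rhs
```

Calling `check_block_schur()` returns the two sides of the inequality, and the first value should not fall below the second (up to floating-point error).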

We finally recall the following fact about the eigenvalues of the Kronecker product of matrices.

Proposition 3

([11]). Let \(A\) be a square matrix with eigenvalues \(\{\lambda _i\}\) and \(B\) a square matrix with eigenvalues \(\{\mu _j\}\). Then the Kronecker product \(A \otimes B\) of A and B has eigenvalues \(\{\lambda _i \mu _j \}\).
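
Proposition 3 is easy to verify numerically. The sketch below restricts to symmetric matrices so that all eigenvalues are real and can be sorted; this restriction and the function name `kron_eigs_match` are assumptions of the example, not part of the proposition.

```python
import numpy as np

def kron_eigs_match(A, B):
    """Check that eig(A kron B) equals all pairwise products of eig(A), eig(B).

    A and B are assumed symmetric, so eigvalsh applies and eigenvalues are real.
    """
    lam = np.linalg.eigvalsh(A)
    mu = np.linalg.eigvalsh(B)
    products = np.sort(np.outer(lam, mu).ravel())
    kron_eigs = np.sort(np.linalg.eigvalsh(np.kron(A, B)))
    return bool(np.allclose(products, kron_eigs))
```

In particular, taking \(B = I\) shows that \(A \otimes I\) has the same eigenvalues as \(A\), each repeated.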

We next define the following random kernel matrix.

Definition 3

Let \(w \sim \mathcal {N}(0,I)\). Then define the random \(n \times n\) matrix \(\mathcal {M}(w)\) with entries \([\mathcal {M}(w)]_{ij}= \sigma '(w^T x_i) \sigma '(w^T x_j)\).
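
For the ReLU activation, \(\sigma '\) is the step function, and for unit-norm inputs the expectation of \(\mathcal {M}(w)\) has the well-known closed form \(\tfrac{1}{2} - \arccos (x_i^T x_j)/(2\pi )\). The following sketch compares a Monte Carlo estimate with that closed form. The function names, the sample size and the convention \(\sigma '(0) = 0\) are assumptions of the sketch.

```python
import numpy as np

def expected_M_monte_carlo(X, num_samples=200_000, seed=0):
    """Monte Carlo estimate of E_w[M(w)] for ReLU, with w ~ N(0, I_d).

    X has shape (n, d); its rows are the inputs x_i, assumed to have unit norm.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((num_samples, d))
    # sigma'(w^T x_i) for ReLU is the indicator of w^T x_i > 0.
    S = (W @ X.T > 0).astype(float)
    # Average of the rank-one matrices s s^T over the sampled w.
    return S.T @ S / num_samples

def expected_M_closed_form(X):
    """Closed form 1/2 - arccos(x_i^T x_j) / (2 pi) for unit-norm rows of X."""
    cos = np.clip(X @ X.T, -1.0, 1.0)
    return 0.5 - np.arccos(cos) / (2 * np.pi)
```

The smallest eigenvalue of the closed-form matrix is the quantity that Lemma 5 below bounds from below.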

The next result from [12] establishes positive definiteness of this matrix in expectation, under the separation condition (7).

Lemma 5

([12]). Let \(x_1, \dots , x_n \in \mathbb {R}^d\) have unit Euclidean norm and assume that (7) is satisfied for all \(i = 1, \dots , n\). Then the following holds:

$$ \mathbb {E}_{w \sim \mathcal {N}(0,I)} [\mathcal {M}(w)] \succeq \frac{\delta _1 }{100 n^2}\, I. $$

Finally let \(\mathbf {X} = [\mathbf {X}_1, \dots , \mathbf {X}_n]\) be the block matrix with \(d\times (k+1)\) blocks \(\mathbf {X}_i\). Thanks to the assumption (8), the following result on the Gram matrices \(\mathbf {X}_i^T\mathbf {X}_i\) holds.

Lemma 6

Assume that the condition (8) is satisfied, then for any \(i = 1, \dots , n\) we have \(\lambda _{\text {min}} (\mathbf {X}_i^T\mathbf {X}_i) \ge 1 - k \delta _2 > 0\).


Proof. The claim follows from Gershgorin’s Disk Theorem:

$$ | \lambda _{\text {min}} (\mathbf {X}_i^T\mathbf {X}_i) - 1| \le \sum _{ 1 \le j \le k} |x_i^T v_{i,j}|\le k \delta _2. $$
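
The Gershgorin argument can be checked on a small synthetic block. Condition (8) is not restated in this appendix, so the sketch assumes, as the displayed inequality suggests, that \(x_i\) has unit norm, that the columns of \(V_i\) are orthonormal, and that \(|x_i^T v_{i,j}| \le \delta _2\). The construction in `make_block` and both function names are hypothetical.

```python
import numpy as np

def make_block(d=6, k=3, delta2=0.1, seed=0):
    """Build X_i = [x_i, V_i] with orthonormal V_i and x_i^T v_ij = delta2.

    Requires k * delta2**2 <= 1 and d >= k + 1 (assumptions of this example).
    """
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k + 1)))  # orthonormal columns
    V = Q[:, 1:]
    # Unit vector with inner product delta2 against every column of V.
    x = np.sqrt(1 - k * delta2**2) * Q[:, 0] + delta2 * V.sum(axis=1)
    return np.column_stack([x, V])

def gershgorin_lower_bound(G):
    """Smallest left endpoint of the Gershgorin disks of a symmetric matrix G."""
    radii = np.abs(G).sum(axis=1) - np.abs(np.diag(G))
    return float((np.diag(G) - radii).min())
```

For this construction the Gram matrix has unit diagonal, the Gershgorin bound equals \(1 - k\delta _2\), and the exact smallest eigenvalue is \(1 - \sqrt{k}\,\delta _2\), which is consistent with Lemma 6.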

Finally observe that we can write:

$$ \widehat{H}^\infty = \mathbb {E}_{w \sim \mathcal {N}(0,I)}\big [ (\mathbf {X}^T \mathbf {X})\square (\mathcal {M}(w) \otimes I ) \big ], $$

so that Proposition 2, Proposition 3, Lemma 5 and Lemma 6 allow us to derive the claim of Proposition 1.
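
For concreteness, the four results can be chained as follows. Proposition 1 is not restated in this appendix, so the final constant below is our reading of how the pieces combine rather than a quotation of its statement. Since \(\mathbf {X}^T \mathbf {X}\) is deterministic, the expectation can be moved onto \(\mathcal {M}(w)\); every block of \(\mathbb {E}[\mathcal {M}(w)] \otimes I\) is a multiple of the identity and therefore commutes with every block of \(\mathbf {X}^T \mathbf {X}\), so Proposition 2 applies:

$$\begin{aligned} {\lambda }_* = \lambda _{\text {min}}(\widehat{H}^\infty ) &= \lambda _{\text {min}}\big ( (\mathbf {X}^T \mathbf {X}) \square (\mathbb {E}[\mathcal {M}(w)] \otimes I) \big ) \\ &\ge \lambda _{\text {min}}\big (\mathbb {E}[\mathcal {M}(w)] \otimes I\big ) \cdot \min _{i} \lambda _{\text {min}}(\mathbf {X}_i^T \mathbf {X}_i) \\ &= \lambda _{\text {min}}\big (\mathbb {E}[\mathcal {M}(w)]\big ) \cdot \min _{i} \lambda _{\text {min}}(\mathbf {X}_i^T \mathbf {X}_i) \\ &\ge \frac{\delta _1}{100 n^2}\, (1 - k \delta _2). \end{aligned}$$

Here the second line uses Proposition 2, the third uses Proposition 3 (the eigenvalues of \(I\) are all equal to 1), and the last uses Lemma 5 and Lemma 6.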

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Cocola, J., Hand, P. (2020). Global Convergence of Sobolev Training for Overparameterized Neural Networks. In: Nicosia, G., et al. Machine Learning, Optimization, and Data Science. LOD 2020. Lecture Notes in Computer Science, vol 12565. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-64582-3

  • Online ISBN: 978-3-030-64583-0

  • eBook Packages: Computer Science, Computer Science (R0)
