
Improved Spectral Norm Regularization for Neural Networks

Conference paper in: Modeling Decisions for Artificial Intelligence (MDAI 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13890)

Abstract

We improve on a line of research that seeks to regularize the spectral norm of the Jacobian of the input-output mapping for deep neural networks. While previous work relies on upper bounding techniques, we propose a scheme that targets the exact spectral norm. We evaluate this regularization method empirically with respect to its generalization performance and robustness.

Our results demonstrate that this improved spectral regularization scheme outperforms L2-regularization as well as the previously used upper bounding technique. Moreover, our results suggest that exact spectral norm regularization and exact Frobenius norm regularization have comparable performance. We analyze these empirical findings in the light of the mathematical relations that hold between the spectral and the Frobenius norms. Lastly, in light of our evaluation we revisit an argument concerning the strong adversarial protection that Jacobian regularization provides and show that it can be misleading.

In summary, we propose a new regularization method and contribute to the practical and theoretical understanding of when one regularization method should be preferred over another.


Notes

  1.

    Note that the backward mode can be obtained by a standard backward pass to evaluate \(d(x^L \cdot u)/dx\). We refer to it here as backward mode to highlight the symmetry with the forward mode which does not have a standard equivalent counterpart.

  2.

    It is possible to perform power iteration multiple times to get a better estimate but we found that performing it once gave sufficiently accurate estimates.

  3.

    colab.research.google.com.

References

  1. Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous systems (2015). Software available from tensorflow.org


  2. Andriushchenko, M., Croce, F., Flammarion, N., Hein, M.: Square attack: a query-efficient black-box adversarial attack via random search. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12368, pp. 484–501. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58592-1_29

  3. Arani, E., Sarfraz, F., Zonooz, B.: Noise as a resource for learning in knowledge distillation. In: IEEE Winter Conference on Applications of Computer Vision, WACV 2021, Waikoloa, HI, USA, 3–8 January 2021, pp. 3128–3137. IEEE (2021)


  4. Chen, S.-T., Cornelius, C., Martin, J., Chau, D.H.: ShapeShifter: robust physical adversarial attack on Faster R-CNN object detector. In: ECML PKDD 2018. LNCS, vol. 11051, pp. 52–68. Springer, Cham (2018)

  5. Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., Ha, D.: Deep learning for classical Japanese literature. CoRR, abs/1812.01718 (2018)

  6. Collins, M.: Lecture notes on computational graphs, and backpropagation. Columbia University (2018). http://www.cs.columbia.edu/~mcollins/ff2.pdf. Accessed 19 Mar 2023

  7. Dong, X., Luu, A.T., Lin, M., Yan, S., Zhang, H.: How should pre-trained language models be fine-tuned towards adversarial robustness? In: Ranzato, M., Beygelzimer, A., Dauphin, Y.N., Liang, P., Vaughan, J.W. (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, 6–14 December 2021, virtual, pp. 4356–4369 (2021)


  8. Drucker, H., LeCun, Y.: Improving generalization performance using double backpropagation. IEEE Trans. Neural Networks 3(6), 991–997 (1992)


  9. Gu, S., Rigazio, L.: Towards deep neural network architectures robust to adversarial examples. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Workshop Track Proceedings (2015)


  10. Hanin, B., Rolnick, D.: Deep ReLU networks have surprisingly few activation patterns. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8–14 December 2019, Vancouver, BC, Canada, pp. 359–368 (2019)

  11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June, 2016, pp. 770–778. IEEE Computer Society (2016)


  12. Hoffman, J., Roberts, D.A., Yaida, S.: Robust learning with Jacobian regularization. CoRR, abs/1908.02729 (2019)


  13. Hull, J.J.: A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell. 16(5), 550–554 (1994)


  14. Johnson, S.G.: Notes on the equivalence of norms


  15. Kim, H.: Torchattacks: a PyTorch repository for adversarial attacks. arXiv preprint arXiv:2010.01950 (2020)

  16. Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical report (2009)


  17. Krogh, A., Hertz, J.A.: A simple weight decay can improve generalization. In: Moody, J.E., Hanson, S.J., Lippmann, R. (eds.) Advances in Neural Information Processing Systems 4, NIPS Conference, Denver, Colorado, USA, December 2–5, 1991, pp. 950–957. Morgan Kaufmann (1991)


  18. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

  19. LeCun, Y., Cortes, C.: MNIST handwritten digit database (2010)


  20. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net (2018)


  21. Mathov, Y., Levy, E., Katzir, Z., Shabtai, A., Elovici, Y.: Not all datasets are born equal: on heterogeneous tabular data and adversarial examples. Knowl. Based Syst. 242, 108377 (2022)


  22. Morgulis, N., Kreines, A., Mendelowitz, S., Weisglass, Y.: Fooling a real car with adversarial traffic signs. CoRR, abs/1907.00374 (2019)


  23. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Fürnkranz, J., Joachims, T. (eds.) Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21–24, 2010, Haifa, Israel, pp. 807–814. Omnipress (2010)

  24. Papernot, N., McDaniel, P.D., Wu, X., Jha, S., Swami, A.: Distillation as a defense to adversarial perturbations against deep neural networks. In: IEEE Symposium on Security and Privacy, SP 2016, San Jose, CA, USA, May 22–26, 2016, pp. 582–597. IEEE Computer Society (2016)


  25. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8–14 December 2019, Vancouver, BC, Canada, pp. 8024–8035 (2019)

  26. Silva, S.H., Najafirad, P.: Opportunities and challenges in deep learning adversarial robustness: a survey. CoRR, abs/2007.00753 (2020)


  27. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings (2015)


  28. Sokolic, J., Giryes, R., Sapiro, G., Rodrigues, M.R.D.: Robust large margin deep neural networks. IEEE Trans. Signal Process. 65(16), 4265–4280 (2017)


  29. Vaswani, A., et al.: Attention is all you need. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4–9, 2017, Long Beach, CA, USA, pp. 5998–6008 (2017)


  30. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747 (2017)

  31. Yoshida, Y., Miyato, T.: Spectral norm regularization for improving the generalizability of deep learning. CoRR, abs/1705.10941 (2017)


  32. Zhang, H., Yu, Y., Jiao, J., Xing, E.P., El Ghaoui, L., Jordan, M.I.: Theoretically principled trade-off between robustness and accuracy. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9–15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 7472–7482. PMLR (2019)



Acknowledgement

This work was supported by the Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation. The computations were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC) at Kebnekaise partially funded by the Swedish Research Council through grant agreement no. 2018-05973.

Author information

Correspondence to Anton Johansson.

Appendix

1.1 Experimental Details

Network Architectures. Following [12], we denote a convolutional-max-pool layer as a tuple (K, \(C_{in}\rightarrow C_{out}, S, P, M\)), where K is the width of the kernel, \(C_{in}\) is the number of in-channels, \(C_{out}\) the number of out-channels, S the stride, P the padding of the layer and M the kernel size of the max-pool following the convolutional layer. The case \(M=1\) can be seen as a convolutional layer followed by an identity function. Linear layers are denoted by the tuple (\(N_{in}\), \(N_{out}\)), where \(N_{in}\) is the dimension of the input and \(N_{out}\) the dimension of the output. For KMNIST and FashionMNIST we used the LeNet network, which consists of a convolutional-max-pool layer (5, \(1 \rightarrow 6\), 1, 2, 2), a convolutional-max-pool layer (5, \(6 \rightarrow 16\), 1, 0, 2), a linear layer (400, 120), a linear layer (120, 84) and a linear layer (84, 10).
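
For concreteness, the tuples above translate into PyTorch layers roughly as follows. This is a minimal sketch of our own, assuming ReLU activations after each convolutional and hidden linear layer (the placement of the activations is not specified by the tuples).

    import torch.nn as nn

    # Sketch of the LeNet architecture described above; the
    # (K, C_in -> C_out, S, P, M) tuples are mapped to PyTorch layers.
    # ReLU placement is an assumption on our part.
    lenet = nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=2),   # (5, 1 -> 6, 1, 2, 2)
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=0),  # (5, 6 -> 16, 1, 0, 2)
        nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(400, 120),  # (400, 120)
        nn.ReLU(),
        nn.Linear(120, 84),   # (120, 84)
        nn.ReLU(),
        nn.Linear(84, 10),    # (84, 10)
    )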

We use the VGG16 network as available from the torchvision package. For this network we use batch-norm layers directly after every convolutional layer. The network consists of the layers (3, \(3 \rightarrow 64\), 1, 1, 1), (3, \(64 \rightarrow 64\), 1, 1, 2), (3, \(64 \rightarrow 128\), 1, 1, 1), (3, \(128 \rightarrow 128\), 1, 1, 1), (3, \(128 \rightarrow 256\), 1, 1, 2), (3, \(256 \rightarrow 256\), 1, 1, 1), (3, \(256 \rightarrow 256\), 1, 1, 1), (3, \(256 \rightarrow 512\), 1, 1, 2), (3, \(512 \rightarrow 512\), 1, 1, 1), (3, \(512 \rightarrow 512\), 1, 1, 1), (3, \(512 \rightarrow 512\), 1, 1, 2), (512, 10).

Training Details. We train the LeNet networks for 50 epochs with SGD (momentum 0.8). For every regularization method we perform a hyperparameter search over the following three parameters and values:

  • Learning rate: [0.01, 0.001]

  • Batch size: [16, 32]

  • Weight factor \(\lambda \): [0.0001, 0.001, 0.01, 0.1]

For the VGG16 network we trained for 100 epochs with a batch size of 128 and SGD with a momentum of 0.8, and performed a hyperparameter search over the following parameter and values:

  • Weight factor \(\lambda \): [0.00001, 0.0001, 0.001, 0.01, 0.1]

For VGG16 we additionally used a cosine annealing learning rate scheduler with an initial learning rate of 0.1 and the data augmentation techniques of random cropping and horizontal flipping.

For each hyperparameter setting we repeat the training procedure 5 times to be able to obtain mean and standard deviation. We pick the final representative model for each regularization method as the one that achieves the lowest mean validation loss over these 5 training runs.

For the Frobenius regularization we set \(n_{proj} = 1\) and for the Spectral-Bound we estimate the spectral norm of the weight matrices through one power iteration.
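
As an illustration of the power-iteration estimate of \(||W^l||_2\) used for Spectral-Bound, a generic sketch (not the authors' implementation) could look as follows.

    import torch

    def weight_spectral_norm(W, n_iter=1):
        """Estimate ||W||_2 of a single weight matrix with power iteration.
        Sketch only; the appendix states that one iteration is used."""
        v = torch.randn(W.shape[1])
        v = v / v.norm()
        for _ in range(n_iter):
            u = W @ v
            u = u / u.norm()
            v = W.t() @ u
            v = v / v.norm()
        return (W @ v).norm()  # approximates the largest singular value of W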

Details for Figures. Figure 4: The model for each regularization method was chosen randomly among the 5 models from the hyperparameter setting that obtained the best results in Table 1. The distance is only calculated for the points in the validation set that all models predict correctly. In total the distance is computed for between 8000 and 9000 validation points on FashionMNIST and KMNIST.

Figure 5 (left): The time for a batch was measured on a computer with an NVIDIA K80 GPU, as available through Google Colab (see footnote 3). The analytical method works by sequentially calculating \(d(x^L \cdot e_i)/dx\), where \(e_i\) is a basis vector of \(\mathbb {R}^{n_{out}}\), for \(i=1,2,...,n_{out}\). This yields the full Jacobian matrix, whose singular values we then compute with built-in PyTorch functions.
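
A sketch of this analytical baseline, reconstructed from the description above (our own illustration, assuming a function f that maps a 1-D input to a 1-D output):

    import torch

    def exact_jacobian_spectral_norm(f, x):
        """Compute ||df/dx||_2 exactly: assemble the full Jacobian from one
        backward pass per output basis vector e_i, then take its singular values."""
        x = x.clone().requires_grad_(True)
        y = f(x)                              # shape: (n_out,)
        rows = []
        for i in range(y.shape[0]):
            e_i = torch.zeros_like(y)
            e_i[i] = 1.0
            (g,) = torch.autograd.grad(y, x, grad_outputs=e_i, retain_graph=True)
            rows.append(g)                    # row i of the Jacobian: d(y . e_i)/dx
        J = torch.stack(rows)                 # shape: (n_out, n_in)
        return torch.linalg.svdvals(J)[0]     # largest singular value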

Figure 5 (right): The upper bound was evaluated on a network trained with the Spectral-Bound regularization scheme for all data points in the training set. The curve for the spectral method was evaluated on a network trained with the spectral method for all data points in the training set. For the spectral method there was no significant difference in the shape of the curve when using a different network or by working with data points in the validation set.

1.2 Conversion Between Operators

In this section we detail how to convert between the forward operator F, the backward operator \(F^T\) and the regular operator G. The conversions can be seen in Tables 3, 4 and 5. Other non-linearities such as Dropout can be incorporated identically to ReLU by simply storing the active neurons in a boolean matrix Z.

Skip-Connections. Utilizing networks with skip-connections does not change the forward and backward modes: simply turn off the bias of all layer transformations and replace the activation functions with the matrices \(Z_R^i\). That this holds follows from the definition of a network with skip-connections. For simplicity of presentation, we assume that each skip-connection skips only one layer. Assume that we have a network with L layers and, additionally, skip-connections between layers with indices in the set \(\mathcal {S} := \{s_1,s_2,...,s_m\},~1\le s_i \le L\). Then the network \(f_{\theta }\) is given recursively as before with

$$ x^l = {\left\{ \begin{array}{ll} f^l(G^l(x^{l-1}) + b^l) &{}\text{ if } l \in \mathcal {S}^C, \\ x^{l-1} + f^l(G^l(x^{l-1}) + b^l) &{} \text{ if } l \in \mathcal {S}. \end{array}\right. } $$

Assuming that we only work with linear or piecewise linear operators \(G^l\), then for \(x \in R\) each operator can be represented as a matrix and we can write the derivative in the two cases as

$$ \frac{dx^l}{dx^{l-1}} = {\left\{ \begin{array}{ll} Z^lW^l &{}\text{ if } l \in \mathcal {S}^C, \\ I + Z^lW^l &{} \text{ if } l \in \mathcal {S}, \end{array}\right. } $$

where I denotes the identity matrix. The Jacobian-vector product \(W_Rv\) can thus be obtained as

$$\begin{aligned} W_Rv = \bigg (\prod _{l=1}^L (I - \mathbb {I}\{l \in \mathcal {S}^C\} + Z^lW^l)\bigg )v \end{aligned}$$
(13)

where \(\mathbb {I}\{l \in \mathcal {S}^C\}\) is an indicator that removes the identity term when layer l has no skip-connection, which lets us write the two cases concisely. Thus we can interpret this equation in the same manner as for networks without skip-connections: we simply pass the input v through the network with all biases turned off and the activation functions replaced by \(Z^l\). The same is true for the backward mode (Table 4).
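
For fully connected layers, Eq. (13) can be sketched as follows (our own illustration, assuming the weight matrices \(W^l\) and 0/1 masks \(Z^l\) are given as lists of tensors):

    import torch

    def forward_mode_jvp(v, weights, masks, skip_layers):
        """Compute W_R v as in Eq. (13): biases are dropped and each activation
        is replaced by its 0/1 mask Z^l; skip_layers is the set S of skip indices."""
        out = v
        for l, (W, Z) in enumerate(zip(weights, masks), start=1):
            linear = Z * (W @ out)   # Z^l W^l out, with Z^l acting as a diagonal mask
            out = linear + out if l in skip_layers else linear
        return out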

Table 3. Conversion table for the linear operator.
Table 4. Conversion table for the convolutional operator.
Table 5. Conversion table for the max-pool operator.

1.3 Time Efficiency and Relative Error

In this section we investigate the difference between targeting the exact spectral norm of the Jacobian and working with an upper bound. From Table 1 we saw that targeting the exact norm yields improved generalization performance, and from Fig. 3 we observed that the two methods provide similar protection against noise, with different strengths against different attacks on the two considered data sets.

While improved generalization performance is beneficial, it cannot come at too large a computational cost. Additionally, with approximate methods it is important to measure the trade-off between computational speed and the accuracy of the approximated quantity. We thus analyze the computational overhead that the methods add to the training routine and their relative error with respect to the analytical spectral norm.

Fig. 5. Time and error comparison between the Spectral and Spectral-Bound methods for the LeNet network. (Left) Time taken to pass over one batch of data points. The Spectral method is slower than the Spectral-Bound method for larger batch sizes but still around two orders of magnitude faster than calculating the exact spectral norm analytically. (Right) The relative error as the number of power iterations is increased. The relative error decreases quickly and is significantly closer to the exact quantity than the upper bound \(\prod _l ||W^l||_2\).

In Fig. 5 (left) we show the average time taken to optimize over a batch for the Spectral method, the Spectral-Bound method, an analytical method that calculates \(||W_R||_2\) exactly, and a regular forward pass. In Fig. 5 (right) we show the relative error of the power iteration scheme.

From these plots we see that our method incurs a small additional cost compared to regularizing with Spectral-Bound, but that it has a significantly lower relative error while still being orders of magnitude faster than calculating the analytical spectral norm.
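
For reference, the relative error in Fig. 5 (right) can be computed along the following lines, where the spectral-norm estimate is built from vector-Jacobian and Jacobian-vector products through PyTorch's autograd (a sketch of our own, not the forward/backward-mode implementation used in the paper):

    import torch

    def jacobian_power_iteration(f, x, n_iter=1):
        """Estimate ||df/dx||_2 for a single input x with power iteration,
        using one VJP and one JVP per iteration. Sketch only; f is assumed
        to map a 1-D input to a 1-D output."""
        x = x.clone().requires_grad_(True)
        y = f(x)
        u = torch.randn_like(y)
        u = u / u.norm()
        sigma = torch.zeros(())
        for _ in range(n_iter):
            (v,) = torch.autograd.grad(y, x, grad_outputs=u, retain_graph=True)  # J^T u
            v = v / v.norm()
            _, Jv = torch.autograd.functional.jvp(f, (x.detach(),), (v,))        # J v
            sigma = Jv.norm()                 # estimate of the largest singular value
            u = Jv / sigma
        return sigma

    # Relative error against the exact value from the analytical routine:
    # rel_err = (sigma_est - sigma_exact).abs() / sigma_exact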

1.4 Proof for Extension Scheme

We denote by G the directed acyclic graph which, when summing the product of the edge elements along every path from output to input, yields \((df/dx)^T\).

Theorem: Consider the graph F obtained by flipping the direction of all edges of G and adding a node at the end of F with edge elements given by the components of v. Summing the product of the edge elements along every path from output to input of F yields (df/dx)v.

Proof: We follow the notation of [6], Theorem 1, and write the Jacobian between the variables \(y=f_{\theta }(x)\) and x as the sum over all paths of the product of the intermediate Jacobians, meaning

$$\begin{aligned} \frac{dy}{dx} = \sum _{p \in \mathcal {P}(x,y)} \prod _{(a,b) \in p} J^{a\rightarrow b}(\alpha ^b) \end{aligned}$$
(14)

where \(\mathcal {P}(x,y)\) is the set of all directed paths between x and y and (a, b) ranges over the successive edges of a given path.

In our scheme we flip the direction of all relevant edges and add a fictitious node at the end of the flipped paths. Since we preserve the edge elements, flipping the direction of an edge simply transposes the local Jacobian, meaning that \(J^{b\rightarrow a}(\alpha ^b) = \big (J^{a\rightarrow b}(\alpha ^b)\big )^T\) in our scheme. Further, the added fictitious node has edge elements given by the elements of v, and the Jacobian between that node and the subsequent layer is thus given by \(v^T\). For a path \(p=[(v_1,v_2), (v_2,v_3),...,(v_{n-1}, v_n)]\) we define the flipped path with the added fictitious node as \(p^T = [(v_n, v_{n-1}), ..., (v_2,v_1), (v_1, v_f)]\) and the reverse-order path \(\lnot p\) as \(\lnot p = [( v_{n-1}, v_n), ..., (v_2,v_3),(v_1,v_2)]\). For our modified graph we thus have the Jacobian for a path as

$$\begin{aligned} \prod _{(a,b) \in p^T} J^{a\rightarrow b}(\alpha ^b)&= \prod _{(a,b) \in p^T} \big (J^{b\rightarrow a}(\alpha ^b)\big )^T \end{aligned}$$
(15)
$$\begin{aligned}&=v^T\bigg (\prod _{(a,b) \in p} J^{a\rightarrow b}(\alpha ^b)\bigg )^T\end{aligned}$$
(16)
$$\begin{aligned}&=v^T\bigg (\prod _{(a,b) \in \lnot p} J^{a\rightarrow b}(\alpha ^b)^T\bigg ) \end{aligned}$$
(17)

Denoting the fictitious node as \(n_f\) and summing over all paths we thus get

$$\begin{aligned}&\sum _{p^T \in \mathcal {P}(y,n_f)} \prod _{(a,b) \in p^T} J^{a\rightarrow b}(\alpha ^b) \end{aligned}$$
(18)
$$\begin{aligned}&= \sum _{p^T \in \mathcal {P}(y,n_f)}v^T\bigg (\prod _{(a,b) \in \lnot p} J^{a\rightarrow b}(\alpha ^b)^T\bigg )\end{aligned}$$
(19)
$$\begin{aligned}&=v^T\sum _{p^T \in \mathcal {P}(y,n_f)}\bigg (\prod _{(a,b) \in \lnot p} J^{a\rightarrow b}(\alpha ^b)^T\bigg )\end{aligned}$$
(20)
$$\begin{aligned}&=v^T\big (\frac{dy}{dx}\big )^T = (\frac{dy}{dx}v)^T \end{aligned}$$
(21)

which proves that working with the modified graph yields the desired matrix-vector product \(\frac{dy}{dx}v\). \(\square \)
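
As a quick numerical sanity check of this result (our own addition, not part of the proof), the Jacobian-vector product obtained through automatic differentiation can be compared with the product of the explicitly assembled Jacobian and v:

    import torch

    torch.manual_seed(0)
    f = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 3))
    x = torch.randn(4)
    v = torch.randn(4)

    # (df/dx) v computed as a Jacobian-vector product
    _, jvp_result = torch.autograd.functional.jvp(f, (x,), (v,))

    # The same product via the full Jacobian
    J = torch.autograd.functional.jacobian(f, x)   # shape: (3, 4)
    assert torch.allclose(jvp_result, J @ v, atol=1e-5)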


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG


Cite this paper

Johansson, A., Engsner, N., Strannegård, C., Mostad, P. (2023). Improved Spectral Norm Regularization for Neural Networks. In: Torra, V., Narukawa, Y. (eds.) Modeling Decisions for Artificial Intelligence. MDAI 2023. Lecture Notes in Computer Science, vol. 13890. Springer, Cham. https://doi.org/10.1007/978-3-031-33498-6_13
