DOI: 10.1145/3651671.3651740

Better Generalization in Fast Training: Flat Trainable Weight in Subspace

Published: 07 June 2024

Abstract

Compressing the training time of deep neural networks (DNNs) is a critical task given the huge scale of modern data and models. Unlike most previous works, which reduce training time by using large batch sizes, we compress the number of training epochs through a newly designed training algorithm. It is well known that simply shortening the learning rate schedule causes a significant loss of generalization. In this paper, we propose to maintain test accuracy while compressing training epochs by optimizing in an extended subspace spanned by historical model parameters along the SGD training trajectory. Although the use of historical information has been studied in Trainable Weight Averaging (TWA), we design a new algorithm, Flat Trainable Weights (FTW), that optimizes the weight coefficients with an explicit sharpness loss in the extended low-dimensional subspace, which yields better generalization. We show that FTW achieves significant improvements in model generalization over both TWA and SGD. In fast training, FTW accelerates convergence and saves 15% of training time over TWA and 35% over SGD on the CIFAR datasets.
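
The abstract describes FTW only at a high level, so the following is a minimal sketch of the idea rather than the paper's implementation: it assumes the low-dimensional subspace is spanned directly by SGD checkpoints, uses a SAM-style adversarial perturbation as a stand-in for the "explicit sharpness loss", and optimizes only the K combination coefficients. The toy model, data, and every name and hyperparameter here (P, alpha, rho, the learning rates, the number of checkpoints) are illustrative assumptions, and the sketch assumes a recent PyTorch with torch.func.functional_call.

```python
# Hypothetical FTW-style sketch: optimize combination coefficients of SGD
# checkpoints under a sharpness-aware objective. Not the authors' code.
import torch
import torch.nn as nn
from torch.func import functional_call

torch.manual_seed(0)

# Toy network and data stand in for the model trained with SGD.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.randn(512, 20)
y = torch.randint(0, 10, (512,))
loss_fn = nn.CrossEntropyLoss()

names  = [n for n, _ in model.named_parameters()]
shapes = [p.shape for p in model.parameters()]
sizes  = [p.numel() for p in model.parameters()]

def flat_params():
    """Current model parameters as one flat vector."""
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def unflatten(vec):
    """Split a flat vector back into a name -> tensor dict for functional_call."""
    out, offset = {}, 0
    for name, shape, size in zip(names, shapes, sizes):
        out[name] = vec[offset:offset + size].reshape(shape)
        offset += size
    return out

# Collect a few historical checkpoints along a (shortened) SGD trajectory.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
checkpoints = []
for epoch in range(5):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
    checkpoints.append(flat_params())

P = torch.stack(checkpoints, dim=1)              # subspace basis, shape (d, K)
K = P.shape[1]
alpha = torch.full((K,), 1.0 / K, requires_grad=True)  # start from an SWA-style average

def subspace_loss(a):
    """Training loss at the point P @ a of the K-dimensional subspace."""
    w = P @ a
    return loss_fn(functional_call(model, unflatten(w), (x,)), y)

# Optimize the coefficients with a SAM-style sharpness objective:
# ascend to the worst nearby coefficients, then descend on that loss.
rho = 0.05                                       # perturbation radius (assumed)
coef_opt = torch.optim.Adam([alpha], lr=0.01)
for step in range(100):
    grad, = torch.autograd.grad(subspace_loss(alpha), alpha)
    eps = rho * grad / (grad.norm() + 1e-12)     # worst-case direction in coefficient space
    coef_opt.zero_grad()
    subspace_loss(alpha + eps).backward()        # gradient at the perturbed point
    coef_opt.step()

# Load the optimized combination back into the model.
with torch.no_grad():
    torch.nn.utils.vector_to_parameters(P @ alpha, model.parameters())
    print("final training loss:", subspace_loss(alpha).item())
```

Because the search space has only K dimensions, each coefficient update is cheap compared with a full SGD epoch over the original parameter space, which is presumably where the reported savings over TWA and SGD come from.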


    Published In

    ICMLC '24: Proceedings of the 2024 16th International Conference on Machine Learning and Computing
    February 2024, 757 pages
    ISBN: 9798400709234
    DOI: 10.1145/3651671

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. accelerate convergence
    2. fast training
    3. generalization

