DOI: 10.1145/3651671.3651740

Better Generalization in Fast Training: Flat Trainable Weight in Subspace

Published: 07 June 2024

Abstract

Compressing the training time of deep neural networks (DNNs) is a critical task given the huge scale of modern data and models. Unlike most previous works, which reduce training time by using large batch sizes, we compress the number of training epochs through a newly designed training algorithm. It is well known that simply shortening the learning rate schedule causes a significant loss of generalization. In this paper, we propose to maintain test accuracy while compressing training epochs by optimizing in an extended subspace spanned by historical model parameters along the SGD training trajectory. Although the use of historical information has been studied in Trainable Weight Averaging (TWA), we design a new algorithm, Flat Trainable Weights (FTW), that optimizes the weight coefficients with an explicit sharpness loss in the extended low-dimensional subspace, which yields better generalization. We show that FTW achieves significant improvements in model generalization over both TWA and SGD. In fast training, FTW accelerates convergence and saves 15% of training time over TWA and 35% over SGD on the CIFAR datasets.
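
The abstract describes FTW only at a high level, so the following is a minimal sketch of the idea rather than the paper's implementation: it assumes the low-dimensional subspace is spanned directly by SGD checkpoints, uses a SAM-style adversarial perturbation as a stand-in for the "explicit sharpness loss", and optimizes only the K combination coefficients. The toy model, data, and every name and hyperparameter here (P, alpha, rho, the learning rates, the number of checkpoints) are illustrative assumptions, and the sketch assumes a recent PyTorch with torch.func.functional_call.

```python
# Hypothetical FTW-style sketch: optimize combination coefficients of SGD
# checkpoints under a sharpness-aware objective. Not the authors' code.
import torch
import torch.nn as nn
from torch.func import functional_call

torch.manual_seed(0)

# Toy network and data stand in for the model trained with SGD.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.randn(512, 20)
y = torch.randint(0, 10, (512,))
loss_fn = nn.CrossEntropyLoss()

names  = [n for n, _ in model.named_parameters()]
shapes = [p.shape for p in model.parameters()]
sizes  = [p.numel() for p in model.parameters()]

def flat_params():
    """Current model parameters as one flat vector."""
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def unflatten(vec):
    """Split a flat vector back into a name -> tensor dict for functional_call."""
    out, offset = {}, 0
    for name, shape, size in zip(names, shapes, sizes):
        out[name] = vec[offset:offset + size].reshape(shape)
        offset += size
    return out

# Collect a few historical checkpoints along a (shortened) SGD trajectory.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
checkpoints = []
for epoch in range(5):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
    checkpoints.append(flat_params())

P = torch.stack(checkpoints, dim=1)              # subspace basis, shape (d, K)
K = P.shape[1]
alpha = torch.full((K,), 1.0 / K, requires_grad=True)  # start from an SWA-style average

def subspace_loss(a):
    """Training loss at the point P @ a of the K-dimensional subspace."""
    w = P @ a
    return loss_fn(functional_call(model, unflatten(w), (x,)), y)

# Optimize the coefficients with a SAM-style sharpness objective:
# ascend to the worst nearby coefficients, then descend on that loss.
rho = 0.05                                       # perturbation radius (assumed)
coef_opt = torch.optim.Adam([alpha], lr=0.01)
for step in range(100):
    grad, = torch.autograd.grad(subspace_loss(alpha), alpha)
    eps = rho * grad / (grad.norm() + 1e-12)     # worst-case direction in coefficient space
    coef_opt.zero_grad()
    subspace_loss(alpha + eps).backward()        # gradient at the perturbed point
    coef_opt.step()

# Load the optimized combination back into the model.
with torch.no_grad():
    torch.nn.utils.vector_to_parameters(P @ alpha, model.parameters())
    print("final training loss:", subspace_loss(alpha).item())
```

Because the search space has only K dimensions, each coefficient update is cheap compared with a full SGD epoch over the original parameter space, which is presumably where the reported savings over TWA and SGD come from.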


    Published In

    ICMLC '24: Proceedings of the 2024 16th International Conference on Machine Learning and Computing
    February 2024, 757 pages
    ISBN: 9798400709234
    DOI: 10.1145/3651671

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. accelerate convergence
    2. fast training
    3. generalization

