Abstract
Sequential Greedy Architecture Search (SGAS) reduces the discretization loss of Differentiable Architecture Search (DARTS). However, we observe that SGAS can produce unstable search results, just as DARTS does; we refer to this problem as the cascade performance collapse issue. We therefore propose Sequential Greedy Architecture Search with an Early Stopping Indicator (SGAS-es), which adopts an early stopping mechanism in each phase of SGAS to stabilize the search results and further improve search ability. The early stopping mechanism builds on the relation among flat minima, the largest eigenvalue of the Hessian matrix of the loss function, and performance collapse: we give a mathematical derivation connecting flat minima to the largest eigenvalue, and we use a moving average of the largest eigenvalue as the early stopping indicator. Finally, we use NAS-Bench-201 and Fashion-MNIST to confirm the performance and stability of SGAS-es, and EMNIST-Balanced to verify the transferability of the searched results. These experiments show that SGAS-es is a robust method that derives architectures with good performance and transferability.
References
Liu, H., Simonyan, K., Yang, Y.: DARTS: differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018)
Yang, A., Esperança, P.M., Carlucci, F.M.: NAS evaluation is frustratingly hard. arXiv preprint arXiv:1912.12522 (2019)
Zela, A., Elsken, T., Saikia, T., Marrakchi, Y., Brox, T., Hutter, F.: Understanding and robustifying differentiable architecture search. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=H1gDNyrKDS
Xie, L., et al.: Weight-sharing neural architecture search: a battle to shrink the optimization gap. ACM Comput. Surv. (2022)
Chen, X., Xie, L., Wu, J., Tian, Q.: Progressive differentiable architecture search: bridging the depth gap between search and evaluation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1294–1303 (2019)
Liang, H., et al.: DARTS+: improved differentiable architecture search with early stopping. arXiv preprint arXiv:1909.06035 (2019)
Chen, X., Hsieh, C.-J.: Stabilizing differentiable architecture search via perturbation-based regularization. In: International Conference on Machine Learning, PMLR (2020)
Xu, Y., et al.: PC-DARTS: partial channel connections for memory-efficient architecture search. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=BJlS634tPr
Dong, X., Yang, Y.: NAS-Bench-201: extending the scope of reproducible neural architecture search. arXiv preprint arXiv:2001.00326 (2020)
Li, G., Qian, G., Delgadillo, I.C., Müller, M., Thabet, A., Ghanem, B.: SGAS: sequential greedy architecture search. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large-batch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836 (2016)
Elsken, T., Metzen, J.H., Hutter, F.: Neural architecture search: a survey. J. Mach. Learn. Res. 20(1), 1997–2017 (2019)
Chu, X., Zhou, T., Zhang, B., Li, J.: Fair DARTS: eliminating unfair advantages in differentiable architecture search. In: 16th European Conference on Computer Vision (2020). https://arxiv.org/abs/1911.12126.pdf
Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: International Conference on Machine Learning, PMLR (2017)
Dong, X., Liu, L., Musial, K., Gabrys, B.: NATS-bench: benchmarking NAS algorithms for architecture topology and size. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 44(7), 3634–3646 (2021)
Chrabaszcz, P., Loshchilov, I., Hutter, F.: A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819 (2017)
Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)
LeCun, Y.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998)
Guilin, L., Xing, Z., Zitong, W., Zhenguo, L., Tong, Z.: StacNAS: towards stable and consistent optimization for differentiable neural architecture search (2019)
Mao, Y., Zhong, G., Wang, Y., Deng, Z.: Differentiable light-weight architecture search. In: 2021 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2021)
Zhong, Z., et al.: Random erasing data augmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07 (2020)
Rajasegaran, J., Jayasundara, V., Jayasekara, S., Jayasekara, H., Seneviratne, S., Rodrigo, R.: DeepCaps: going deeper with capsule networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10725–10733 (2019)
Nøkland, A., Eidnes, L.H.: Training neural networks with local error signals. In: International Conference on Machine Learning, pp. 4839–4850. PMLR (2019)
Cohen, G., Afshar, S., Tapson, J., Van Schaik, A.: EMNIST: extending MNIST to handwritten letters. In: 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. IEEE (2017)
Jeevan, P., Sethi, A.: WaveMix: resource-efficient token mixing for images. arXiv preprint arXiv:2203.03689 (2022)
Kabir, H., et al.: SpinalNet: deep neural network with gradual input. arXiv preprint arXiv:2007.03347 (2020)
Jayasundara, V., Jayasekara, S., Jayasekara, H., Rajasegaran, J., Seneviratne, S., Rodrigo, R.: TextCaps: handwritten character recognition with very small datasets. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 254–262. IEEE (2019)
Appendices
Appendix A: Fashion-MNIST Experiment Settings
In the searching stage, we use half of the images in the train set for training and the other half for validation. The data augmentation for training images is as follows: first, a random crop (RandomCrop()) to a height and width of 32 with a padding of 4; second, a random horizontal flip (RandomHorizontalFlip()); third, conversion of the inputs into tensors with image values normalized to [0, 1] (ToTensor()); last, a further normalization (Normalize()) with the mean and the standard deviation of the Fashion-MNIST training images. Validation images only go through the last two steps (ToTensor() and Normalize()).
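A minimal torchvision sketch of this pipeline, assuming the commonly quoted Fashion-MNIST train-set mean of 0.2860 and standard deviation of 0.3530 (the exact values are not listed in the paper), might look as follows:

```python
from torchvision import transforms

# Assumed per-channel statistics of the Fashion-MNIST train set.
FMNIST_MEAN, FMNIST_STD = (0.2860,), (0.3530,)

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),           # 32x32 crop after padding by 4
    transforms.RandomHorizontalFlip(),              # random horizontal flip
    transforms.ToTensor(),                          # to tensor, values in [0, 1]
    transforms.Normalize(FMNIST_MEAN, FMNIST_STD),  # standardize with train statistics
])

valid_transform = transforms.Compose([              # validation: last two steps only
    transforms.ToTensor(),
    transforms.Normalize(FMNIST_MEAN, FMNIST_STD),
])
```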
Hyperparameter settings of the searching stage are as follows. The batch size is 32 (128 for PC-DARTS, since only \(\frac{1}{4}\) of the channels go through the mixed operation), and the batch increase is 8. An SGD optimizer is used for the network weights with an initial learning rate of 0.025, a minimum learning rate of 0.001, a momentum of 0.9, and a weight decay of 0.0003; a cosine annealing scheduler handles the learning rate decay.
An Adam optimizer is used for the architecture parameters with a learning rate of 0.0003, a weight decay of 0.001, and beta values of 0.5 and 0.999. Besides, gradient clipping is applied with a max norm of 5. The supernet has 16 initial channels and 8 cells. For SGAS-es, we set df, w, k, and T to 15, 3, 5, and 1.3, respectively. For the other DARTS-based methods, the number of search epochs is set to 50.
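As a rough illustration, this optimization setup can be reproduced with standard PyTorch utilities; `model` and `alphas` below are hypothetical stand-ins for the supernet weights and the architecture parameters, not the authors' actual code:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the supernet and its architecture parameters.
model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
alphas = [torch.zeros(14, 8, requires_grad=True)]  # shape is illustrative only

# SGD for network weights, with cosine annealing down to the minimum learning rate.
w_optimizer = torch.optim.SGD(model.parameters(), lr=0.025,
                              momentum=0.9, weight_decay=3e-4)
w_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    w_optimizer, T_max=50, eta_min=0.001)

# Adam for architecture parameters.
a_optimizer = torch.optim.Adam(alphas, lr=3e-4,
                               betas=(0.5, 0.999), weight_decay=1e-3)

# After each backward pass on the network weights, clip gradients to max norm 5:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
```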
In the retraining stage, we use the whole train set to train the derived network for 600 epochs. The data augmentation for the train set and the test set is the same as that for the train set and the validation set in the searching stage, except that two more tricks are added for the train set: a cutout with a length of 16 and random erasing, as sketched below.
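torchvision ships RandomErasing, but cutout is not built in; a minimal sketch of both tricks, applied after ToTensor() and Normalize(), could be:

```python
import torch
from torchvision import transforms

class Cutout:
    """Zero out one random square patch of side `length` in a CHW image tensor."""
    def __init__(self, length):
        self.length = length

    def __call__(self, img):
        _, h, w = img.shape
        cy = torch.randint(h, (1,)).item()  # random patch center
        cx = torch.randint(w, (1,)).item()
        y1, y2 = max(0, cy - self.length // 2), min(h, cy + self.length // 2)
        x1, x2 = max(0, cx - self.length // 2), min(w, cx + self.length // 2)
        img[:, y1:y2, x1:x2] = 0.0
        return img

extra_train_tricks = transforms.Compose([
    Cutout(length=16),           # cutout with a length of 16
    transforms.RandomErasing(),  # random erase with torchvision defaults
])
```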
Hyperparameter settings of the retraining stage are as follows. The batch size is 72. The SGD optimizer has an initial learning rate of 0.025, a momentum of 0.9, and a weight decay of 0.0003. Besides the weight decay, drop path is used for regularization with a drop path probability of 0.2. The cosine annealing scheduler is again used for learning rate decay. The initial channel size is 36, and the number of cells is 20. An auxiliary loss is used with an auxiliary weight of 0.4.
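For the regularization side, a generic drop path (stochastic depth) implementation along the lines commonly used in DARTS retraining code, not necessarily the authors' exact version, looks like this:

```python
import torch

def drop_path(x, drop_prob=0.2, training=True):
    """Randomly drop an entire residual branch per sample, rescaling survivors."""
    if not training or drop_prob <= 0.0:
        return x
    keep_prob = 1.0 - drop_prob
    # One Bernoulli mask per sample, broadcast over channels and spatial dims.
    mask = torch.bernoulli(torch.full((x.size(0), 1, 1, 1), keep_prob,
                                      device=x.device))
    return x / keep_prob * mask

# The auxiliary head contributes to the total loss with weight 0.4:
# loss = criterion(logits, target) + 0.4 * criterion(logits_aux, target)
```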
The other DARTS-based methods (such as SGAS and PC-DARTS) and the original DARTS in Table 2 follow the same settings, which are close to those reported in their respective papers.
Appendix B: EMNIST-Balanced Experiment Settings
Most experiment settings are the same as those of the retraining stage of Fashion-MNIST, with the following differences: we use the whole train set to train the derived network for 200 epochs, and the batch size is 96. Each training image is first resized (Resize()) to 32\(\,\times \,\)32; second, a random affine transformation (RandomAffine()) is applied with degrees of (-30, 30), a translate of (0.1, 0.1), a scale of (0.8, 1.2), and a shear of (-30, 30); third, the inputs are transformed into tensors with values normalized between 0 and 1 (ToTensor()); last, the inputs are further normalized (Normalize()) with a mean and a standard deviation of 0.5. Test images only go through ToTensor() and Normalize().
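A minimal torchvision sketch of this EMNIST-Balanced pipeline might look as follows (the test pipeline reuses the same normalization):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((32, 32)),                # resize to 32x32
    transforms.RandomAffine(degrees=(-30, 30),  # rotation range
                            translate=(0.1, 0.1),
                            scale=(0.8, 1.2),
                            shear=(-30, 30)),
    transforms.ToTensor(),                      # values scaled to [0, 1]
    transforms.Normalize((0.5,), (0.5,)),       # mean and std both 0.5
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
```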