Abstract
Sequential Greedy Architecture Search (SGAS) reduces the discretization loss of Differentiable Architecture Search (DARTS). However, we observe that SGAS can produce unstable search results, just as DARTS does; we refer to this problem as the cascade performance collapse issue. We therefore propose Sequential Greedy Architecture Search with an Early Stopping Indicator (SGAS-es), which adopts an early stopping mechanism in each phase of SGAS to stabilize the search results and further improve search ability. The early stopping mechanism builds on the relation among flat minima, the largest eigenvalue of the Hessian matrix of the loss function, and performance collapse: we give a mathematical derivation connecting flat minima to the largest eigenvalue, and we use a moving average of the largest eigenvalue as the early stopping indicator. Finally, we use NAS-Bench-201 and Fashion-MNIST to confirm the performance and stability of SGAS-es, and EMNIST-Balanced to verify the transferability of the searched results. These experiments show that SGAS-es is a robust method that derives architectures with good performance and transferability.
References
Liu, H., Simonyan, K., Yang, Y.: DARTS: differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018)
Yang, A., Esperança, P.M., Carlucci, F.M.: NAS evaluation is frustratingly hard. arXiv preprint arXiv:1912.12522 (2019)
Zela, A., Elsken, T., Saikia, T., Marrakchi, Y., Brox, T., Hutter, F.: Understanding and robustifying differentiable architecture search. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=H1gDNyrKDS
Xie, L., et al.: Weight-sharing neural architecture search: a battle to shrink the optimization gap. ACM Comput. Surv. (2022)
Chen, X., Xie, L., Wu, J., Tian, Q.: Progressive differentiable architecture search: bridging the depth gap between search and evaluation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1294–1303 (2019)
Liang, H., et al.: DARTS+: improved differentiable architecture search with early stopping. arXiv preprint arXiv:1909.06035 (2019)
Chen, X., Hsieh, C.-J.: Stabilizing differentiable architecture search via perturbation-based regularization. In: International Conference on Machine Learning, PMLR (2020)
Xu, Y., et al.: PC-DARTS: partial channel connections for memory-efficient architecture search. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=BJlS634tPr
Dong, X., Yang, Y.: NAS-Bench-201: extending the scope of reproducible neural architecture search. arXiv preprint arXiv:2001.00326 (2020)
Li, G., Qian, G., Delgadillo, I.C., Müller, M., Thabet, A., Ghanem, B.: SGAS: sequential greedy architecture search. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Keskar, N.S., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P.T.P.: On large-batch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836 (2016)
Elsken, T., Metzen, J.H., Hutter, F.: Neural architecture search: a survey. J. Mach. Learn. Res. 20(1), 1997–2017 (2019)
Chu, X., Zhou, T., Zhang, B., Li, J.: Fair DARTS: eliminating unfair advantages in differentiable architecture search. In: 16th European Conference on Computer Vision (2020). https://arxiv.org/abs/1911.12126.pdf
Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: International Conference on Machine Learning, PMLR (2017)
Dong, X., Liu, L., Musial, K., Gabrys, B.: NATS-bench: benchmarking NAS algorithms for architecture topology and size. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 44(7), 3634–3646 (2021)
Chrabaszcz, P., Loshchilov, I., Hutter, F.: A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819 (2017)
Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)
LeCun, Y.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998)
Guilin, L., Xing, Z., Zitong, W., Zhenguo, L., Tong, Z.: StacNAS: towards stable and consistent optimization for differentiable neural architecture search (2019)
Mao, Y., Zhong, G., Wang, Y., Deng, Z.: Differentiable light-weight architecture search. In: 2021 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2021)
Zhong, Z., et al.: Random erasing data augmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07 (2020)
Rajasegaran, J., Jayasundara, V., Jayasekara, S., Jayasekara, H., Seneviratne, S., Rodrigo, R.: DeepCaps: going deeper with capsule networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10725–10733 (2019)
Nøkland, A., Eidnes, L.H.: Training neural networks with local error signals. In: International Conference on Machine Learning, pp. 4839–4850. PMLR (2019)
Cohen, G., Afshar, S., Tapson, J., Van Schaik, A.: EMNIST: extending MNIST to handwritten letters. In: 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. IEEE (2017)
Jeevan, P., Sethi, A.: WaveMix: resource-efficient token mixing for images. arXiv preprint arXiv:2203.03689 (2022)
Kabir, H., et al.: SpinalNet: deep neural network with gradual input. arXiv preprint arXiv:2007.03347 (2020)
Jayasundara, V., Jayasekara, S., Jayasekara, H., Rajasegaran, J., Seneviratne, S., Rodrigo, R.: TextCaps: handwritten character recognition with very small datasets. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 254–262. IEEE (2019)
Appendices
Appendix A: Fashion-MNIST Experiment Settings
In the searching stage, we use half of the images in the train set for training and the other half for validation. The data augmentation for training images is as follows: first, a random crop (RandomCrop()) to a height and width of 32 with a padding of 4; second, a random horizontal flip (RandomHorizontalFlip()); third, conversion of the inputs into tensors with image values normalized to [0, 1] (ToTensor()); last, a further normalization (Normalize()) with the mean and the standard deviation of the Fashion-MNIST training images. Validation images only go through the last two steps (ToTensor() and Normalize()).
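A minimal torchvision sketch of this pipeline, assuming the commonly quoted Fashion-MNIST train-set mean of 0.2860 and standard deviation of 0.3530 (the exact values are not listed in the paper), might look as follows:

```python
from torchvision import transforms

# Assumed per-channel statistics of the Fashion-MNIST train set.
FMNIST_MEAN, FMNIST_STD = (0.2860,), (0.3530,)

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),           # 32x32 crop after padding by 4
    transforms.RandomHorizontalFlip(),              # random horizontal flip
    transforms.ToTensor(),                          # to tensor, values in [0, 1]
    transforms.Normalize(FMNIST_MEAN, FMNIST_STD),  # standardize with train statistics
])

valid_transform = transforms.Compose([              # validation: last two steps only
    transforms.ToTensor(),
    transforms.Normalize(FMNIST_MEAN, FMNIST_STD),
])
```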
Hyperparameter settings of the searching stage are as follows. The batch size is 32 (128 for PC-DARTS, since only \(\frac{1}{4}\) of the channels go through the mixed operation), and the batch increase is 8. An SGD optimizer is used for the network weights with an initial learning rate of 0.025, a minimum learning rate of 0.001, a momentum of 0.9, and a weight decay of 0.0003; a cosine annealing scheduler handles the learning rate decay.
An Adam optimizer is used for the architecture parameters with a learning rate of 0.0003, a weight decay of 0.001, and beta values of 0.5 and 0.999. Besides, gradient clipping is applied with a max norm of 5. The supernet has 16 initial channels and 8 cells. For SGAS-es, we set df, w, k, and T to 15, 3, 5, and 1.3, respectively. For the other DARTS-based methods, the number of search epochs is set to 50.
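As a rough illustration, this optimization setup can be reproduced with standard PyTorch utilities; `model` and `alphas` below are hypothetical stand-ins for the supernet weights and the architecture parameters, not the authors' actual code:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the supernet and its architecture parameters.
model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
alphas = [torch.zeros(14, 8, requires_grad=True)]  # shape is illustrative only

# SGD for network weights, with cosine annealing down to the minimum learning rate.
w_optimizer = torch.optim.SGD(model.parameters(), lr=0.025,
                              momentum=0.9, weight_decay=3e-4)
w_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    w_optimizer, T_max=50, eta_min=0.001)

# Adam for architecture parameters.
a_optimizer = torch.optim.Adam(alphas, lr=3e-4,
                               betas=(0.5, 0.999), weight_decay=1e-3)

# After each backward pass on the network weights, clip gradients to max norm 5:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
```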
In the retraining stage, we use the whole train set to train the derived network for 600 epochs. The data augmentation for the train set and the test set is the same as that for the train set and the validation set in the searching stage, except that two more tricks are added for the train set: a cutout with a length of 16 and random erasing, as sketched below.
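torchvision ships RandomErasing, but cutout is not built in; a minimal sketch of both tricks, applied after ToTensor() and Normalize(), could be:

```python
import torch
from torchvision import transforms

class Cutout:
    """Zero out one random square patch of side `length` in a CHW image tensor."""
    def __init__(self, length):
        self.length = length

    def __call__(self, img):
        _, h, w = img.shape
        cy = torch.randint(h, (1,)).item()  # random patch center
        cx = torch.randint(w, (1,)).item()
        y1, y2 = max(0, cy - self.length // 2), min(h, cy + self.length // 2)
        x1, x2 = max(0, cx - self.length // 2), min(w, cx + self.length // 2)
        img[:, y1:y2, x1:x2] = 0.0
        return img

extra_train_tricks = transforms.Compose([
    Cutout(length=16),           # cutout with a length of 16
    transforms.RandomErasing(),  # random erase with torchvision defaults
])
```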
Hyperparameter settings of the retraining stage are as follows. The batch size is 72. The SGD optimizer has an initial learning rate of 0.025, a momentum of 0.9, and a weight decay of 0.0003. Besides the weight decay, drop path is used for regularization with a drop path probability of 0.2. The cosine annealing scheduler is again used for learning rate decay. The initial channel size is 36, and the number of cells is 20. An auxiliary loss is used with an auxiliary weight of 0.4.
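For the regularization side, a generic drop path (stochastic depth) implementation along the lines commonly used in DARTS retraining code, not necessarily the authors' exact version, looks like this:

```python
import torch

def drop_path(x, drop_prob=0.2, training=True):
    """Randomly drop an entire residual branch per sample, rescaling survivors."""
    if not training or drop_prob <= 0.0:
        return x
    keep_prob = 1.0 - drop_prob
    # One Bernoulli mask per sample, broadcast over channels and spatial dims.
    mask = torch.bernoulli(torch.full((x.size(0), 1, 1, 1), keep_prob,
                                      device=x.device))
    return x / keep_prob * mask

# The auxiliary head contributes to the total loss with weight 0.4:
# loss = criterion(logits, target) + 0.4 * criterion(logits_aux, target)
```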
The other DARTS-based methods (such as SGAS and PC-DARTS) and the original DARTS in Table 2 follow the same settings, which are close to those reported in their respective papers.
Appendix B: EMNIST-Balanced Experiment Settings
Most experiment settings are the same as those of the retraining stage of Fashion-MNIST, with the following differences: we use the whole train set to train the derived network for 200 epochs, and the batch size is 96. Each training image is first resized (Resize()) to 32\(\,\times \,\)32; second, a random affine transformation (RandomAffine()) is applied with degrees of (-30, 30), a translate of (0.1, 0.1), a scale of (0.8, 1.2), and a shear of (-30, 30); third, the inputs are transformed into tensors with values normalized between 0 and 1 (ToTensor()); last, the inputs are further normalized (Normalize()) with a mean and a standard deviation of 0.5. Test images only go through ToTensor() and Normalize().
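A minimal torchvision sketch of this EMNIST-Balanced pipeline might look as follows (the test pipeline reuses the same normalization):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((32, 32)),                # resize to 32x32
    transforms.RandomAffine(degrees=(-30, 30),  # rotation range
                            translate=(0.1, 0.1),
                            scale=(0.8, 1.2),
                            shear=(-30, 30)),
    transforms.ToTensor(),                      # values scaled to [0, 1]
    transforms.Normalize((0.5,), (0.5,)),       # mean and std both 0.5
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])
```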