A hierarchical and data-efficient network based on patch-based representation

  • Original Paper
  • Published in Signal, Image and Video Processing (2023)

Abstract

With the rise of Transformers in computer vision, many researchers believe that Transformer-based models will become the standard for a wide range of vision tasks. However, because of their unprecedented scale and appetite for training data, Transformer-based models are difficult to use for researchers with limited data and computing resources. Recently, ConvMixer was proposed to show that the strong performance of Transformer-based models stems from their patch-based representation. Although ConvMixer performs well on image classification, its isotropic architecture is inefficient and poorly suited to other vision tasks. This paper proposes a hierarchical and data-efficient network based on patch-based representation, which we call HEConvMixer. Unlike the original Transformer-based models, we replace the Transformer blocks with simple convolutional blocks and add two downsampling layers to the network. We trained the network from scratch on small datasets using a single GPU. Empirically, HEConvMixer trained on CIFAR-10 for 200 epochs, with no extra data, achieves \(97.07\%\) accuracy, outperforming previous Transformer-based models and ConvNets.
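
To make the abstract's description concrete, the following is a minimal PyTorch sketch of an HEConvMixer-style network: a convolutional patch-embedding stem, simple ConvMixer-like blocks (depthwise spatial mixing plus pointwise channel mixing) in place of Transformer blocks, and two downsampling layers that give the network its hierarchical structure. All widths, depths, patch and kernel sizes here are illustrative assumptions, not the authors' exact configuration; the actual implementation is in the repository linked under Code availability below.

```python
# A minimal sketch of an HEConvMixer-style network (assumed hyperparameters).
import torch
import torch.nn as nn


class ConvMixerBlock(nn.Module):
    """Depthwise conv for spatial mixing, pointwise conv for channel mixing."""

    def __init__(self, dim: int, kernel_size: int = 5):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        )
        self.channel = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.spatial(x)  # residual connection around spatial mixing
        return self.channel(x)


def stage(dim: int, depth: int) -> nn.Sequential:
    return nn.Sequential(*[ConvMixerBlock(dim) for _ in range(depth)])


class HEConvMixerSketch(nn.Module):
    def __init__(self, num_classes: int = 10, dims=(96, 192, 384), depths=(2, 2, 6)):
        super().__init__()
        # Patch embedding: a strided convolution splits the image into patches.
        self.stem = nn.Sequential(
            nn.Conv2d(3, dims[0], kernel_size=2, stride=2),
            nn.GELU(),
            nn.BatchNorm2d(dims[0]),
        )
        # Three stages separated by the two downsampling layers mentioned in
        # the abstract; each downsample halves the resolution and widens the
        # channels, producing a hierarchical feature pyramid.
        self.stage1 = stage(dims[0], depths[0])
        self.down1 = nn.Conv2d(dims[0], dims[1], kernel_size=2, stride=2)
        self.stage2 = stage(dims[1], depths[1])
        self.down2 = nn.Conv2d(dims[1], dims[2], kernel_size=2, stride=2)
        self.stage3 = stage(dims[2], depths[2])
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(dims[2], num_classes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stage1(self.stem(x))
        x = self.stage2(self.down1(x))
        x = self.stage3(self.down2(x))
        return self.head(x)


if __name__ == "__main__":
    model = HEConvMixerSketch(num_classes=10)   # CIFAR-10 has 10 classes
    logits = model(torch.randn(2, 3, 32, 32))   # CIFAR-10-sized input
    print(logits.shape)                         # torch.Size([2, 10])
```

Unlike the isotropic ConvMixer, the two strided convolutions shrink the spatial resolution while growing the channel width, so later blocks operate on cheaper, coarser feature maps; this is what makes the hierarchical design more efficient than running every block at full resolution.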


Data availability

All data generated or analyzed during this study are included in this published article; the datasets used are openly available online.

Code availability

https://github.com/ygdr2020/HEConvMixer.


Acknowledgements

We thank many colleagues at our laboratory for their help.

Funding

No funding.

Author information

Corresponding author

Correspondence to Yunxue Shao.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest or competing interests to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yao, D., Shao, Y. A hierarchical and data-efficient network based on patch-based representation. SIViP 17, 2713–2719 (2023). https://doi.org/10.1007/s11760-023-02488-0

