Abstract
In recent years, convolutional neural networks (CNNs) have proven effective in many challenging computer vision tasks, including small object classification. However, the recent literature on this task relies mainly on 2D CNNs, and the small size of object instances makes recognition difficult. 3D CNNs, for their part, are tedious and time-consuming to train, which makes them impractical when a trade-off between accuracy and efficiency is required. Meanwhile, following the great success of Transformers in natural language processing (NLP), the spatial Transformer has emerged as a robust feature transformer and has recently been applied successfully to computer vision tasks, including image classification. Attention mechanisms allow Transformers to learn contextual encodings of input patches and achieve excellent performance across many NLP and computer vision tasks, but their computational cost generally grows with the dimension of the input feature space. In this paper, we propose a novel hybrid 2D/3D CNN-Transformer based on transfer learning and attention mechanisms for better performance on low-resolution datasets. First, combining a pre-trained deep 2D CNN with a 3D CNN significantly reduces complexity while yielding an accurate learning algorithm. Second, the pre-trained deep CNN serves as a robust feature extractor and is combined with a spatial Transformer to improve the representational power of the model and exploit the strong global modeling capability of Transformers. Finally, spatial attention and channel attention are adaptively fused, attending to all components of the input space to capture local and global spatial correlations over non-overlapping regions of the input representation. Experimental results show that the proposed framework performs strongly in terms of both efficiency and accuracy.
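To make the pipeline concrete, the sketch below wires the three components together in PyTorch: a frozen pre-trained 2D backbone as feature extractor, adaptive channel/spatial attention fusion, a shallow 3D convolution, and a small Transformer encoder over non-overlapping spatial tokens. This is a minimal illustration under stated assumptions, not the paper's implementation: the MobileNetV2 backbone, CBAM-style attention, layer sizes, token construction, and the 43-class output (GTSRB's class count) are all illustrative choices.

```python
# Minimal sketch of a hybrid 2D/3D CNN-Transformer with attention fusion.
# Assumptions (not from the paper): MobileNetV2 backbone, CBAM-style
# attention, layer sizes, and tokenization of the backbone's 7x7 grid.
import torch
import torch.nn as nn
from torchvision import models


class SpatialChannelAttention(nn.Module):
    """Adaptively fuses channel attention and spatial attention (CBAM-style)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze spatial dims, re-weight each channel.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention: a 7x7 conv over pooled channel statistics.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_mlp(x)  # channel attention first
        stats = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.spatial_conv(stats)  # then spatial attention


class HybridCNNTransformer(nn.Module):
    """Frozen 2D backbone -> attention fusion -> light 3D conv -> Transformer."""

    def __init__(self, num_classes: int = 43, d_model: int = 256):
        super().__init__()
        # Transfer learning: frozen ImageNet backbone as feature extractor.
        self.backbone = models.mobilenet_v2(weights="IMAGENET1K_V1").features
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.attention = SpatialChannelAttention(1280)
        # Shallow 3D conv treating the channel axis as depth; kept small so
        # the 3D component does not dominate training cost.
        self.conv3d = nn.Conv3d(1, 4, kernel_size=3, padding=1)
        self.project = nn.LazyLinear(d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.attention(self.backbone(x))        # (B, 1280, 7, 7)
        f = self.conv3d(f.unsqueeze(1))             # (B, 4, 1280, 7, 7)
        # Each of the 49 non-overlapping grid cells becomes one token.
        tokens = f.flatten(1, 2).flatten(2).transpose(1, 2)  # (B, 49, 5120)
        z = self.encoder(self.project(tokens))      # global modeling
        return self.head(z.mean(dim=1))             # (B, num_classes)


model = HybridCNNTransformer(num_classes=43)        # 43 = GTSRB class count
logits = model(torch.randn(2, 3, 224, 224))         # -> torch.Size([2, 43])
```

Here the backbone's grid cells stand in for the paper's non-overlapping regions; the published model's tokenization and 3D design may differ.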
Data availability
The dataset used and analyzed in the current study is available in the GTSRB repository (https://benchmark.ini.rub.de/gtsrb_news.html).
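Recent torchvision releases (0.12 and later) ship a loader for this benchmark, so a minimal way to obtain the data programmatically looks like the sketch below; the resize target and root directory are arbitrary choices matching the model sketch above.

```python
# Loading GTSRB via torchvision's built-in dataset class (>= 0.12).
import torchvision.transforms as T
from torchvision.datasets import GTSRB

transform = T.Compose([T.Resize((224, 224)), T.ToTensor()])
train_set = GTSRB(root="data", split="train", transform=transform, download=True)
test_set = GTSRB(root="data", split="test", transform=transform, download=True)
print(len(train_set), len(test_set))  # split sizes
```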
Funding
This study did not receive external funding.
Author information
Contributions
All authors were involved in the conception and design of this study. Documentation, data collection, and analysis were performed by KB. The first draft of the manuscript was written by KB. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
Not applicable.
About this article
Cite this article
Bayoudh, K., Mtibaa, A. Hybrid-CT: a novel hybrid 2D/3D CNN-Transformer based on transfer learning and attention mechanisms for small object classification. SIViP 19, 133 (2025). https://doi.org/10.1007/s11760-024-03696-y