Abstract
In recent years, convolutional neural networks (CNNs) have proven effective in many challenging computer vision tasks, including small object classification. However, the recent literature on this task relies mainly on 2D CNNs, and the small size of object instances makes recognition difficult. 3D CNNs, for their part, are tedious and time-consuming to train, which makes them impractical when a trade-off between accuracy and efficiency is required. Meanwhile, following the great success of Transformers in natural language processing (NLP), the spatial Transformer has emerged as a robust feature transformer and has recently been applied successfully to computer vision tasks, including image classification. Attention mechanisms allow Transformers to learn contextual encodings of input patches and achieve excellent performance across many NLP and computer vision tasks, but their computational cost generally grows with the dimension of the input feature space. In this paper, we propose a novel hybrid 2D/3D CNN-Transformer based on transfer learning and attention mechanisms for better performance on low-resolution datasets. First, combining a pre-trained deep 2D CNN with a 3D CNN significantly reduces complexity while yielding an accurate learning algorithm. Second, the pre-trained deep CNN serves as a robust feature extractor and is combined with a spatial Transformer to improve the representational power of the model and exploit the strong global modeling capability of Transformers. Finally, spatial attention and channel attention are adaptively fused, attending to all components of the input space to capture local and global spatial correlations over non-overlapping regions of the input representation. Experimental results show that the proposed framework performs strongly in terms of both efficiency and accuracy.
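To make the pipeline concrete, the sketch below wires the three components together in PyTorch: a frozen pre-trained 2D backbone as feature extractor, adaptive channel/spatial attention fusion, a shallow 3D convolution, and a small Transformer encoder over non-overlapping spatial tokens. This is a minimal illustration under stated assumptions, not the paper's implementation: the MobileNetV2 backbone, CBAM-style attention, layer sizes, token construction, and the 43-class output (GTSRB's class count) are all illustrative choices.

```python
# Minimal sketch of a hybrid 2D/3D CNN-Transformer with attention fusion.
# Assumptions (not from the paper): MobileNetV2 backbone, CBAM-style
# attention, layer sizes, and tokenization of the backbone's 7x7 grid.
import torch
import torch.nn as nn
from torchvision import models


class SpatialChannelAttention(nn.Module):
    """Adaptively fuses channel attention and spatial attention (CBAM-style)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze spatial dims, re-weight each channel.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention: a 7x7 conv over pooled channel statistics.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_mlp(x)  # channel attention first
        stats = torch.cat([x.mean(dim=1, keepdim=True),
                           x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.spatial_conv(stats)  # then spatial attention


class HybridCNNTransformer(nn.Module):
    """Frozen 2D backbone -> attention fusion -> light 3D conv -> Transformer."""

    def __init__(self, num_classes: int = 43, d_model: int = 256):
        super().__init__()
        # Transfer learning: frozen ImageNet backbone as feature extractor.
        self.backbone = models.mobilenet_v2(weights="IMAGENET1K_V1").features
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.attention = SpatialChannelAttention(1280)
        # Shallow 3D conv treating the channel axis as depth; kept small so
        # the 3D component does not dominate training cost.
        self.conv3d = nn.Conv3d(1, 4, kernel_size=3, padding=1)
        self.project = nn.LazyLinear(d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.attention(self.backbone(x))        # (B, 1280, 7, 7)
        f = self.conv3d(f.unsqueeze(1))             # (B, 4, 1280, 7, 7)
        # Each of the 49 non-overlapping grid cells becomes one token.
        tokens = f.flatten(1, 2).flatten(2).transpose(1, 2)  # (B, 49, 5120)
        z = self.encoder(self.project(tokens))      # global modeling
        return self.head(z.mean(dim=1))             # (B, num_classes)


model = HybridCNNTransformer(num_classes=43)        # 43 = GTSRB class count
logits = model(torch.randn(2, 3, 224, 224))         # -> torch.Size([2, 43])
```

Here the backbone's grid cells stand in for the paper's non-overlapping regions; the published model's tokenization and 3D design may differ.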
Data availability
The dataset used and analyzed in the current study is available in the GTSRB repository (https://benchmark.ini.rub.de/gtsrb_news.html).
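Recent torchvision releases (0.12 and later) ship a loader for this benchmark, so a minimal way to obtain the data programmatically looks like the sketch below; the resize target and root directory are arbitrary choices matching the model sketch above.

```python
# Loading GTSRB via torchvision's built-in dataset class (>= 0.12).
import torchvision.transforms as T
from torchvision.datasets import GTSRB

transform = T.Compose([T.Resize((224, 224)), T.ToTensor()])
train_set = GTSRB(root="data", split="train", transform=transform, download=True)
test_set = GTSRB(root="data", split="test", transform=transform, download=True)
print(len(train_set), len(test_set))  # split sizes
```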
Funding
This study did not receive external funding.
Author information
Contributions
All authors were involved in the conception and design of this study. Documentation, data collection, and analysis were performed by KB. The first draft of the manuscript was written by KB. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
Not applicable.
About this article
Cite this article
Bayoudh, K., Mtibaa, A. Hybrid-CT: a novel hybrid 2D/3D CNN-Transformer based on transfer learning and attention mechanisms for small object classification. SIViP 19, 133 (2025). https://doi.org/10.1007/s11760-024-03696-y