Hybrid-CT: a novel hybrid 2D/3D CNN-Transformer based on transfer learning and attention mechanisms for small object classification

  • Original Paper
Signal, Image and Video Processing

Abstract

In recent years, convolutional neural networks (CNNs) have proven effective in many challenging computer vision tasks, including small object classification. However, according to recent literature, this task mainly relies on 2D CNNs, and the small size of object instances makes their recognition difficult. 3D CNNs, in turn, are expensive and time-consuming to train, which limits their use when a trade-off between accuracy and efficiency is required. Meanwhile, following the great success of Transformers in natural language processing (NLP), a spatial Transformer can serve as a robust feature transformer and has recently been applied successfully to computer vision tasks, including image classification. Incorporating attention mechanisms into Transformers helps learn contextual encodings of the input patches and yields excellent performance on many NLP and vision tasks; however, the complexity of these mechanisms generally grows with the dimension of the input feature space. In this paper, we propose a novel hybrid 2D/3D CNN-Transformer based on transfer learning and attention mechanisms for better performance on a low-resolution dataset. First, combining a pre-trained deep CNN with a 3D CNN significantly reduces complexity and yields an accurate learning algorithm. Second, the pre-trained deep CNN is used as a robust feature extractor and combined with a spatial Transformer to improve the representational power of the developed model and exploit the powerful global modeling capability of Transformers. Finally, spatial attention and channel attention are adaptively fused, attending to all components of the input space to capture local and global spatial correlations over non-overlapping regions of the input representation. Experimental results show that the proposed framework achieves significant gains in both efficiency and accuracy.
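The adaptive fusion of channel and spatial attention described in the abstract can be sketched as follows. This is a minimal NumPy illustration in the spirit of CBAM-style modules, not the authors' implementation: the pooling choices, the sigmoid gating, and the channel-then-spatial ordering are assumptions, and the learned layers (shared MLP for the channel branch, convolution for the spatial branch) are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    """Per-channel gate from global average- and max-pooled descriptors.
    feat: (C, H, W) feature map. The shared MLP is omitted for brevity."""
    avg = feat.mean(axis=(1, 2))           # (C,)
    mx = feat.max(axis=(1, 2))             # (C,)
    gate = sigmoid(avg + mx)               # (C,), values in (0, 1)
    return gate[:, None, None]             # broadcastable to (C, H, W)

def spatial_attention(feat):
    """Per-location gate from channel-pooled maps.
    feat: (C, H, W). The conv over [avg; max] is omitted for brevity."""
    avg = feat.mean(axis=0)                # (H, W)
    mx = feat.max(axis=0)                  # (H, W)
    gate = sigmoid(avg + mx)               # (H, W), values in (0, 1)
    return gate[None, :, :]                # broadcastable to (C, H, W)

def fused_attention(feat):
    """Apply the channel gate, then the spatial gate, to the feature map."""
    feat = feat * channel_attention(feat)
    feat = feat * spatial_attention(feat)
    return feat

# toy example: 4 channels on an 8x8 grid
x = np.random.default_rng(0).standard_normal((4, 8, 8))
y = fused_attention(x)
assert y.shape == x.shape
```

Because both gates lie in (0, 1), the fused map keeps the shape of the input while re-weighting each channel and each spatial location; in the full model these gates would be produced by learned layers rather than raw pooled statistics.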


Data availability

The dataset used and analyzed in the current study is available in the GTSRB repository (https://benchmark.ini.rub.de/gtsrb_news.html).


Funding

This study did not receive external funding.

Author information

Contributions

All authors were involved in the conception and design of this study. Documentation, data collection, and analysis were performed by [KB]. The first draft of the manuscript was written by [KB]. The final manuscript was read and approved by all authors.

Corresponding author

Correspondence to Khaled Bayoudh.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Bayoudh, K., Mtibaa, A. Hybrid-CT: a novel hybrid 2D/3D CNN-Transformer based on transfer learning and attention mechanisms for small object classification. SIViP 19, 133 (2025). https://doi.org/10.1007/s11760-024-03696-y
