ZoomViT: an observation behavior-based fine-grained recognition scheme

  • Review
  • Neural Computing and Applications

Abstract

Fine-grained image recognition aims to distinguish images that differ only subtly and to identify the sub-category to which each belongs. Recently, the vision transformer (ViT) has achieved promising results in many computer vision tasks. In this paper, we introduce human observation behavior into ViT and propose a novel transformer-based network named ZoomViT. We divide fine-grained recognition into two steps: "look closer" and "contrast." First, looking closer means observing finer local regions and multi-scale features while avoiding the adverse effect of the background on recognition. We design a zoom-in module that tracks the attention flow by integrating the attention weights, so as to zoom in on the discriminative foreground regions. In addition, splitting the image directly into patches, as ViT does, may harm recognition. We therefore design a zoom-out module that combines overlapping cutting with downsampling to preserve the integrity of local neighboring structures while keeping the model efficient. Finally, we propose contrasting the features of known sub-categories to supervise the model in learning the subtle differences among sub-categories. Because the consistency of features extracted from different batches increases over time, we propose a variable-length queue that stores features from different batches so that contrastive learning can be conducted efficiently and fully. We experimentally demonstrate the state-of-the-art performance of our model on four popular fine-grained benchmarks: CUB-200-2011, Stanford Dogs, NABirds, and iNat2017.
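The two "look closer" ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the code below assumes attention rollout (a common way of integrating attention weights across layers) as a stand-in for the zoom-in module, a mean-threshold bounding box for selecting the foreground, and a plain sliding window for the overlapping cutting. All function names, the 448-pixel input size, and the stride of 12 are hypothetical choices made for illustration only.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def patches_per_side(image_size: int, patch_size: int, stride: int) -> int:
    """Number of sliding-window patches along one image side (no padding)."""
    if stride <= 0 or patch_size > image_size:
        raise ValueError("need stride > 0 and patch_size <= image_size")
    return (image_size - patch_size) // stride + 1


def patch_origins(image_size: int, patch_size: int, stride: int) -> List[Tuple[int, int]]:
    """Top-left (row, col) of every patch; stride < patch_size gives overlap."""
    n = patches_per_side(image_size, patch_size, stride)
    return [(r * stride, c * stride) for r in range(n) for c in range(n)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_rollout(layers: List[Matrix]) -> Matrix:
    """Assumed stand-in for 'integrating the attention weights'.

    Each layer's (head-averaged) attention map is mixed with the identity to
    account for residual connections, row-normalised, and multiplied onto the
    running product, from the first layer to the last.
    """
    n = len(layers[0])
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
        mixed = [[v / sum(row) for v in row] for row in mixed]
        result = matmul(mixed, result)
    return result


def foreground_box(patch_scores: List[float], grid: int) -> Tuple[int, int, int, int]:
    """Inclusive box (r0, c0, r1, c1) around patches scoring above the mean.

    The mean threshold is an assumption; falls back to the whole grid when
    no patch exceeds it (for example, when all scores are equal).
    """
    mean = sum(patch_scores) / len(patch_scores)
    hits = [(i // grid, i % grid) for i, s in enumerate(patch_scores) if s > mean]
    if not hits:
        return 0, 0, grid - 1, grid - 1
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return min(rows), min(cols), max(rows), max(cols)
```

Under these assumptions, a 448-pixel side with 16-pixel patches yields 28 patches per side at stride 16 (no overlap) and 37 at stride 12, which shows why overlapping cutting lengthens the token sequence and why the abstract pairs it with downsampling. If token 0 is the class token, `attention_rollout(layers)[0][1:]` gives the per-patch scores that `foreground_box` expects.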

Data availability

This article uses publicly available datasets, all of which can be obtained from their respective websites.

Acknowledgements

This work was supported by the Shandong Provincial Natural Science Foundation (ZR2023MF033) and the National Natural Science Foundation of China (No. 61872326). GPU computing support was provided by the Center for High Performance Computing and System Simulation, Qingdao National Laboratory for Marine Science and Technology.

Author information

Corresponding authors

Correspondence to Yongquan Yang or Lei Huang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ma, Z., Yang, Y., Wang, H. et al. ZoomViT: an observation behavior-based fine-grained recognition scheme. Neural Comput & Applic 36, 12775–12789 (2024). https://doi.org/10.1007/s00521-024-09961-y

  • DOI: https://doi.org/10.1007/s00521-024-09961-y
