Abstract
Fine-grained image recognition aims to distinguish images with subtle differences and identify the sub-categories to which they belong. Recently, the vision transformer (ViT) has achieved promising results in many computer vision tasks. In this paper, we introduce human observation behavior into ViT and propose a novel transformer-based network named ZoomViT. We divide fine-grained recognition into two steps: "look closer" and "contrast." First, looking closer means observing finer local regions and multi-scale features while avoiding the adverse effect of the background on recognition. We design a zoom-in module that tracks the attention flow by integrating the attention weights to zoom in on the discriminative foreground regions. However, the straightforward image splitting used in ViT may harm recognition. We therefore design a zoom-out module that combines overlapping cutting and downsampling to preserve the integrity of local neighboring structures while keeping the model efficient. Finally, we propose to contrast the features of known sub-categories so that the model is supervised to learn the subtle differences among them. Since the consistency of features extracted from different batches increases as training progresses, we propose a variable-length queue that stores features from different batches to conduct contrastive learning efficiently and thoroughly. We experimentally demonstrate the state-of-the-art performance of our model on four popular fine-grained benchmarks: CUB-200-2011, Stanford Dogs, NABirds, and iNat2017.
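To make the cross-batch contrastive component more concrete, the following is a minimal sketch, not the authors' implementation, of a variable-length feature queue that grows over training and supplies stored features for a same-class/different-class contrast. The class name `FeatureQueue`, the linear capacity schedule (`growth_per_epoch`), and the margin-based cosine loss are illustrative assumptions; the paper's actual queue schedule and loss may differ.

```python
# Illustrative sketch only (assumed names and schedule, not ZoomViT's code):
# a variable-length queue for cross-batch contrastive learning.
import torch
import torch.nn.functional as F
from collections import deque


class FeatureQueue:
    """Stores (feature, label) pairs from past batches.

    Capacity grows with the epoch, reflecting the idea that features from
    different batches become more consistent as training progresses, so
    older features remain useful for longer.
    """

    def __init__(self, base_capacity=256, growth_per_epoch=64):
        self.base_capacity = base_capacity
        self.growth_per_epoch = growth_per_epoch
        self.queue = deque(maxlen=base_capacity)

    def set_epoch(self, epoch):
        # Variable-length behaviour: enlarge the queue as training stabilises.
        new_capacity = self.base_capacity + epoch * self.growth_per_epoch
        self.queue = deque(self.queue, maxlen=new_capacity)

    @torch.no_grad()
    def enqueue(self, features, labels):
        # Detach so stored features do not keep old computation graphs alive.
        for f, y in zip(features.detach(), labels):
            self.queue.append((f, int(y)))

    def tensors(self):
        if not self.queue:
            return None, None
        feats = torch.stack([f for f, _ in self.queue])
        labels = torch.tensor([y for _, y in self.queue])
        return feats, labels


def contrastive_loss(features, labels, queue, margin=0.4):
    """Pulls same-class features together, pushes different-class ones apart."""
    mem_feats, mem_labels = queue.tensors()
    if mem_feats is None:
        return features.new_zeros(())
    mem_feats = mem_feats.to(features.device)
    mem_labels = mem_labels.to(labels.device)
    # Pairwise cosine similarity between current-batch and queued features.
    sim = F.cosine_similarity(features.unsqueeze(1), mem_feats.unsqueeze(0), dim=-1)
    pos_mask = labels.unsqueeze(1) == mem_labels.unsqueeze(0)
    pos_loss = (1.0 - sim[pos_mask]).mean() if pos_mask.any() else sim.new_zeros(())
    neg_loss = F.relu(sim[~pos_mask] - margin).mean() if (~pos_mask).any() else sim.new_zeros(())
    return pos_loss + neg_loss
```

In a typical training loop one would call `queue.set_epoch(epoch)` once per epoch, compute `contrastive_loss(batch_features, batch_labels, queue)` alongside the classification loss, and then `queue.enqueue(batch_features, batch_labels)`.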







Data availability
The article uses publicly available datasets; all of the datasets are available on the relevant websites.
Acknowledgements
This work is supported by the Shandong Provincial Natural Science Foundation (ZR2023MF033) and the National Natural Science Foundation of China (No. 61872326). This work received GPU computation support from the Center for High Performance Computing and System Simulation, Qingdao National Laboratory for Marine Science and Technology.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ma, Z., Yang, Y., Wang, H. et al. ZoomViT: an observation behavior-based fine-grained recognition scheme. Neural Comput & Applic 36, 12775–12789 (2024). https://doi.org/10.1007/s00521-024-09961-y