
Recombining Vision Transformer Architecture for Fine-Grained Visual Categorization

  • Conference paper
MultiMedia Modeling (MMM 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13834)


Abstract

Fine-grained visual categorization (FGVC) is a challenging task in the image analysis field that requires comprehensive extraction and representation of discriminative features. To address this problem, previous works focus on designing complex modules, the so-called necks and heads, on top of simple backbones, at the cost of a heavy computational burden. In this paper, we bring a new insight: the Vision Transformer itself is an all-in-one FGVC framework consisting of a basic Backbone for feature extraction, a Neck for further feature enhancement, and a Head for selecting discriminative features. We delve into the feature extraction and representation pattern of ViT for FGVC and empirically show that simply recombining the original ViT structure to leverage multi-level semantic representation, without introducing any additional parameters, achieves higher performance. Guided by this insight, we propose RecViT, a simple recombination and modification of the original ViT that captures multi-level semantic features and facilitates fine-grained recognition. In RecViT, the deep layers of the original ViT serve as the Head, a few middle layers as the Neck, and the shallow layers as the Backbone. In addition, we adopt an optional Feature Processing Module to enhance discriminative feature representation at each semantic level and align them for final recognition. With the above simple modifications, RecViT obtains significant accuracy improvements on the FGVC benchmarks CUB-200-2011, Stanford Cars, and Stanford Dogs.
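The recombination described in the abstract can be sketched as a depth-wise partition of a plain ViT's transformer blocks into the three roles. The block count and split points below are illustrative assumptions for a ViT-Base-style model, not the paper's exact configuration:

```python
# Hypothetical sketch of RecViT's recombination idea: the blocks of a
# plain ViT are split by depth into Backbone / Neck / Head roles.
# num_blocks, neck_size and head_size are assumed values, not the
# configuration reported in the paper.

def recombine_vit(num_blocks=12, neck_size=2, head_size=2):
    """Partition the block indices of a plain ViT into three roles."""
    n_backbone = num_blocks - neck_size - head_size
    backbone = list(range(0, n_backbone))                 # shallow: feature extraction
    neck = list(range(n_backbone, n_backbone + neck_size))  # middle: feature enhancement
    head = list(range(n_backbone + neck_size, num_blocks))  # deep: feature selection
    return backbone, neck, head

backbone, neck, head = recombine_vit()
# For a 12-block ViT this yields blocks 0-7 as Backbone, 8-9 as Neck,
# and 10-11 as Head; no new parameters are introduced, only the roles
# of the existing layers change.
```

The point of the sketch is that the split reuses the existing layers unchanged, which is why the paper can claim improvement "without introducing any other parameters"; the optional Feature Processing Module would sit on top of the multi-level outputs collected from the Backbone and Neck stages.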



Acknowledgments

This work is supported by the National Natural Science Foundation of China (62272436), the China Postdoctoral Science Foundation (2021M703081), the Fundamental Research Funds for the Central Universities (WK2100000026), and the Anhui Provincial Natural Science Foundation (2208085QF190).

Author information


Corresponding author

Correspondence to Zhiying Lu.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Deng, X., Liu, C., Lu, Z. (2023). Recombining Vision Transformer Architecture for Fine-Grained Visual Categorization. In: Dang-Nguyen, DT., et al. MultiMedia Modeling. MMM 2023. Lecture Notes in Computer Science, vol 13834. Springer, Cham. https://doi.org/10.1007/978-3-031-27818-1_11

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-27818-1_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-27817-4

  • Online ISBN: 978-3-031-27818-1

  • eBook Packages: Computer Science (R0)
