research-article · DOI: 10.1145/3651671.3651752

Convolutionally Enhanced Feature Fusion Visual Transformer for Fine-Grained Visual Classification

Published: 07 June 2024

Abstract

Fine-grained image classification is an active research topic in computer vision and pattern recognition whose goal is to recognize and classify subcategories of objects within images at a fine-grained level. In recent years, the Transformer's self-attention mechanism has been increasingly adopted for fine-grained classification tasks because it naturally attends to the most discriminative regions of an object. This paper proposes a new Convolutionally Enhanced Feature Fusion Visual Transformer, which extends the Feature Fusion Visual Transformer with convolutional operations. First, instead of tokenizing the input image directly into patches, patches are extracted from convolutionally generated low-level features. Second, spatial-reduction attention lowers the computational complexity and memory consumption of the multi-head attention layers. Finally, an inverted residual feed-forward network is applied in each encoder to strengthen the network's representational capacity. Comparative experiments on four datasets show that the method improves the accuracy of fine-grained feature extraction, while the improved self-attention layer reduces computation and memory consumption, increasing both the efficiency and the performance of the model.
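The spatial-reduction attention mentioned in the abstract (a mechanism popularized by the Pyramid Vision Transformer) can be sketched as follows. This is an illustrative single-head NumPy version assuming an average-pool reduction of the key/value grid; the function and weight names (`sr_attention`, `Wq`, `Wk`, `Wv`, `Wr`) are placeholders, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sr_attention(x, Wq, Wk, Wv, Wr, H, W, R):
    """Single-head spatial-reduction attention (sketch).

    x: (N, C) tokens laid out on an H x W grid, N = H * W.
    Keys and values are computed from a grid downsampled by factor R,
    so the attention map is (N, N / R^2) instead of (N, N).
    """
    N, C = x.shape
    q = x @ Wq                                   # queries from all N tokens
    # Spatial reduction: average-pool the H x W grid by R, then project.
    g = x.reshape(H // R, R, W // R, R, C).mean(axis=(1, 3))
    kv = g.reshape(-1, C) @ Wr                   # (N / R^2, C) reduced tokens
    k, v = kv @ Wk, kv @ Wv
    attn = softmax(q @ k.T / np.sqrt(C))         # (N, N / R^2) attention map
    return attn @ v                              # (N, C) output tokens
```

With a reduction ratio R = 2 the attention map has N²/4 entries instead of N², which is the source of the compute and memory savings the abstract claims for the multi-head attention layers.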



Published In

ICMLC '24: Proceedings of the 2024 16th International Conference on Machine Learning and Computing
February 2024
757 pages
ISBN:9798400709234
DOI:10.1145/3651671

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Convolutional neural network
  2. Fine-grained images
  3. Spatial-reduction attention
  4. Visual transformer

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMLC 2024
