Abstract
Fine-Grained Object Recognition (FGOR) equips intelligent systems with recognition capabilities at or even beyond the level of human experts, making it a core technology for applications such as biodiversity monitoring and advanced driver assistance systems. FGOR is highly challenging, and recent research has focused primarily on identifying discriminative regions. However, such methods often require extensive manual labor or expensive algorithms, may cause irreversible information loss, and pose significant barriers to practical deployment. Instead of learning to capture regions, this work strengthens the network's response to discriminative regions. We propose a multi-task attention-strengthening model (MT-ASM), inspired by the human ability to draw on experience from related tasks when solving a specific one: faced with an FGOR task, humans naturally compare images from the same and from different categories to identify which regions are discriminative and which are not. MT-ASM accordingly employs two networks during training: a major network, tasked with the main goal of category classification, and a subordinate network, which handles the subordinate task of comparing images from the same and different categories to find discriminative and non-discriminative regions. The subordinate network evaluates the major network's performance on the subordinate task, compelling the major network to improve it. Once training is complete, the subordinate network is discarded, so no additional overhead is incurred at inference. Experimental results on the CUB-200-2011, Stanford Cars, and FGVC-Aircraft datasets show that MT-ASM significantly outperforms its baselines and, given its simplicity and low overhead, remains highly competitive with state-of-the-art methods. The code is available at https://github.com/Dichao-Liu/Find-Attention-with-Comparison.
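To make the training scheme concrete, here is a minimal PyTorch-style sketch of the major/subordinate setup described above. Everything in it, the names MajorNet and SubordinateNet, the ResNet-50 backbone, the roll-based image pairing, and the loss weight lam, is an illustrative assumption rather than the authors' implementation; the actual code is in the linked repository.

```python
# Minimal sketch of the two-network training scheme from the abstract.
# All names and design choices here are assumptions, NOT the authors' code.
import torch
import torch.nn as nn
import torchvision.models as models

class MajorNet(nn.Module):
    """Backbone classifier; its feature maps are judged by the subordinate net."""
    def __init__(self, num_classes):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2048, num_classes)

    def forward(self, x):
        fmap = self.features(x)                       # (B, 2048, H, W)
        logits = self.fc(self.pool(fmap).flatten(1))  # category prediction
        return logits, fmap

class SubordinateNet(nn.Module):
    """Scores whether a pair of feature maps comes from the same category,
    pushing the major network's responses toward discriminative regions."""
    def __init__(self, channels=2048):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(2 * channels, 512, kernel_size=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, 1))

    def forward(self, fmap_a, fmap_b):
        return self.head(torch.cat([fmap_a, fmap_b], dim=1)).squeeze(1)

def training_step(major, sub, images, labels, lam=1.0):
    logits, fmap = major(images)
    cls_loss = nn.functional.cross_entropy(logits, labels)
    # Pair each image with a shifted copy of the batch; the target says
    # whether the two images share a category.
    perm = torch.roll(torch.arange(images.size(0), device=images.device), 1)
    same = (labels == labels[perm]).float()
    pair_score = sub(fmap, fmap[perm])
    sub_loss = nn.functional.binary_cross_entropy_with_logits(pair_score, same)
    return cls_loss + lam * sub_loss
```

At inference only the major network is run, so the deployed model costs no more than the plain backbone, which is the abstract's claim of zero additional overhead after the subordinate network is discarded.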





Data availability
All datasets used in this work are publicly available and can be accessed as described in the corresponding referenced papers. The source code for the approach proposed in this paper is available at the link provided within the manuscript.
References
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology (2011)
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929 (2016)
Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine-grained categorization. In: 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia (2013)
He, X., Peng, Y., Zhao, J.: Fast fine-grained image classification via weakly supervised discriminative localization. IEEE Trans. Circ. Syst. Video Technol. 29(5), 1394–1407 (2018)
Guo, P., Farrell, R.: Aligned to the object, not to the image: a unified pose-aligned representation for fine-grained recognition. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1876–1885. IEEE (2019)
Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: Advances in Neural Information Processing Systems, pp. 2017–2025 (2015)
Zhang, H., Xu, T., Elhoseiny, M., Huang, X., Zhang, S., Elgammal, A., Metaxas, D.: SPDA-CNN: Unifying semantic part detection and abstraction for fine-grained recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1143–1152 (2016)
Zhang, N., Donahue, J., Girshick, R., Darrell, T.: Part-based R-CNNs for fine-grained category detection. In: European Conference on Computer Vision, pp. 834–849. Springer (2014)
Fu, J., Zheng, H., Mei, T.: Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4438–4446 (2017)
Zhao, J., Du, B., Sun, L., Lv, W., Liu, Y., Xiong, H.: Deep multi-task learning with relational attention for business success prediction. Pattern Recogn. 110 (2020)
Gao, F., Yoon, H., Wu, T., Chu, X.: A feature transfer enabled multi-task deep learning model on medical imaging. Expert Syst. Appl. 143, 112957 (2020)
Liu, D., Wang, Y., Kato, J., Mase, K.: Contrastively-reinforced attention convolutional neural network for fine-grained image recognition. In: BMVC (2020)
Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013)
Chen, Y., Pu, Y., Zhao, Z., Xu, D., Qian, W.: Image aesthetic assessment based on emotion-assisted multi-task learning network. In: Proceedings of the 2021 6th International Conference on Multimedia Systems and Signal Processing, pp. 15–21 (2021)
Hu, T., Xiang, X., Qin, J., Tan, Y.: Audio–text retrieval based on contrastive learning and collaborative attention mechanism. Multimed. Syst. 29, 1–14 (2023)
Wong, W.J., Lai, S.-H.: Multi-task CNN for restoring corrupted fingerprint images. Pattern Recogn. 101, 107203 (2020)
Zheng, Q., Deng, J., Zhu, Z., Li, Y., Zafeiriou, S.: Decoupled multi-task learning with cyclical self-regulation for face parsing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4156–4165 (2022)
Zheng, H., Fu, J., Zha, Z.-J., Luo, J.: Looking for the devil in the details: learning trilinear attention sampling network for fine-grained image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5012–5021 (2019)
Lu, J., Zhang, W., Zhao, Y., Sun, C.: Image local structure information learning for fine-grained visual classification. Sci. Rep. 12(1), 19205 (2022)
Ge, W., Lin, X., Yu, Y.: Weakly supervised complementary parts models for fine-grained image classification from the bottom up. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3034–3043 (2019)
Qi, L., Lu, X., Li, X.: Exploiting spatial relation for fine-grained image classification. Pattern Recogn. 91, 47–55 (2019)
Yang, Z., Luo, T., Wang, D., Hu, Z., Gao, J., Wang, L.: Learning to navigate for fine-grained classification. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 420–435 (2018)
Sun, M., Yuan, Y., Zhou, F., Ding, E.: Multi-attention multi-class constraint for fine-grained image recognition. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 805–821 (2018)
Dubey, A., Gupta, O., Guo, P., Raskar, R., Farrell, R., Naik, N.: Pairwise confusion for fine-grained visual classification. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 70–86 (2018)
Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 6105–6114. PMLR, Long Beach, California, USA (2019)
Li, P., Xie, J., Wang, Q., Gao, Z.: Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 947–955 (2018)
Luo, W., Zhang, H., Li, J., Wei, X.-S.: Learning semantically enhanced feature for fine-grained image classification. IEEE Signal Process. Lett. 27, 1545–1549 (2020)
Gao, Z., Wu, Y., Bu, X., Yu, T., Yuan, J., Jia, Y.: Learning a robust representation via a deep network on symmetric positive definite manifolds. Pattern Recogn. 92, 1–12 (2019)
Xu, J., An, W., Zhang, L., Zhang, D.: Sparse, collaborative, or nonnegative representation: which helps pattern classification? Pattern Recogn. 88, 679–688 (2019)
Gao, Y., Han, X., Wang, X., Huang, W., Scott, M.: Channel interaction networks for fine-grained image categorization. In: AAAI, pp. 10818–10825 (2020)
Hu, T., Qi, H., Huang, Q., Lu, Y.: See better before looking closer: Weakly supervised data augmentation network for fine-grained visual classification. arXiv preprint arXiv:1901.09891 (2019)
Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: AutoAugment: Learning augmentation strategies from data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 113–123 (2019)
Guo, C., Lin, Y., Xu, M., Shao, M., Yao, J.: Inverse transformation sampling-based attentive cutout for fine-grained visual recognition. Vis. Comput. 39, 1–12 (2022)
Cui, Y., Song, Y., Sun, C., Howard, A., Belongie, S.: Large scale fine-grained categorization and domain-specific transfer learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4109–4118 (2018)
Ruan, M., Yu, X., Zhang, N., Hu, C., Wang, S., Li, X.: Video-based contrastive learning on decision trees: from action recognition to autism diagnosis. In: Proceedings of the 14th Conference on ACM Multimedia Systems, pp. 289–300 (2023)
Xiao, T., Wang, X., Efros, A.A., Darrell, T.: What should not be contrastive in contrastive learning. arXiv preprint arXiv:2008.05659 (2020)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.E.: A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709 (2020)
Gao, J., Xu, C.: Learning video moment retrieval without a single annotated video. IEEE Trans. Circ. Syst. Video Technol. 32(3), 1646–1657 (2022). https://doi.org/10.1109/TCSVT.2021.3075470
Gao, J., Chen, M., Xu, C.: Vectorized evidential learning for weakly-supervised temporal action localization. IEEE Trans. Pattern Anal. Mach. Intell. 45(12), 15949–15963 (2023). https://doi.org/10.1109/TPAMI.2023.3311447
Gao, J., Zhang, T., Xu, C.: Learning to model relationships for zero-shot video classification. IEEE Trans. Pattern Anal. Mach. Intell. 43(10), 3476–3491 (2021). https://doi.org/10.1109/TPAMI.2020.2985708
Hu, Y., Gao, J., Dong, J., Fan, B., Liu, H.: Exploring rich semantics for open-set action recognition. IEEE Trans. Multimed. 26, 5410–5421 (2024). https://doi.org/10.1109/TMM.2023.3333206
Lopez, P.R., Dorta, D.V., Preixens, G.C., Sitjes, J.M.G., Marva, F.X.R., Gonzalez, J.: Pay attention to the activations: a modular attention mechanism for fine-grained image recognition. IEEE Trans. Multimed. 22, 502–514 (2019)
Shu, C., Chen, X., Yu, C., Han, H.: A refined spatial transformer network. In: International Conference on Neural Information Processing, pp. 151–161. Springer (2018)
Yu, Y., Chan, K.H.R., You, C., Song, C., Ma, Y.: Learning diverse and discriminative representations via the principle of maximal coding rate reduction. Adv. Neural Inf. Process. Syst. 33, 9422–9434 (2020)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)
Woo, S., Park, J., Lee, J.-Y., Kweon, I.S.: CBAM: Convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456. PMLR (2015)
Hanselmann, H., Ney, H.: ELoPE: Fine-grained visual classification with efficient localization, pooling and embedding. In: The IEEE Winter Conference on Applications of Computer Vision, pp. 1247–1256 (2020)
Tan, M., Wang, G., Zhou, J., Peng, Z., Zheng, M.: Fine-grained classification via hierarchical bilinear pooling with aggregated slack mask. IEEE Access 7, 117944–117953 (2019)
Author information
Contributions
DL conceived, designed, and supervised the whole study, developed the proposed approach, planned the experiments, drafted the manuscript, and designed the figures and tables. YW, KM, and JK were responsible for supervising this project. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Communicated by Junyu Gao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, D., Wang, Y., Mase, K. et al. MT-ASM: a multi-task attention strengthening model for fine-grained object recognition. Multimedia Systems 30, 297 (2024). https://doi.org/10.1007/s00530-024-01446-1