Abstract
Fine-Grained Object Recognition (FGOR) equips intelligent systems with recognition capabilities at or even beyond the level of human experts, making it a core technology for applications such as biodiversity monitoring and advanced driver assistance systems. FGOR is highly challenging, and recent research has focused primarily on identifying discriminative regions. However, such methods often require extensive manual labor or expensive algorithms, may cause irreversible information loss, and pose significant barriers to practical deployment. Instead of learning to capture regions, this work strengthens the network's response to discriminative regions. We propose a multi-task attention-strengthening model (MT-ASM), inspired by the human ability to draw on experience from related tasks when solving a specific one: faced with an FGOR task, humans naturally compare images from the same and from different categories to identify which regions are discriminative and which are not. MT-ASM accordingly employs two networks during training: a major network, tasked with the main goal of category classification, and a subordinate network, which handles the subordinate task of comparing images from the same and different categories to find discriminative and non-discriminative regions. The subordinate network evaluates the major network's performance on the subordinate task, compelling the major network to improve it. Once training is complete, the subordinate network is discarded, so no additional overhead is incurred at inference. Experimental results on the CUB-200-2011, Stanford Cars, and FGVC-Aircraft datasets show that MT-ASM significantly outperforms its baselines and, given its simplicity and low overhead, remains highly competitive with state-of-the-art methods. The code is available at https://github.com/Dichao-Liu/Find-Attention-with-Comparison.
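To make the training scheme concrete, here is a minimal PyTorch-style sketch of the major/subordinate setup described above. Everything in it, the names MajorNet and SubordinateNet, the ResNet-50 backbone, the roll-based image pairing, and the loss weight lam, is an illustrative assumption rather than the authors' implementation; the actual code is in the linked repository.

```python
# Minimal sketch of the two-network training scheme from the abstract.
# All names and design choices here are assumptions, NOT the authors' code.
import torch
import torch.nn as nn
import torchvision.models as models

class MajorNet(nn.Module):
    """Backbone classifier; its feature maps are judged by the subordinate net."""
    def __init__(self, num_classes):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2048, num_classes)

    def forward(self, x):
        fmap = self.features(x)                       # (B, 2048, H, W)
        logits = self.fc(self.pool(fmap).flatten(1))  # category prediction
        return logits, fmap

class SubordinateNet(nn.Module):
    """Scores whether a pair of feature maps comes from the same category,
    pushing the major network's responses toward discriminative regions."""
    def __init__(self, channels=2048):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(2 * channels, 512, kernel_size=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, 1))

    def forward(self, fmap_a, fmap_b):
        return self.head(torch.cat([fmap_a, fmap_b], dim=1)).squeeze(1)

def training_step(major, sub, images, labels, lam=1.0):
    logits, fmap = major(images)
    cls_loss = nn.functional.cross_entropy(logits, labels)
    # Pair each image with a shifted copy of the batch; the target says
    # whether the two images share a category.
    perm = torch.roll(torch.arange(images.size(0), device=images.device), 1)
    same = (labels == labels[perm]).float()
    pair_score = sub(fmap, fmap[perm])
    sub_loss = nn.functional.binary_cross_entropy_with_logits(pair_score, same)
    return cls_loss + lam * sub_loss
```

At inference only the major network is run, so the deployed model costs no more than the plain backbone, which is the abstract's claim of zero additional overhead after the subordinate network is discarded.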





Data availability
All datasets used in this work are publicly available and can be accessed as described in the corresponding referenced papers. The source code for the approach proposed in this paper is available at the link provided within the manuscript.
References
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology (2011)
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929 (2016)
Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine-grained categorization. In: 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia (2013)
He, X., Peng, Y., Zhao, J.: Fast fine-grained image classification via weakly supervised discriminative localization. IEEE Trans. Circ. Syst. Video Technol. 29(5), 1394–1407 (2018)
Guo, P., Farrell, R.: Aligned to the object, not to the image: a unified pose-aligned representation for fine-grained recognition. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1876–1885. IEEE (2019)
Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: Advances in Neural Information Processing Systems, pp. 2017–2025 (2015)
Zhang, H., Xu, T., Elhoseiny, M., Huang, X., Zhang, S., Elgammal, A., Metaxas, D.: SPDA-CNN: Unifying semantic part detection and abstraction for fine-grained recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1143–1152 (2016)
Zhang, N., Donahue, J., Girshick, R., Darrell, T.: Part-based R-CNNs for fine-grained category detection. In: European Conference on Computer Vision, pp. 834–849. Springer (2014)
Fu, J., Zheng, H., Mei, T.: Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4438–4446 (2017)
Zhao, J., Du, B., Sun, L., Lv, W., Liu, Y., Xiong, H.: Deep multi-task learning with relational attention for business success prediction. Pattern Recogn. 110 (2020)
Gao, F., Yoon, H., Wu, T., Chu, X.: A feature transfer enabled multi-task deep learning model on medical imaging. Expert Syst. Appl. 143, 112957 (2020)
Liu, D., Wang, Y., Kato, J., Mase, K.: Contrastively-reinforced attention convolutional neural network for fine-grained image recognition. In: BMVC (2020)
Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013)
Chen, Y., Pu, Y., Zhao, Z., Xu, D., Qian, W.: Image aesthetic assessment based on emotion-assisted multi-task learning network. In: Proceedings of the 2021 6th International Conference on Multimedia Systems and Signal Processing, pp. 15–21 (2021)
Hu, T., Xiang, X., Qin, J., Tan, Y.: Audio–text retrieval based on contrastive learning and collaborative attention mechanism. Multimed. Syst. 29, 1–14 (2023)
Wong, W.J., Lai, S.-H.: Multi-task CNN for restoring corrupted fingerprint images. Pattern Recogn. 101, 107203 (2020)
Zheng, Q., Deng, J., Zhu, Z., Li, Y., Zafeiriou, S.: Decoupled multi-task learning with cyclical self-regulation for face parsing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4156–4165 (2022)
Zheng, H., Fu, J., Zha, Z.-J., Luo, J.: Looking for the devil in the details: learning trilinear attention sampling network for fine-grained image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5012–5021 (2019)
Lu, J., Zhang, W., Zhao, Y., Sun, C.: Image local structure information learning for fine-grained visual classification. Sci. Rep. 12(1), 19205 (2022)
Ge, W., Lin, X., Yu, Y.: Weakly supervised complementary parts models for fine-grained image classification from the bottom up. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3034–3043 (2019)
Qi, L., Lu, X., Li, X.: Exploiting spatial relation for fine-grained image classification. Pattern Recogn. 91, 47–55 (2019)
Yang, Z., Luo, T., Wang, D., Hu, Z., Gao, J., Wang, L.: Learning to navigate for fine-grained classification. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 420–435 (2018)
Sun, M., Yuan, Y., Zhou, F., Ding, E.: Multi-attention multi-class constraint for fine-grained image recognition. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 805–821 (2018)
Dubey, A., Gupta, O., Guo, P., Raskar, R., Farrell, R., Naik, N.: Pairwise confusion for fine-grained visual classification. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 70–86 (2018)
Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 6105–6114. PMLR, Long Beach, California, USA (2019)
Li, P., Xie, J., Wang, Q., Gao, Z.: Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 947–955 (2018)
Luo, W., Zhang, H., Li, J., Wei, X.-S.: Learning semantically enhanced feature for fine-grained image classification. IEEE Signal Process. Lett. 27, 1545–1549 (2020)
Gao, Z., Wu, Y., Bu, X., Yu, T., Yuan, J., Jia, Y.: Learning a robust representation via a deep network on symmetric positive definite manifolds. Pattern Recogn. 92, 1–12 (2019)
Xu, J., An, W., Zhang, L., Zhang, D.: Sparse, collaborative, or nonnegative representation: which helps pattern classification? Pattern Recogn. 88, 679–688 (2019)
Gao, Y., Han, X., Wang, X., Huang, W., Scott, M.: Channel interaction networks for fine-grained image categorization. In: AAAI, pp. 10818–10825 (2020)
Hu, T., Qi, H., Huang, Q., Lu, Y.: See better before looking closer: Weakly supervised data augmentation network for fine-grained visual classification. arXiv preprint arXiv:1901.09891 (2019)
Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: AutoAugment: Learning augmentation strategies from data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 113–123 (2019)
Guo, C., Lin, Y., Xu, M., Shao, M., Yao, J.: Inverse transformation sampling-based attentive cutout for fine-grained visual recognition. Vis. Comput. 39, 1–12 (2022)
Cui, Y., Song, Y., Sun, C., Howard, A., Belongie, S.: Large scale fine-grained categorization and domain-specific transfer learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4109–4118 (2018)
Ruan, M., Yu, X., Zhang, N., Hu, C., Wang, S., Li, X.: Video-based contrastive learning on decision trees: from action recognition to autism diagnosis. In: Proceedings of the 14th Conference on ACM Multimedia Systems, pp. 289–300 (2023)
Xiao, T., Wang, X., Efros, A.A., Darrell, T.: What should not be contrastive in contrastive learning. arXiv preprint arXiv:2008.05659 (2020)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.E.: A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709 (2020)
Gao, J., Xu, C.: Learning video moment retrieval without a single annotated video. IEEE Trans. Circ. Syst. Video Technol. 32(3), 1646–1657 (2022). https://doi.org/10.1109/TCSVT.2021.3075470
Gao, J., Chen, M., Xu, C.: Vectorized evidential learning for weakly-supervised temporal action localization. IEEE Trans. Pattern Anal. Mach. Intell. 45(12), 15949–15963 (2023). https://doi.org/10.1109/TPAMI.2023.3311447
Gao, J., Zhang, T., Xu, C.: Learning to model relationships for zero-shot video classification. IEEE Trans. Pattern Anal. Mach. Intell. 43(10), 3476–3491 (2021). https://doi.org/10.1109/TPAMI.2020.2985708
Hu, Y., Gao, J., Dong, J., Fan, B., Liu, H.: Exploring rich semantics for open-set action recognition. IEEE Trans. Multimed. 26, 5410–5421 (2024). https://doi.org/10.1109/TMM.2023.3333206
Lopez, P.R., Dorta, D.V., Preixens, G.C., Sitjes, J.M.G., Marva, F.X.R., Gonzalez, J.: Pay attention to the activations: a modular attention mechanism for fine-grained image recognition. IEEE Trans. Multimed. 22, 502–514 (2019)
Shu, C., Chen, X., Yu, C., Han, H.: A refined spatial transformer network. In: International Conference on Neural Information Processing, pp. 151–161. Springer (2018)
Yu, Y., Chan, K.H.R., You, C., Song, C., Ma, Y.: Learning diverse and discriminative representations via the principle of maximal coding rate reduction. Adv. Neural Inf. Process. Syst. 33, 9422–9434 (2020)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)
Woo, S., Park, J., Lee, J.-Y., Kweon, I.S.: CBAM: Convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456. PMLR (2015)
Hanselmann, H., Ney, H.: ELoPE: Fine-grained visual classification with efficient localization, pooling and embedding. In: The IEEE Winter Conference on Applications of Computer Vision, pp. 1247–1256 (2020)
Tan, M., Wang, G., Zhou, J., Peng, Z., Zheng, M.: Fine-grained classification via hierarchical bilinear pooling with aggregated slack mask. IEEE Access 7, 117944–117953 (2019)
Author information
Contributions
DL conceived, designed, and supervised the whole study, developed the proposed approach, planned the experiments, drafted the manuscript, and designed the figures and tables. YW, KM, and JK were responsible for supervising this project. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Communicated by Junyu Gao.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, D., Wang, Y., Mase, K. et al. MT-ASM: a multi-task attention strengthening model for fine-grained object recognition. Multimedia Systems 30, 297 (2024). https://doi.org/10.1007/s00530-024-01446-1