Which and How Many Regions to Gaze: Focus Discriminative Regions for Fine-Grained Visual Categorization

Published in the International Journal of Computer Vision.

Abstract

Fine-grained visual categorization (FGVC) aims to discriminate among similar subcategories that belong to the same superclass. Since the distinctions among similar subcategories are subtle and local, they are hard to tell apart even for humans, so localizing the distinctions is essential for FGVC. Two problems are pivotal: (1) Which regions are discriminative and representative enough to distinguish a subcategory from the others? (2) How many discriminative regions are necessary to achieve the best categorization performance? Addressing these two problems adaptively and intelligently remains difficult: existing mainstream methods rely on artificial priors and experimental validation to decide which and how many regions to gaze at, and this reliance severely restricts their usability and scalability. To address both problems, this paper proposes a multi-scale and multi-granularity deep reinforcement learning approach (M2DRL), which learns multi-granularity discriminative region attention and multi-scale region-based feature representations. Its main contributions are as follows: (1) Multi-granularity discriminative localization localizes the distinctions via a two-stage deep reinforcement learning approach, which discovers discriminative regions at multiple granularities in a hierarchical manner (the "which" problem) and determines the number of discriminative regions automatically and adaptively (the "how many" problem). (2) Multi-scale representation learning localizes regions at different scales and encodes images at different scales, boosting fine-grained visual categorization performance. (3) A semantic reward function drives M2DRL to fully capture salient and conceptual visual information by jointly considering attention and category information in the reward, which allows the deep reinforcement learning to localize the distinctions in a weakly supervised or even unsupervised manner. (4) Unsupervised discriminative localization is further explored to avoid the heavy labor of annotation and to greatly strengthen the usability and scalability of the M2DRL approach. Compared with state-of-the-art methods on two widely used fine-grained visual categorization datasets, our M2DRL approach achieves the best categorization accuracy.
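The "which" and "how many" decisions described in the abstract can be pictured as a sequential selection process with an adaptive stop. The sketch below is illustrative only: `semantic_reward`, `select_regions`, the greedy loop, and all weights are hypothetical stand-ins for the paper's learned two-stage policy, not the authors' actual method.

```python
def semantic_reward(attention_gain, category_gain, alpha=0.5):
    """Toy semantic reward: a weighted mix of an attention term (how much
    saliency the region covers) and a category term (how much the region
    raises classification confidence). alpha is an invented weight."""
    return alpha * attention_gain + (1 - alpha) * category_gain

def select_regions(regions, score_fn, stop_threshold=0.05, max_regions=5):
    """Greedy stand-in for a learned policy: repeatedly add the region
    with the highest marginal reward, and stop once the best remaining
    reward falls below the threshold -- so the *number* of regions is
    decided adaptively rather than fixed in advance."""
    chosen, remaining = [], list(regions)
    while remaining and len(chosen) < max_regions:
        rewards = [score_fn(r, chosen) for r in remaining]
        best = max(range(len(rewards)), key=rewards.__getitem__)
        if rewards[best] < stop_threshold:  # adaptive "stop" action
            break
        chosen.append(remaining.pop(best))
    return chosen

# Toy demo with invented (attention_gain, category_gain) pairs per region.
toy = {"head": (0.8, 0.4), "wing": (0.4, 0.2),
       "tail": (0.1, 0.06), "background": (0.01, 0.01)}
picked = select_regions(list(toy), lambda r, _chosen: semantic_reward(*toy[r]))
print(picked)  # -> ['head', 'wing', 'tail']
```

Note how "background" is never selected: its reward falls below the stop threshold, so selection terminates with three regions instead of a preset count.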


Figures 1–15 appear in the full article.


Notes

  1. http://www.vision.caltech.edu/visipedia/CUB-200-2011.html.

  2. http://ai.stanford.edu/~jkrause/cars/car_dataset.html.


Author information

Correspondence to Yuxin Peng.

Additional information

Communicated by Dr. S.-C. Zhu.


This work was supported by the National Natural Science Foundation of China under Grants 61771025 and 61532005.


Cite this article

He, X., Peng, Y. & Zhao, J. Which and How Many Regions to Gaze: Focus Discriminative Regions for Fine-Grained Visual Categorization. Int J Comput Vis 127, 1235–1255 (2019). https://doi.org/10.1007/s11263-019-01176-2
