
From coarse to fine: multi-level feature fusion network for fine-grained image retrieval

  • Regular Paper
  • Multimedia Systems

Abstract

Fine-grained image retrieval (FGIR) has received extensive attention in academia and industry. Despite considerable progress, the problem of large intra-class differences and small inter-class differences remains open. Existing work on fine-grained image classification, a task closely related to FGIR, focuses on learning discriminative local features to address this challenge. Given this observation, it is unreasonable for FGIR to rely only on global features (i.e., object features or image features) and ignore discriminative local features (i.e., patch features). In this paper, we propose a novel coarse-to-fine multi-level feature fusion network (MFFN) that addresses this problem through multi-level feature extraction and fusion. MFFN first uses object-level features for coarse retrieval, a step that narrows the scope of the retrieval. For the fine retrieval stage, we design converged multi-level features that mine the intrinsic correlation and complementary information between patch-level and image-level features through a deep belief network (DBN). In addition, for patch-level features, we design a new constraint to select discriminative patches and propose a weighted max-pooling method to aggregate these patches. The proposed framework achieves new state-of-the-art performance on widely used benchmarks, including the CUB-200-2011 and Oxford-Flower-102 datasets.
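
The two-stage idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the patch-selection constraint, the weighting scheme, or the DBN fusion, so everything below is an assumption made for illustration: cosine similarity as the ranking score, externally supplied patch weights, and plain concatenation standing in for the learned DBN fusion. The names `weighted_max_pool`, `fuse`, and `coarse_to_fine` are hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def weighted_max_pool(patches, weights):
    """Element-wise max over patch descriptors, each scaled by its weight.

    Assumption: the weights are given; the paper's own weighting is not
    described in the abstract.
    """
    dim = len(patches[0])
    return [max(w * p[d] for p, w in zip(patches, weights)) for d in range(dim)]


def fuse(image_feat, patches, weights):
    """Stand-in for the DBN fusion: concatenate image-level and pooled patch features."""
    return list(image_feat) + weighted_max_pool(patches, weights)


def coarse_to_fine(query, gallery, shortlist_size):
    """Return gallery indices: shortlist by object-level similarity, then re-rank by fused similarity."""
    # Coarse stage: keep only the items closest to the query at object level.
    shortlist = sorted(
        range(len(gallery)),
        key=lambda i: cosine(query["object"], gallery[i]["object"]),
        reverse=True,
    )[:shortlist_size]
    # Fine stage: re-rank the shortlist using the fused multi-level features.
    return sorted(
        shortlist,
        key=lambda i: cosine(query["fused"], gallery[i]["fused"]),
        reverse=True,
    )


# Toy data: item 2 matches the query at the fine level but is dropped by the coarse stage.
query = {"object": [1.0, 0.0], "fused": [1.0, 0.0]}
gallery = [
    {"object": [1.0, 0.0], "fused": [0.0, 1.0]},
    {"object": [0.9, 0.1], "fused": [1.0, 0.0]},
    {"object": [0.0, 1.0], "fused": [1.0, 0.0]},
]
ranking = coarse_to_fine(query, gallery, shortlist_size=2)
pooled = weighted_max_pool([[1.0, 0.0], [0.0, 2.0]], [0.5, 1.0])
```

In this toy run the coarse stage narrows three candidates to two, and the fine stage reorders only that shortlist. This shows the trade-off of such a design: fine-level comparisons are computed for fewer items, but a candidate discarded in the coarse stage cannot be recovered later.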



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61772108, 61932020 and 61976038.

Author information

Corresponding author

Correspondence to Zhihui Wang.

Additional information

Communicated by B-K Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, S., Wang, Z., Wang, N. et al. From coarse to fine: multi-level feature fusion network for fine-grained image retrieval. Multimedia Systems 28, 1515–1528 (2022). https://doi.org/10.1007/s00530-022-00899-6

  • DOI: https://doi.org/10.1007/s00530-022-00899-6
