Abstract
Image retrieval is the challenging task of searching a database for images similar to a given query image. Previous learning-based methods adopt various ingenious designs to increase the number of representative positive and negative sample pairs during training, yet their performance remains inherently limited by the size of the mini-batch. To this end, we introduce the learnable descriptor graph convolutional network (LDGC-Net), which effectively enhances the hard-mining ability of the model and sharpens the boundaries between different categories. We present an analysis of why LDGC-Net can aggregate relationships between the original descriptors within a mini-batch of constrained size. We also propose an innovative end-to-end training framework with LDGC-Net for image retrieval that accelerates model convergence. In particular, LDGC-Net can be conveniently integrated into other current methods as a plug-and-play module with negligible computational cost. Experimental results on three benchmark datasets show that the proposed LDGC-Net improves performance compared with several state-of-the-art approaches.
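To make the plug-and-play idea concrete, the following is a minimal sketch of a graph-convolution block that operates on the global descriptors of one mini-batch: a learnable projection scores pairwise affinities, the normalized affinity matrix mixes descriptors across the batch, and a residual connection preserves the original embedding. The module, its parameter names, and the specific normalization choices here are illustrative assumptions, not the exact architecture described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DescriptorGraphConv(nn.Module):
    """Hypothetical graph-convolution block over a mini-batch of descriptors."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.edge_proj = nn.Linear(dim, dim, bias=False)  # learns the graph (edge) structure
        self.node_proj = nn.Linear(dim, dim)              # transforms the aggregated descriptors
        self.norm = nn.LayerNorm(dim)

    def forward(self, desc: torch.Tensor) -> torch.Tensor:
        # desc: (B, dim) global image descriptors produced by a CNN backbone
        x = F.normalize(desc, dim=-1)
        # Learnable affinity between every pair of descriptors in the batch.
        affinity = self.edge_proj(x) @ x.t()              # (B, B)
        adj = F.softmax(affinity, dim=-1)                 # row-normalized adjacency matrix
        # Aggregate neighbor information and keep a residual path to the input.
        out = self.norm(desc + self.node_proj(adj @ x))
        return F.normalize(out, dim=-1)


if __name__ == "__main__":
    batch = torch.randn(32, 512)                 # e.g., backbone outputs for one mini-batch
    refined = DescriptorGraphConv(512)(batch)
    print(refined.shape)                         # torch.Size([32, 512])
```

Because the block only consumes the descriptors already present in a mini-batch, it can be appended after the embedding head of an existing retrieval pipeline without changing the backbone or the loss.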
Data availability
The datasets used in this paper are public datasets and can be obtained by contacting the relevant providers.
Acknowledgements
This work is supported by a grant from the Key Laboratory of Avionics System Integrated Technology, the Fundamental Research Funds for the Central Universities in China (Grant No. 3072022JC0601), and the National Natural Science Foundation of China (Grant No. 41876110).
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Wang, X., Wang, J., Kang, M. et al. LDGC-Net: learnable descriptor graph convolutional network for image retrieval. Vis Comput 39, 6639–6653 (2023). https://doi.org/10.1007/s00371-022-02753-2