Abstract
Content-based image retrieval has advanced significantly with the development of convolutional neural networks (CNNs) and vision transformers. However, a semantic gap remains between high-level semantic information and low-level visual features. To address this problem, we propose a high-performance image retrieval method that combines a CNN with vision transformers, exploiting the local feature extraction of the CNN and the long-range dependency modeling of vision transformers. The proposed convolution and vision transformers network (CVTNet) first uses a CNN backbone to extract a feature representation of the image, then applies vision transformers to strengthen the semantic relationships among feature layers and narrow the semantic gap. Finally, we propose an adaptive-weight loss function that fuses a triplet loss with a second-order similarity loss to capture more of the image's structural information. Extensive experiments demonstrate that CVTNet achieves significant performance improvements over the baselines on the Revisited Oxford and Paris datasets.
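The abstract's loss design combines a triplet loss with a second-order similarity term under learned weights. The PyTorch sketch below illustrates one plausible form of that combination; it is not the paper's implementation. The margin value, the SOS-style formulation (penalizing differences between the anchor-side and positive-side intra-batch distance matrices), and the uncertainty-style learnable weighting are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def triplet_loss(a, p, n, margin=0.1):
    # Hinge on Euclidean distances between (assumed L2-normalized) descriptors:
    # pull anchor-positive together, push anchor-negative apart by `margin`.
    d_ap = (a - p).pow(2).sum(-1).sqrt()
    d_an = (a - n).pow(2).sum(-1).sqrt()
    return F.relu(d_ap - d_an + margin).mean()

def second_order_similarity_loss(a, p):
    # SOS-style regularizer: the anchor and its positive should sit at
    # similar distances to every other sample in the batch, so the two
    # pairwise-distance matrices should agree.
    d_a = torch.cdist(a, a)
    d_p = torch.cdist(p, p)
    return (d_a - d_p).pow(2).sum(-1).sqrt().mean()

class AdaptiveWeightLoss(nn.Module):
    """Fuses the two terms with learnable log-variance weights
    (uncertainty-style multi-task weighting; an assumed design)."""

    def __init__(self):
        super().__init__()
        self.log_var = nn.Parameter(torch.zeros(2))

    def forward(self, l_trip, l_sos):
        w = torch.exp(-self.log_var)          # per-term precision weights
        return w[0] * l_trip + w[1] * l_sos + self.log_var.sum()
```

With `log_var` initialized to zero, the fused loss starts as a plain sum of the two terms and the optimizer then adapts the balance during training; the `log_var.sum()` term discourages the trivial solution of driving both weights to zero.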
Supported by the National Key Research and Development Program (2018YFE0122900), the National Natural Science Foundation of China (61773224, 62066033), the Applied Technology Research and Development Foundation of Inner Mongolia Autonomous Region (2019GG372, 2020GG0046, 2021GG0158, 2020PT0002), the Achievements Transformation Project of Inner Mongolia Autonomous Region (2019CG028), the Natural Science Foundation of Inner Mongolia Autonomous Region (2020BS06001), the Science Foundation of Inner Mongolia College and University (NJZY20008).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhang, Q., Bao, F., Su, X., Wang, W., Gao, G. (2022). End-to-End Large-Scale Image Retrieval Network with Convolution and Vision Transformers. In: Pimenidis, E., Angelov, P., Jayne, C., Papaleonidas, A., Aydin, M. (eds) Artificial Neural Networks and Machine Learning – ICANN 2022. ICANN 2022. Lecture Notes in Computer Science, vol 13532. Springer, Cham. https://doi.org/10.1007/978-3-031-15937-4_52