Diverse image search with explanations

Zhu, Xinying; Liu, Linhu

doi:10.1007/s11042-023-16393-8

Published: 10 August 2023

Volume 83, pages 23067–23082, (2024)

Multimedia Tools and Applications


Abstract

In this paper, we propose a novel content-based image search framework with explanations. The framework not only compares the similarity of images from different perspectives but also describes the commonalities of two images in natural language. Specifically, we develop a graph matching method that calculates the similarity of two images and locates their commonalities, where each graph encodes perceptual, conceptual, and relational information. Furthermore, we use a language-model-based method to generate sentences that describe the similarities of the two images. When comparing images from different perspectives, we follow the principle that rich structured representations are more important than simple ones. To evaluate this principle, we conduct experiments on the Visual Genome dataset, in which each image contains many objects and multiple object relationships. The experimental results demonstrate the effectiveness of the principle. We also evaluate our method on explaining similar images, and the results show that it achieves comparable performance.
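
The abstract does not specify how the graph matching is performed, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes each image is represented as a small scene graph whose nodes carry an object label (conceptual information) and a feature vector (perceptual information), and whose edges are subject-predicate-object triples (relational information). The names `SceneGraph`, `match_graphs`, the greedy matching strategy, the equal weights, and the scoring formula are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class SceneGraph:
    """Hypothetical scene graph for one image (illustrative only)."""
    labels: dict                                  # node id -> object label (conceptual)
    features: dict                                # node id -> feature vector (perceptual)
    relations: set = field(default_factory=set)   # (subject id, predicate, object id)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_graphs(g1, g2, w_percept=0.5, w_concept=0.5):
    """Greedy one-to-one node matching, then collect relations the mapping preserves.

    Returns (similarity, common_relations), where common_relations lists
    (subject label, predicate, object label) triples found in both graphs.
    """
    # Score every cross-graph node pair from label agreement and feature similarity.
    scored = []
    for i in g1.labels:
        for j in g2.labels:
            concept = 1.0 if g1.labels[i] == g2.labels[j] else 0.0
            percept = cosine(g1.features[i], g2.features[j])
            scored.append((w_percept * percept + w_concept * concept, i, j))
    scored.sort(key=lambda t: t[0], reverse=True)

    # Greedily accept the best-scoring pairs, using each node at most once.
    mapping, used, node_score = {}, set(), 0.0
    for score, i, j in scored:
        if score <= 0 or i in mapping or j in used:
            continue
        mapping[i] = j
        used.add(j)
        node_score += score

    # A relation is a commonality if both endpoints are matched and the
    # same predicate links the matched nodes in the other graph.
    common = [
        (g1.labels[s], p, g1.labels[o])
        for (s, p, o) in sorted(g1.relations, key=str)
        if s in mapping and o in mapping and (mapping[s], p, mapping[o]) in g2.relations
    ]

    # Normalise by the size of the larger graph so the score lies in [0, 1].
    denom = max(len(g1.labels), len(g2.labels)) + max(len(g1.relations), len(g2.relations))
    similarity = (node_score + len(common)) / denom if denom else 0.0
    return similarity, common


# Toy example with made-up labels and 2-D features.
g1 = SceneGraph(
    labels={1: "man", 2: "horse", 3: "tree"},
    features={1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]},
    relations={(1, "riding", 2), (1, "near", 3)},
)
g2 = SceneGraph(
    labels={"a": "man", "b": "horse"},
    features={"a": [1.0, 0.0], "b": [0.0, 1.0]},
    relations={("a", "riding", "b")},
)
similarity, common = match_graphs(g1, g2)
print(similarity, common)
```

In this toy case the shared triple ("man", "riding", "horse") is the located commonality; a sentence generator, such as the language-model-based method mentioned in the abstract, could take such triples as input. Greedy matching is used here only for brevity and is not guaranteed to find the best assignment.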


Data availability

The data are openly available at https://visualgenome.org/api/v0/api_home.html.


Author information

Authors and Affiliations

College of Business Administration, Nanchang Institute of Technology, Nanchang, 330044, China
Xinying Zhu
Lenovo Research, Beijing, 100085, China
Linhu Liu


Corresponding author

Correspondence to Linhu Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest regarding this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Zhu, X., Liu, L. Diverse image search with explanations. Multimed Tools Appl 83, 23067–23082 (2024). https://doi.org/10.1007/s11042-023-16393-8


Received: 06 September 2022
Revised: 25 May 2023
Accepted: 17 July 2023
Published: 10 August 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s11042-023-16393-8
