Abstract
In this paper, we propose a novel content-based image search framework with explanations, which not only compares the similarity of images from different perspectives but also describes the commonalities of two images in natural language. Specifically, we develop a graph matching method to calculate the similarity of two images and locate their commonalities, where each graph encodes perceptual, conceptual, and relational information. Furthermore, we use a language-model-based method to generate sentences that describe the similarities of the two images. When comparing images from different perspectives, we follow the principle that rich structured representations are more important than simple ones. To evaluate this principle, we conduct experiments on the Visual Genome dataset, in which each image contains many objects and multiple object relationships. The experimental results support the principle. We also evaluate our method on explaining similar images, and the results show that it achieves comparable performance.
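To make the idea of comparing two images through their graphs more concrete, the following is a minimal sketch and not the method of the paper. It assumes each image has already been reduced to a toy scene graph of object labels (conceptual information) and (subject, predicate, object) triples (relational information); perceptual features are left out. The names SceneGraph, compare, and explain, the Jaccard-based score, the equal weights, and the sentence template standing in for the language model are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneGraph:
    """Toy scene graph: object labels plus (subject, predicate, object) triples."""
    objects: frozenset
    relations: frozenset


def jaccard(a, b):
    """Overlap of two sets in [0, 1]; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def compare(g1, g2, w_obj=0.5, w_rel=0.5):
    """Return (similarity score, shared objects, shared relation triples).

    The weighted Jaccard score is an assumed stand-in for graph matching.
    """
    shared_objects = g1.objects & g2.objects
    shared_relations = g1.relations & g2.relations
    score = (w_obj * jaccard(g1.objects, g2.objects)
             + w_rel * jaccard(g1.relations, g2.relations))
    return score, shared_objects, shared_relations


def explain(shared_relations):
    """Template-based description of the commonalities (assumed placeholder
    for a language-model-based generator)."""
    if not shared_relations:
        return "The two images share no relationships."
    phrases = sorted(f"a {s} {p} a {o}" for s, p, o in shared_relations)
    return "Both images show " + " and ".join(phrases) + "."


# Two hand-written example graphs (illustrative data, not from Visual Genome).
a = SceneGraph(
    objects=frozenset({"man", "horse", "hat"}),
    relations=frozenset({("man", "riding", "horse"), ("man", "wearing", "hat")}),
)
b = SceneGraph(
    objects=frozenset({"man", "horse", "field"}),
    relations=frozenset({("man", "riding", "horse")}),
)

score, shared_objects, shared_relations = compare(a, b)
print(score)                      # 0.5
print(explain(shared_relations))  # Both images show a man riding a horse.
```

In this sketch the shared triples serve both purposes named in the abstract: they contribute to the similarity score and they are the located commonalities that the explanation is generated from.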
Data availability
The data are openly available at https://visualgenome.org/api/v0/api_home.html.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest regarding this article.
About this article
Cite this article
Zhu, X., Liu, L. Diverse image search with explanations. Multimed Tools Appl 83, 23067–23082 (2024). https://doi.org/10.1007/s11042-023-16393-8
DOI: https://doi.org/10.1007/s11042-023-16393-8