Diverse image search with explanations

Multimedia Tools and Applications

Abstract

In this paper, we propose a novel content-based image search framework with explanations, which not only compares images for similarity from different perspectives but also describes the commonalities of two images in natural language. Specifically, we develop a graph matching method that calculates the similarity of two images and locates their commonalities, where each graph encodes perceptual, conceptual, and relational information. Furthermore, we use a language-model-based method to generate sentences that describe the similarities of two images. When comparing images from different perspectives, we follow the principle that rich structured representations are more important than simple ones. To evaluate this principle, we conduct experiments on the Visual Genome dataset, in which each image contains many objects and multiple object relationships. The experimental results demonstrate the effectiveness of the principle. We also evaluate our method on explaining similar images, and the experimental results show that it achieves comparable performance.
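
As a rough illustration of the idea in the abstract, the following toy sketch compares two hand-written scene graphs by set overlap and reports what they share. It is not the authors' graph matching method or their language model: the `SceneGraph` class, the Jaccard scoring, the weights (which favor relations only to mimic the stated principle that richer structure matters more), and the template sentence are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneGraph:
    """Hypothetical container for one image's annotations."""
    objects: frozenset      # conceptual: object category labels
    attributes: frozenset   # perceptual: (object, attribute) pairs
    relations: frozenset    # relational: (subject, predicate, object) triples


def jaccard(a, b):
    """Overlap of two sets in [0, 1]; two empty sets score 0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compare(g1, g2, weights=(1.0, 1.0, 2.0)):
    """Return a weighted similarity score and the shared elements.

    The weights are an assumption: relations count double here.
    """
    common = {
        "objects": g1.objects & g2.objects,
        "attributes": g1.attributes & g2.attributes,
        "relations": g1.relations & g2.relations,
    }
    scores = (
        jaccard(g1.objects, g2.objects),
        jaccard(g1.attributes, g2.attributes),
        jaccard(g1.relations, g2.relations),
    )
    total = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    return total, common


def explain(common):
    """Template-based stand-in for a learned sentence generator."""
    if common["relations"]:
        parts = sorted(f"a {s} {p} a {o}" for s, p, o in common["relations"])
        return "Both images show " + " and ".join(parts) + "."
    if common["objects"]:
        return "Both images contain: " + ", ".join(sorted(common["objects"])) + "."
    return "The images share no annotated content."


# Two made-up annotation sets in the style of scene-graph datasets.
g1 = SceneGraph(
    objects=frozenset({"man", "horse", "hat"}),
    attributes=frozenset({("horse", "brown")}),
    relations=frozenset({("man", "riding", "horse"), ("man", "wearing", "hat")}),
)
g2 = SceneGraph(
    objects=frozenset({"man", "horse", "field"}),
    attributes=frozenset({("horse", "brown")}),
    relations=frozenset({("man", "riding", "horse")}),
)

score, common = compare(g1, g2)
sentence = explain(common)
print(round(score, 3), sentence)
```

Exact set intersection only matches identical labels; a real matcher would also need to handle near-synonyms and multiple instances of the same object category, which this sketch ignores.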

Data availability

The data are openly available at https://visualgenome.org/api/v0/api_home.html.

Author information

Corresponding author

Correspondence to Linhu Liu.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest regarding this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhu, X., Liu, L. Diverse image search with explanations. Multimed Tools Appl 83, 23067–23082 (2024). https://doi.org/10.1007/s11042-023-16393-8

  • DOI: https://doi.org/10.1007/s11042-023-16393-8
