DOI: 10.1145/3581783.3612283
research-article

Hi-SIGIR: Hierarchical Semantic-Guided Image-to-image Retrieval via Scene Graph

Published: 27 October 2023

ABSTRACT

Image-to-image retrieval is a fundamental task that aims to find images similar to a query image. Existing methods based on convolutional neural networks are usually sensitive to low-level visual features and ignore high-level semantic relationships, which makes retrieving complicated images with multiple objects and various relationships a significant challenge. Although some works introduce the scene graph to capture the global semantic features of objects and their relations, they ignore local visual representations. In addition, because single-modality representations are fragile, poisoning attacks are easy to mount in adversarial scenarios, hurting the robustness of visual-guided foundation image retrieval models. To overcome these issues, we propose a novel hierarchical semantic-guided image-to-image retrieval method via scene graph, called Hi-SIGIR. Specifically, our method first generates the scene graph of an image. Then, our model extracts and learns both the visual and semantic features of the nodes and relations within the scene graph. Next, these features are fused to obtain local information and passed to a graph neural network to obtain global information. Using this information, the similarity between the scene graphs of several images is calculated at both the local and global levels to perform image retrieval. Finally, we introduce a surrogate that calculates relevance in a cross-modal manner to better understand image content. Experimental evaluations on several widely used benchmarks demonstrate the superiority of the proposed method.
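The pipeline described in the abstract (fuse per-node visual and semantic features, propagate them over the graph, then compare graphs at a local and a global level) can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. Fusion by concatenation, a single round of mean aggregation in place of a trained graph neural network, cosine similarity, the weighting parameter `alpha`, and all function names below are assumptions made only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def fuse(visual, semantic):
    """Fuse per-node visual and semantic features by concatenation (assumed)."""
    return [v + s for v, s in zip(visual, semantic)]


def message_pass(nodes, edges):
    """One round of mean aggregation over each node and its neighbours.

    Stands in for a trained GNN layer; edges are treated as undirected (assumed).
    """
    neigh = {i: [i] for i in range(len(nodes))}
    for a, b in edges:
        neigh[a].append(b)
        neigh[b].append(a)
    dim = len(nodes[0])
    return [
        [sum(nodes[j][d] for j in neigh[i]) / len(neigh[i]) for d in range(dim)]
        for i in range(len(nodes))
    ]


def mean_pool(nodes):
    """Average node vectors into one graph-level vector."""
    dim = len(nodes[0])
    return [sum(n[d] for n in nodes) / len(nodes) for d in range(dim)]


def local_similarity(a, b):
    """Symmetric average of each node's best cosine match in the other graph."""
    a_to_b = sum(max(cosine(x, y) for y in b) for x in a) / len(a)
    b_to_a = sum(max(cosine(y, x) for x in a) for y in b) / len(b)
    return 0.5 * (a_to_b + b_to_a)


def graph_similarity(g1, g2, alpha=0.5):
    """Weighted sum of local (node-level) and global (graph-level) similarity."""
    n1 = fuse(g1["visual"], g1["semantic"])
    n2 = fuse(g2["visual"], g2["semantic"])
    local = local_similarity(n1, n2)
    glob = cosine(
        mean_pool(message_pass(n1, g1["edges"])),
        mean_pool(message_pass(n2, g2["edges"])),
    )
    return alpha * local + (1 - alpha) * glob


# Toy scene graphs: two nodes joined by one relation, and a single-node graph.
query = {"visual": [[1.0, 0.0], [0.0, 1.0]], "semantic": [[1.0], [0.0]], "edges": [(0, 1)]}
gallery = {
    "same": {"visual": [[1.0, 0.0], [0.0, 1.0]], "semantic": [[1.0], [0.0]], "edges": [(0, 1)]},
    "partial": {"visual": [[0.0, 1.0]], "semantic": [[0.0]], "edges": []},
}

# Rank gallery images by similarity to the query, most similar first.
ranked = sorted(gallery, key=lambda k: graph_similarity(query, gallery[k]), reverse=True)
```

In this sketch `alpha` trades off node-level matching against graph-level matching. The method described above learns its features and similarity functions rather than fixing them by hand, so the numbers produced here carry no empirical meaning.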


    • Published in

      MM '23: Proceedings of the 31st ACM International Conference on Multimedia
      October 2023
      9913 pages
      ISBN: 9798400701085
      DOI: 10.1145/3581783

      Copyright © 2023 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 995 of 4,171 submissions, 24%
