ABSTRACT
Image-to-image retrieval, a fundamental task, aims to match similar images given a query image. Existing methods based on convolutional neural networks are usually sensitive to low-level visual features and ignore high-level semantic relationship information. This makes retrieving complicated images with multiple objects and various relationships a significant challenge. Although some works introduce the scene graph to capture the global semantic features of objects and their relations, they ignore local visual representations. In addition, because representations from a single modality are fragile, poisoning attacks are easy to mount in adversarial scenarios, which hurts the robustness of visually guided foundation models for image retrieval. To overcome these issues, we propose a novel hierarchical semantic-guided image-to-image retrieval method via scene graph, called Hi-SIGIR. Specifically, our method first generates the scene graph of an image. Then, our model extracts and learns both the visual and semantic features of the nodes and relations within the scene graph. Next, these features are fused to obtain local information and passed to a graph neural network to obtain global information. Using this information, the similarity between the scene graphs of images is calculated at both the local and global levels to perform image retrieval. Finally, we introduce a surrogate that calculates relevance in a cross-modal manner to better understand image content. Experimental evaluations on several widely used benchmarks demonstrate the superiority of the proposed method.
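To make the local/global comparison described in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes that fusion is plain concatenation of visual and semantic node vectors, that one round of neighbor averaging stands in for the graph neural network, and that the two similarity levels are mixed with a hypothetical weight `alpha`. All function names are illustrative, and the cross-modal surrogate is not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fuse(visual, semantic):
    """Assumed fusion: concatenate each node's visual and semantic vectors."""
    return [list(v) + list(s) for v, s in zip(visual, semantic)]


def propagate(nodes, edges):
    """One round of mean aggregation over each node and its neighbors.

    This is a simple stand-in for a graph neural network layer.
    """
    neighbors = {i: [i] for i in range(len(nodes))}
    for subj, obj in edges:
        neighbors[subj].append(obj)
        neighbors[obj].append(subj)
    dim = len(nodes[0])
    return [
        [sum(nodes[j][d] for j in neighbors[i]) / len(neighbors[i])
         for d in range(dim)]
        for i in range(len(nodes))
    ]


def local_similarity(nodes_a, nodes_b):
    """Average, over nodes of A, of the best cosine match among nodes of B."""
    return sum(max(cosine(x, y) for y in nodes_b) for x in nodes_a) / len(nodes_a)


def global_similarity(graph_a, graph_b):
    """Cosine similarity between mean-pooled, propagated node features."""
    def pool(graph):
        hidden = propagate(*graph)
        return [sum(column) / len(hidden) for column in zip(*hidden)]
    return cosine(pool(graph_a), pool(graph_b))


def graph_similarity(graph_a, graph_b, alpha=0.5):
    """Hypothetical mix of local and global scores; alpha is an assumption."""
    local = local_similarity(graph_a[0], graph_b[0])
    return alpha * local + (1 - alpha) * global_similarity(graph_a, graph_b)


# Toy scene graphs: (fused node features, list of (subject, object) edges).
query = (fuse([[1, 0], [0, 1]], [[1, 0], [0, 1]]), [(0, 1)])
candidate = (fuse([[1, 0], [1, 0]], [[1, 0], [1, 0]]), [(0, 1)])

print(graph_similarity(query, query))
print(graph_similarity(query, candidate))
```

In this toy setup a graph compared with itself scores highest, and the candidate scores lower because only one of the query's two nodes has a close match. Ranking a gallery would amount to sorting candidates by `graph_similarity` against the query.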
Index Terms
- Hi-SIGIR: Hierachical Semantic-Guided Image-to-image Retrieval via Scene Graph