Abstract
Recent work in deep learning has highlighted the remarkable relational reasoning capabilities of some carefully designed architectures. In this work, we employ a relationship-aware deep learning model to extract compact visual features used as relational image descriptors. In particular, we are interested in relational content-based image retrieval (R-CBIR), the task of finding images that contain similar inter-object relationships. Inspired by the relation networks (RN) employed in relational visual question answering (R-VQA), we present novel architectures that explicitly capture relational information from images in the form of network activations, which can subsequently be extracted and used as visual features. We describe a two-stage relation network module (2S-RN), trained on the R-VQA task, that is able to collect non-aggregated visual features. We then propose the aggregated visual features relation network (AVF-RN) module, which produces better relationship-aware features by learning the aggregation directly inside the network. We employ an R-CBIR ground truth built by exploiting the scene-graph similarities available in the CLEVR dataset in order to rank images in a relational fashion. Experiments show that features extracted from our 2S-RN model provide improved retrieval performance with respect to standard non-relational methods. Moreover, we demonstrate that the features extracted from the novel AVF-RN further improve the performance measured on the R-CBIR task, reaching state-of-the-art results on the proposed dataset.
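As an illustration of how extracted features can serve as image descriptors for retrieval, the following minimal sketch ranks a small database of images by similarity to a query. It is not the implementation used in this work: the choice of cosine similarity, the function names `cosine_similarity` and `rank_by_similarity`, and the toy feature values are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat an all-zero vector as having no similarity to anything.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, database):
    """Return database keys ordered from most to least similar to the query."""
    scores = {key: cosine_similarity(query, feat) for key, feat in database.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy descriptors (hypothetical values, standing in for extracted activations).
database = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0],
    "img_c": [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]

ranking = rank_by_similarity(query, database)
print(ranking)
```

In an evaluation setting, a ranking produced this way would be compared against the ground-truth ranking derived from scene-graph similarities.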
Acknowledgements
This work was partially supported by Automatic Data and documents Analysis to enhance human-based processes (ADA), CUP CIPE D55F17000290009, and by the AI4EU project, funded by the EC (H2020—Contract no. 825619). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.
Cite this article
Messina, N., Amato, G., Carrara, F. et al. Learning visual features for relational CBIR. Int J Multimed Info Retr 9, 113–124 (2020). https://doi.org/10.1007/s13735-019-00178-7
DOI: https://doi.org/10.1007/s13735-019-00178-7