Abstract
Visual relationship detection aims to mine the interactions between paired objects in an image and to describe the image in the form (subject − predicate − object). Most previous works treat it as a pure classification problem, taking the whole triplet as the label of the image; however, the large number of object combinations and the diversity of predicates pose serious challenges for these approaches. We therefore propose a deep model based on a modified bidirectional recurrent neural network (BRNN) that classifies objects and predicts predicates simultaneously. The BRNN extracts the hidden relationship information in the image, and we propose a feature-infusion method. Additionally, we improve on existing works by introducing a paired non-maximum suppression method. Experiments show that our approach is competitive with state-of-the-art works.
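The abstract does not spell out how paired non-maximum suppression works, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes that each candidate relationship carries a subject box and an object box, and that a candidate is suppressed only when both boxes overlap those of a higher-scoring kept candidate. The names `iou` and `paired_nms`, the box layout, and the 0.5 threshold are all hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)
Pair = Tuple[Box, Box]                   # (subject_box, object_box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def paired_nms(
    pairs: Sequence[Pair],
    scores: Sequence[float],
    threshold: float = 0.5,  # hypothetical default, not taken from the paper
) -> List[int]:
    """Greedy NMS over (subject_box, object_box) pairs.

    Assumption: a pair counts as a duplicate only when BOTH its subject
    box and its object box overlap those of an already-kept,
    higher-scoring pair by more than `threshold`.
    Returns the indices of the kept pairs, highest score first.
    """
    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        subj_i, obj_i = pairs[i]
        duplicate = any(
            iou(subj_i, pairs[k][0]) > threshold
            and iou(obj_i, pairs[k][1]) > threshold
            for k in kept
        )
        if not duplicate:
            kept.append(i)
    return kept
```

Under this reading, two candidates that share a subject but point to different objects both survive, which ordinary per-box suppression would not guarantee.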
Acknowledgements
This project is partly supported by NSF of China (61773117, 61473086).
Cite this article
Dai, Y., Wang, C., Dong, J. et al. Visual relationship detection based on bidirectional recurrent neural network. Multimed Tools Appl 79, 35297–35313 (2020). https://doi.org/10.1007/s11042-019-7732-z
DOI: https://doi.org/10.1007/s11042-019-7732-z