Visual relationship detection based on bidirectional recurrent neural network

Multimedia Tools and Applications

Abstract

Visual relationship detection aims to mine the interactions between paired objects in an image and to describe the image in the form of (subject-predicate-object) triplets. Most previous work treats it as a pure classification problem, taking each whole triplet as a label for the image; however, the large number of object combinations and the diversity of predicates pose serious challenges for this approach. Hence, we propose a deep model based on a modified bidirectional recurrent neural network (BRNN) that classifies objects and predicts predicates simultaneously. Using the BRNN, we extract the hidden relationship information in the image, and we propose a feature-infusion method. Additionally, we improve on existing work by introducing a paired non-maximum suppression method. Experiments show that our approach is competitive with state-of-the-art methods.
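The abstract names a paired non-maximum suppression step but does not define it. The sketch below illustrates one plausible reading and is not the authors' method: it assumes each candidate relationship is a (subject box, object box, score) tuple, and that a candidate is suppressed only when both of its boxes overlap a higher-scoring kept candidate above a threshold. The names `iou`, `paired_nms`, `Pair`, and the 0.5 threshold are hypothetical choices made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Pair = Tuple[Box, Box, float]            # (subject box, object box, score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def paired_nms(pairs: List[Pair], threshold: float = 0.5) -> List[int]:
    """Return indices of kept pairs, highest score first.

    Assumed rule (not taken from the paper): a pair is suppressed when BOTH
    its subject box and its object box overlap the corresponding boxes of a
    higher-scoring kept pair with IoU above `threshold`.
    """
    order = sorted(range(len(pairs)), key=lambda i: pairs[i][2], reverse=True)
    kept: List[int] = []
    for i in order:
        subj_i, obj_i, _ = pairs[i]
        duplicate = any(
            iou(subj_i, pairs[k][0]) > threshold
            and iou(obj_i, pairs[k][1]) > threshold
            for k in kept
        )
        if not duplicate:
            kept.append(i)
    return kept


candidates = [
    ((0, 0, 10, 10), (20, 0, 30, 10), 0.9),   # reference pair
    ((1, 0, 11, 10), (20, 1, 30, 11), 0.8),   # near-duplicate of pair 0
    ((0, 0, 10, 10), (50, 50, 60, 60), 0.7),  # same subject, different object
]
print(paired_nms(candidates))  # [0, 2]
```

Under this assumed rule, the third candidate survives even though its subject box coincides with the first candidate's, because its object box is different. Ordinary per-box suppression would not make that distinction, which is the motivation for checking the two boxes jointly.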

Acknowledgements

This project is partly supported by NSF of China (61773117, 61473086).

Author information

Corresponding author

Correspondence to Jian Dong.

Cite this article

Dai, Y., Wang, C., Dong, J. et al. Visual relationship detection based on bidirectional recurrent neural network. Multimed Tools Appl 79, 35297–35313 (2020). https://doi.org/10.1007/s11042-019-7732-z

  • DOI: https://doi.org/10.1007/s11042-019-7732-z
