
Multi-stream neural network fused with local information and global information for HOI detection

Applied Intelligence

Abstract

Human-Object Interaction (HOI) detection is a new kind of human-centric visual relationship detection task and is important for a deep understanding of visual scenes. Because the visual scene in an image is complex, HOI detection remains challenging, and its most critical part is feature extraction and representation. Some existing approaches rely solely on local region information and ignore global contextual information, even though global context contributes to detection for some HOI categories. Other approaches incorporate global contextual information but lose local region information. In this work, we propose a multi-stream neural network architecture, composed of three special modules, that employs both local region information and global contextual information for HOI detection. The model can detect HOI categories that depend on local region information as well as those that depend on global contextual information, so it covers the HOI categories in the dataset more completely. Compared with existing approaches, the proposed model shows improved performance on the V-COCO and HICO-DET benchmark datasets, especially when predicting rare HOI categories.
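The abstract does not say how the streams are combined, so the following is only an illustrative sketch of the general idea of fusing a local stream with a global stream, not the authors' architecture. It assumes each stream emits one logit per HOI category and that the two are merged by a weighted sum followed by a sigmoid. The function name `fuse_streams`, the parameter `local_weight`, and the example categories are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_streams(local_logits, global_logits, local_weight=0.5):
    """Hypothetical late fusion of two per-category score streams.

    local_logits:  scores derived from human/object region features (assumed).
    global_logits: scores derived from whole-image context features (assumed).
    local_weight:  share given to the local stream; the rest goes to the
                   global stream. This weighting is an illustrative choice.
    """
    if len(local_logits) != len(global_logits):
        raise ValueError("both streams must score the same HOI categories")
    if not 0.0 <= local_weight <= 1.0:
        raise ValueError("local_weight must lie in [0, 1]")
    global_weight = 1.0 - local_weight
    return [
        sigmoid(local_weight * l + global_weight * g)
        for l, g in zip(local_logits, global_logits)
    ]


# Made-up logits for two hypothetical categories: one driven mostly by the
# local regions ("hold cup"), one driven mostly by scene context ("ride bike").
categories = ["hold cup", "ride bike"]
local_scores = [3.0, -1.0]
global_scores = [-0.5, 2.5]

for name, prob in zip(categories, fuse_streams(local_scores, global_scores)):
    print(f"{name}: {prob:.3f}")
```

In this toy setup, a category that either stream scores strongly keeps a fused probability above 0.5, which mirrors the motivation stated in the abstract: some categories are recognizable from local regions and others from global context.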





Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant no. 51678075) and the Science and Technology Project of Hunan (Grant no. 2017GK2271).

Author information

Corresponding author

Correspondence to Rui Li.

Ethics declarations

Conflict of interests

The authors declare that there is no conflict of interests regarding the publication of this paper (such as financial gain).

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Xia, L., Li, R. Multi-stream neural network fused with local information and global information for HOI detection. Appl Intell 50, 4495–4505 (2020). https://doi.org/10.1007/s10489-020-01794-1


  • DOI: https://doi.org/10.1007/s10489-020-01794-1
