Pose attention and object semantic representation-based human-object interaction detection network

Multimedia Tools and Applications

Abstract

Human-object interaction (HOI) detection is a core problem in human-centric scene understanding. Its goal is to infer <human, verb, object> triplets that describe how humans interact with objects. Previous works mainly determine the interaction of each human-object pair by joint inference over multiple features. In this paper, we design a more discriminative representation of the human-object pair and a more effective HOI detection model. First, we use human poses as an attention mechanism to strengthen features, which is a novel way to exploit human poses in HOI detection. Second, to represent objects more effectively, we encode each object as a word vector and use a graph convolution network to capture human-object relation features from the object word vectors and the human appearance features. These relation features are also strengthened by the human pose attention mechanism. Our model yields favorable results compared with state-of-the-art HOI detection algorithms on two large-scale benchmark datasets, V-COCO and HICO-DET.
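The two ideas named in the abstract, pose-guided attention and a graph convolution over human appearance features and object word vectors, can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the abstract gives no layer details, so the softmax-weighted pooling, the two-node graph, the symmetric normalization, and all names and sizes (`pose_attention`, `graph_conv`, `C`, `D`, the random inputs) are assumptions chosen only for illustration. For brevity the attention is applied to the human feature only, whereas the abstract also applies it to the relation features.

```python
import numpy as np


def pose_attention(features, pose_heatmap):
    """Pool a (C, H, W) feature map using a (H, W) pose heatmap as attention."""
    # Softmax over all spatial positions turns the heatmap into weights that sum to 1.
    weights = np.exp(pose_heatmap - pose_heatmap.max())
    weights /= weights.sum()
    # Attention-weighted spatial pooling yields one C-dimensional vector.
    return (features * weights[None, :, :]).sum(axis=(1, 2))


def graph_conv(node_feats, adjacency, weight):
    """One graph convolution layer: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ node_feats @ weight, 0.0)


# Hypothetical sizes and random stand-ins for real detector outputs.
rng = np.random.default_rng(0)
C, H, W, D = 8, 4, 4, 8  # C == D so both nodes share one feature size
human_map = rng.standard_normal((C, H, W))   # human appearance feature map
pose_heatmap = rng.standard_normal((H, W))   # stand-in for a keypoint heatmap
object_word_vec = rng.standard_normal(D)     # stand-in for an object word vector

human_feat = pose_attention(human_map, pose_heatmap)
nodes = np.stack([human_feat, object_word_vec])   # two-node graph: human, object
adjacency = np.array([[0.0, 1.0], [1.0, 0.0]])    # single human-object edge
weight = rng.standard_normal((D, 16)) * 0.1       # untrained layer weights
relation = graph_conv(nodes, adjacency, weight)
print(relation.shape)  # (2, 16)
```

In a trained model the random arrays would be replaced by detector features, estimated keypoint heatmaps, and pretrained word embeddings, and the layer weights would be learned jointly with the verb classifier.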

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable and insightful comments on an earlier version of this manuscript.

Author information

Corresponding author

Correspondence to Hong-Bo Zhang.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by the Natural Science Foundation of China [Nos. 61871196, 62001176, 61902330 and 61673186]; the National Key Research and Development Program of China [No. 2019YFC1604700]; the Natural Science Foundation of Fujian Province of China [Nos. 2019J01082 and 2020J01085]; and the Promotion Program for Young and Middle-aged Teacher in Science and Technology Research of Huaqiao University [ZQN-YX601].

About this article

Cite this article

Deng, WM., Zhang, HB., Lei, Q. et al. Pose attention and object semantic representation-based human-object interaction detection network. Multimed Tools Appl 81, 39453–39470 (2022). https://doi.org/10.1007/s11042-022-13146-x

  • DOI: https://doi.org/10.1007/s11042-022-13146-x
