Human-Object Interaction Detection: A Survey of Deep Learning-Based Methods

Li, Fang; Wang, Shunli; Wang, Shuaiping; Zhang, Lihua

doi:10.1007/978-3-031-20497-5_36

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13604)

Included in the following conference series:

CAAI International Conference on Artificial Intelligence

Abstract

In recent years, rapid progress has been made in detecting and recognizing individual object instances. To understand what is happening in a scene, however, computers also need to recognize how humans interact with surrounding objects. Human-object interaction (HOI) detection aims to identify the set of interactions in an image or video. It involves localizing the interacting subjects and objects and classifying the interaction types. It is crucial for high-level semantic understanding of human-centered scenes, and its study also benefits research on other high-level vision tasks. In this paper, we first review previous deep learning-based work on HOI detection, organized by its two primary development trends: sequential and parallel methods. Second, we summarize the main challenges faced by the HOI detection task. Third, we introduce the most popular HOI detection datasets, covering both image and video datasets, and the main evaluation metrics. Finally, we summarize future research directions for the HOI detection task.
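
As an illustration only, the task described in the abstract (localizing a human and an object, then classifying their interaction) can be sketched as a small data structure plus a matching check. This is a minimal sketch and not the method of any surveyed paper: the names `HOIInstance`, `iou`, and `is_match` are hypothetical, and the 0.5 overlap threshold is an assumption based on common practice in HOI benchmarks rather than something stated in this abstract.

```python
from dataclasses import dataclass
from typing import Tuple

# Axis-aligned box as (x1, y1, x2, y2); this layout is an assumption.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class HOIInstance:
    """One human-object interaction: two boxes plus the interaction class."""
    human_box: Box
    object_box: Box
    object_label: str
    verb: str
    score: float = 1.0


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: HOIInstance, gt: HOIInstance, thr: float = 0.5) -> bool:
    """A prediction counts as correct only if the interaction class agrees
    and BOTH boxes overlap the ground truth by at least `thr` (assumed 0.5)."""
    return (
        pred.verb == gt.verb
        and pred.object_label == gt.object_label
        and iou(pred.human_box, gt.human_box) >= thr
        and iou(pred.object_box, gt.object_box) >= thr
    )
```

Requiring both boxes to overlap, rather than only the object box, reflects that an HOI prediction is a joint statement about who interacts with what.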

This work is supported by Shanghai Municipal Science and Technology Major Project (2021SHZDZX0103) and National Key R&D Program of China (2021ZD0113503).

Author information

Authors and Affiliations

Academy for Engineering and Technology, Fudan University, Shanghai, China
Fang Li, Shunli Wang, Shuaiping Wang & Lihua Zhang
Engineering Research Center of AI and Robotics, Ministry of Education, Beijing, China
Fang Li, Shunli Wang, Shuaiping Wang & Lihua Zhang
Jilin Provincial Key Laboratory of Intelligence Science and Engineering, Changchun, China
Lihua Zhang
Artificial Intelligence and Unmanned Systems Engineering Research Center of Jilin Province, Changchun, China
Lihua Zhang

Corresponding author

Correspondence to Lihua Zhang.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Lu Fang
Xiaomi Inc., Beijing, China
Daniel Povey
Shanghai Jiao Tong University, Shanghai, China
Guangtao Zhai
JD Explore Academy, Beijing, China
Tao Mei
Chinese Academy of Sciences, Beijing, China
Ruiping Wang

Cite this paper

Li, F., Wang, S., Wang, S., Zhang, L. (2022). Human-Object Interaction Detection: A Survey of Deep Learning-Based Methods. In: Fang, L., Povey, D., Zhai, G., Mei, T., Wang, R. (eds) Artificial Intelligence. CICAI 2022. Lecture Notes in Computer Science, vol 13604. Springer, Cham. https://doi.org/10.1007/978-3-031-20497-5_36

DOI: https://doi.org/10.1007/978-3-031-20497-5_36
Published: 17 December 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20496-8
Online ISBN: 978-3-031-20497-5
eBook Packages: Computer Science, Computer Science (R0)
