Abstract
In recent years, rapid progress has been made in detecting and recognizing individual object instances. To understand what is happening in a scene, however, computers must also recognize how humans interact with surrounding objects. Human-object interaction (HOI) detection aims to identify the set of interactions in images or videos. It involves localizing the interacting subjects and objects and classifying the type of interaction. HOI detection is crucial for high-level semantic understanding of human-centered scenes, and its study also benefits research on other high-level vision tasks. In this paper, we first review previous deep learning-based work on HOI detection, organized along the two primary development trends of sequential and parallel methods. Second, we summarize the main challenges facing the HOI detection task. Third, we introduce the most popular HOI detection datasets, covering both image and video datasets, and the main evaluation metrics. Finally, we outline future research directions for the HOI detection task.
This work is supported by the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0103) and the National Key R&D Program of China (2021ZD0113503).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, F., Wang, S., Wang, S., Zhang, L. (2022). Human-Object Interaction Detection: A Survey of Deep Learning-Based Methods. In: Fang, L., Povey, D., Zhai, G., Mei, T., Wang, R. (eds) Artificial Intelligence. CICAI 2022. Lecture Notes in Computer Science, vol 13604. Springer, Cham. https://doi.org/10.1007/978-3-031-20497-5_36
DOI: https://doi.org/10.1007/978-3-031-20497-5_36
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20496-8
Online ISBN: 978-3-031-20497-5
eBook Packages: Computer Science (R0)