Research Article
DOI: 10.1145/3581783.3611813

Human-Object-Object Interaction: Towards Human-Centric Complex Interaction Detection

Published: 27 October 2023

Abstract

Localizing and recognizing interactive actions in videos is a pivotal yet challenging task that paves the way toward deep video comprehension. Recent advances in Human-Object Interaction (HOI) detection, which localizes and recognizes interactions between human-object pairs, have marked significant progress. However, human-object-object interaction, an essential aspect of real-world industrial applications, remains largely unexplored. In this paper, we introduce a new task, Human-Object-Object Interaction (HOOI) detection, and present a method for it named the Human-Object-Object Interaction Network (H2O-Net). H2O-Net comprises two principal modules: sequential motion feature extraction and HOOI modeling. The former captures the gradually evolving visual characteristics of entities throughout the HOOI process, extracting spatial-temporal features across multiple fine-grained partitions. The latter models HOOI actions through the interactions between entities: it first captures and fuses two sub-interaction features into a comprehensive HOOI feature, then refines that feature with the interaction cues embedded in the long-term global context. Furthermore, we contribute a new video dataset, dubbed the HOOI dataset, to the research community. Its actions cover key operational behaviors in industrial manufacturing, giving it substantial application potential and making it a valuable addition to existing interaction action detection datasets. Experiments on the proposed HOOI dataset and the widely used AVA dataset show that our method outperforms state-of-the-art approaches by 6.16 mAP and 1.9 mAP, respectively, demonstrating its effectiveness.
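The abstract pins the method to two cooperating modules, so a compact sketch helps make the data flow concrete. The PyTorch code below is a minimal illustration of that structure, not the authors' implementation: every module name, dimension, pooling choice, and the use of cross-attention for long-term context refinement are assumptions inferred only from the abstract's description (partition-wise spatio-temporal pooling; two sub-interaction features, human-object and object-object, fused and then refined against a global context memory).

```python
# Hypothetical sketch of the two-module H2O-Net structure described in the
# abstract. All names, dimensions, and fusion choices are illustrative
# assumptions, not the published implementation.
import torch
import torch.nn as nn


class SequentialMotionFeatures(nn.Module):
    """Pools per-entity spatio-temporal features over fine-grained temporal
    partitions (assumed: mean pooling per partition, then a projection)."""

    def __init__(self, dim: int, num_partitions: int = 4):
        super().__init__()
        self.num_partitions = num_partitions
        self.proj = nn.Linear(dim * num_partitions, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, dim); time is assumed divisible by num_partitions.
        chunks = feats.chunk(self.num_partitions, dim=1)
        pooled = torch.cat([c.mean(dim=1) for c in chunks], dim=-1)
        return self.proj(pooled)  # (batch, dim)


class HOOIModeling(nn.Module):
    """Fuses two sub-interactions (human-object and object-object) and
    refines the result against long-term global context via cross-attention."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.sub_ho = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.sub_oo = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.fuse = nn.Linear(2 * dim, dim)
        self.refine = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, human, obj1, obj2, context):
        # human/obj1/obj2: (batch, dim); context: (batch, mem_len, dim).
        s1 = self.sub_ho(torch.cat([human, obj1], dim=-1))  # human-object
        s2 = self.sub_oo(torch.cat([obj1, obj2], dim=-1))   # object-object
        hooi = self.fuse(torch.cat([s1, s2], dim=-1)).unsqueeze(1)
        refined, _ = self.refine(hooi, context, context)    # query global context
        return refined.squeeze(1)                           # (batch, dim)


class H2ONetSketch(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 14):
        super().__init__()
        # A single motion extractor is shared across entities for simplicity.
        self.motion = SequentialMotionFeatures(dim)
        self.hooi = HOOIModeling(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, human_seq, obj1_seq, obj2_seq, context):
        h = self.motion(human_seq)
        o1 = self.motion(obj1_seq)
        o2 = self.motion(obj2_seq)
        return self.classifier(self.hooi(h, o1, o2, context))


if __name__ == "__main__":
    net = H2ONetSketch()
    clip = lambda: torch.randn(2, 8, 256)  # (batch, time, dim) entity features
    memory = torch.randn(2, 32, 256)       # assumed long-term context bank
    print(net(clip(), clip(), clip(), memory).shape)  # torch.Size([2, 14])
```

Treating the long-term context as the key/value memory of a single cross-attention layer mirrors feature-bank-style refinement; the actual H2O-Net may use a different fusion or refinement mechanism.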


Cited By

  • (2024) From Category to Scenery: An End-to-End Framework for Multi-person Human-Object Interaction Recognition in Videos. Pattern Recognition, pp. 262-277. DOI: 10.1007/978-3-031-78354-8_17. Online publication date: 4-Dec-2024


    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. action detection
    2. human-object-object interaction
    3. interaction model
    4. temporal context

    Qualifiers

    • Research-article

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)

    Article Metrics

    • Downloads (last 12 months): 109
    • Downloads (last 6 weeks): 13

    Reflects downloads up to 16 Feb 2025
