DOI: 10.1145/3581783.3611945

Foreground/Background-Masked Interaction Learning for Spatio-temporal Action Detection

Published: 27 October 2023

Abstract

Spatio-temporal Action Detection (SAD) aims to recognize multi-class actions and simultaneously localize their spatio-temporal occurrences in untrimmed videos. In addition to modeling the inherent inter-actor interactions, most previous SAD approaches model interactions between multiple actors and either whole frames or specific parts (e.g., objects/hands). However, such approaches are suboptimal: they either 1) crudely treat all actors as interacting equivalently with frames/parts, or 2) expensively rely on multiple detectors to acquire the specific parts. To resolve this dilemma, we propose a novel Foreground/Background-masked Interaction Learning (dubbed FBI Learning) framework that learns multi-actor features by attentively interacting with the readily obtainable foreground and background regions of frames. Specifically, we first design a new Mask-guided Cross Attention (MCA) mechanism that computes masked cross-attention to capture compact relations between actors and foreground/background regions. Next, we present a new Actor-guided Feature Aggregation (AFA) scheme that integrates foreground- and background-interacted actor features with learnable actor-based weights. Finally, we construct a long-term feature bank that associates temporal context information to facilitate action classification. Extensive experiments on the publicly available UCF101-24, MultiSports, and AVA v2.1/v2.2 datasets demonstrate the competitive performance of FBI Learning against state-of-the-art methods.
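To make the two components described above concrete, the following is a minimal, self-contained PyTorch sketch of mask-guided cross attention (MCA) and actor-guided aggregation (AFA) as the abstract characterizes them. All class names, shapes, and the shared-weight attention module are illustrative assumptions, not the authors' code; the actual FBI Learning implementation (multi-head attention, normalization, the long-term feature bank, etc.) is given in the paper.

```python
# A hedged sketch of MCA/AFA based only on the abstract's description.
import torch
import torch.nn as nn


class MaskGuidedCrossAttention(nn.Module):
    """Cross-attention from actor queries to frame features, restricted
    by a binary region mask (1 = positions to attend, 0 = suppressed)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, actors, frames, region_mask):
        # actors: (N, d), frames: (HW, d), region_mask: (HW,) in {0, 1}
        q, k, v = self.q(actors), self.k(frames), self.v(frames)
        logits = q @ k.t() * self.scale              # (N, HW)
        # Mask out positions outside the chosen region before softmax,
        # so each actor attends only to foreground OR background.
        logits = logits.masked_fill(region_mask == 0, float("-inf"))
        return logits.softmax(dim=-1) @ v            # (N, d)


class ActorGuidedAggregation(nn.Module):
    """Fuse foreground- and background-interacted actor features with
    weights predicted from the actor features themselves."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(d_model, 2)

    def forward(self, actors, fg_feat, bg_feat):
        w = self.gate(actors).softmax(dim=-1)        # (N, 2) per-actor weights
        return w[:, :1] * fg_feat + w[:, 1:] * bg_feat


# Usage on dummy tensors: 3 actors, a 14x14 feature map, 256-d features.
d, n, hw = 256, 3, 14 * 14
actors, frames = torch.randn(n, d), torch.randn(hw, d)
fg_mask = (torch.rand(hw) > 0.5).long()              # stand-in foreground mask
mca = MaskGuidedCrossAttention(d)                    # shared weights here is a
fg = mca(actors, frames, fg_mask)                    # simplification; the paper
bg = mca(actors, frames, 1 - fg_mask)                # may use separate branches
fused = ActorGuidedAggregation(d)(actors, fg, bg)    # (3, 256)
```

In the full framework, the fused actor features would additionally be associated with a long-term feature bank of temporal context before action classification; that part is omitted from this sketch.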

Cited By

  • Leveraging Multimodal Knowledge for Spatio-Temporal Action Localization. In 2024 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 1-5. DOI: 10.1109/ICMEW63481.2024.10645431. Online publication date: 15 July 2024.
  • End-to-End Spatio-Temporal Action Localisation with Video Transformers. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18373-18383. DOI: 10.1109/CVPR52733.2024.01739. Online publication date: 16 June 2024.

Information

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

    Publisher

Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. 3d convolutional neural network
    2. actor-guided feature aggregation
    3. mask-guided attention
    4. video spatio-temporal action detection

    Qualifiers

    • Research-article

    Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
