DOI: 10.1145/3581783.3611945

Foreground/Background-Masked Interaction Learning for Spatio-temporal Action Detection

Published: 27 October 2023

Abstract

Spatio-temporal Action Detection (SAD) aims to recognize multi-class actions and simultaneously localize their spatio-temporal occurrences in untrimmed videos. In addition to modeling the inherent inter-actor interactions, most previous SAD approaches model interactions between multiple actors and either whole frames or specific parts (e.g., objects/hands). However, such approaches are suboptimal: they either 1) crudely treat all actors as interacting equivalently with frames/parts, or 2) expensively rely on multiple detectors to acquire the specific parts. To resolve this dilemma, we propose a novel Foreground/Background-masked Interaction Learning (dubbed FBI Learning) framework that learns multi-actor features by attentively interacting with the readily obtainable foreground and background regions of frames. Specifically, we first design a new Mask-guided Cross Attention (MCA) mechanism that computes masked cross-attention to capture compact relations between actors and foreground/background regions. Next, we present a new Actor-guided Feature Aggregation (AFA) scheme that integrates foreground- and background-interacted actor features with learnable actor-based weights. Finally, we construct a long-term feature bank that associates temporal context information to facilitate action classification. Extensive experiments on the publicly available UCF101-24, MultiSports, and AVA v2.1/v2.2 datasets demonstrate the competitive performance of FBI Learning against state-of-the-art methods.
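To make the two components described above concrete, the following is a minimal, self-contained PyTorch sketch of mask-guided cross attention (MCA) and actor-guided aggregation (AFA) as the abstract characterizes them. All class names, shapes, and the shared-weight attention module are illustrative assumptions, not the authors' code; the actual FBI Learning implementation (multi-head attention, normalization, the long-term feature bank, etc.) is given in the paper.

```python
# A hedged sketch of MCA/AFA based only on the abstract's description.
import torch
import torch.nn as nn


class MaskGuidedCrossAttention(nn.Module):
    """Cross-attention from actor queries to frame features, restricted
    by a binary region mask (1 = positions to attend, 0 = suppressed)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, actors, frames, region_mask):
        # actors: (N, d), frames: (HW, d), region_mask: (HW,) in {0, 1}
        q, k, v = self.q(actors), self.k(frames), self.v(frames)
        logits = q @ k.t() * self.scale              # (N, HW)
        # Mask out positions outside the chosen region before softmax,
        # so each actor attends only to foreground OR background.
        logits = logits.masked_fill(region_mask == 0, float("-inf"))
        return logits.softmax(dim=-1) @ v            # (N, d)


class ActorGuidedAggregation(nn.Module):
    """Fuse foreground- and background-interacted actor features with
    weights predicted from the actor features themselves."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(d_model, 2)

    def forward(self, actors, fg_feat, bg_feat):
        w = self.gate(actors).softmax(dim=-1)        # (N, 2) per-actor weights
        return w[:, :1] * fg_feat + w[:, 1:] * bg_feat


# Usage on dummy tensors: 3 actors, a 14x14 feature map, 256-d features.
d, n, hw = 256, 3, 14 * 14
actors, frames = torch.randn(n, d), torch.randn(hw, d)
fg_mask = (torch.rand(hw) > 0.5).long()              # stand-in foreground mask
mca = MaskGuidedCrossAttention(d)                    # shared weights here is a
fg = mca(actors, frames, fg_mask)                    # simplification; the paper
bg = mca(actors, frames, 1 - fg_mask)                # may use separate branches
fused = ActorGuidedAggregation(d)(actors, fg, bg)    # (3, 256)
```

In the full framework, the fused actor features would additionally be associated with a long-term feature bank of temporal context before action classification; that part is omitted from this sketch.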

Cited By

  • Leveraging Multimodal Knowledge for Spatio-Temporal Action Localization. In 2024 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 1-5. DOI: 10.1109/ICMEW63481.2024.10645431. Online publication date: 15 July 2024.
  • End-to-End Spatio-Temporal Action Localisation with Video Transformers. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18373-18383. DOI: 10.1109/CVPR52733.2024.01739. Online publication date: 16 June 2024.

Information

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

    Publisher

Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. 3d convolutional neural network
    2. actor-guided feature aggregation
    3. mask-guided attention
    4. video spatio-temporal action detection

    Qualifiers

    • Research-article

    Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
