Research Article · DOI: 10.1145/3664647.3688974

End-to-end Spatio-Temporal Information Aggregation For Micro-Action Detection

Published: 28 October 2024

Abstract

Micro-actions convey the emotions of characters in daily communication and offer richer semantic information than conventional actions. Accurately detecting these micro-actions is essential for video understanding. Because of their short duration, low intensity, and high overlap, micro-actions require more detailed video features, which makes accurate detection significantly more challenging. To address these challenges, we propose the 3D-SENet Adapter, which aggregates spatio-temporal information and enables end-to-end online video feature learning. We also find that incorporating background information substantially improves the detection of small-scale micro-actions. We therefore develop the Cross-Attention Aggregation Detection Head, which integrates multi-scale features within the feature pyramid, thereby improving the detection accuracy of micro-actions that occupy small regions of video frames. Our approach achieves first place in the Multi-label Micro-Action Detection (MMAD) track and second place in the Micro-Action Recognition (MAR) track of the Micro-Action Analysis Grand Challenge.
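To make the first component concrete, below is a minimal sketch of what a squeeze-and-excitation block extended to 3D might look like when wrapped in a bottleneck adapter, assuming a PyTorch backbone that exposes (B, C, T, H, W) feature maps. All module names, shapes, and hyperparameters here are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class SENet3DAdapter(nn.Module):
    """Sketch: bottleneck adapter gated by 3D squeeze-and-excitation.

    Channel attention is computed from a global spatio-temporal
    descriptor and applied inside a low-rank adapter, so a (frozen)
    video backbone can be tuned end-to-end with few parameters.
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction
        self.down = nn.Conv3d(channels, hidden, kernel_size=1)
        self.act = nn.GELU()
        self.up = nn.Conv3d(hidden, channels, kernel_size=1)
        # Squeeze: collapse (T, H, W) into one descriptor per channel.
        self.squeeze = nn.AdaptiveAvgPool3d(1)
        # Excitation: per-channel gate from the squeezed descriptor.
        self.excite = nn.Sequential(
            nn.Conv3d(hidden, hidden, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) spatio-temporal features from the backbone.
        h = self.act(self.down(x))
        h = h * self.excite(self.squeeze(h))  # broadcast gate over T, H, W
        return x + self.up(h)                 # residual keeps pretrained features
```

With the residual connection, the adapter is near-identity at initialization, so only the lightweight adapter parameters need training for end-to-end adaptation.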
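Similarly, the following hedged sketch shows one plausible form of cross-attention aggregation over a temporal feature pyramid: each pyramid level queries a memory concatenated from all levels, so fine-grained levels can draw on coarse-scale background context. The head structure, num_classes, and all names are placeholders for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn


class CrossAttnAggregationHead(nn.Module):
    """Sketch: each pyramid level cross-attends to all levels before prediction."""

    def __init__(self, dim: int, num_classes: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.cls_head = nn.Linear(dim, num_classes)  # per-timestep class logits
        self.reg_head = nn.Linear(dim, 2)            # distances to start / end

    def forward(self, pyramid):
        # pyramid: list of (B, T_l, C) temporal features, one per scale.
        memory = torch.cat(pyramid, dim=1)           # (B, sum(T_l), C): all scales,
                                                     # including coarse background context
        cls_out, reg_out = [], []
        for feat in pyramid:
            ctx, _ = self.attn(feat, memory, memory)  # queries from one level,
                                                      # keys/values from every level
            feat = self.norm(feat + ctx)
            cls_out.append(self.cls_head(feat))
            reg_out.append(self.reg_head(feat))
        return cls_out, reg_out
```

Concatenating every level into the key/value memory is one simple way to expose background context to queries for small-region micro-actions; the paper's actual aggregation may differ in detail.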



    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024, 11719 pages
    ISBN: 9798400706868
    DOI: 10.1145/3664647


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. attention mechanism
    2. end-to-end
    3. micro-action detection
    4. spatio-temporal information aggregation


    Funding Sources

    • Dreams Foundation of Jianghuai Advance Technology Center
    • Beijing Municipal Science & Technology Commission, Administrative Commission of Zhongguancun Science Park
    • Natural Science Foundation of China
    • Anhui Province Key Research and Development Program
    • National Aviation Science Foundation

    Conference

    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne VIC, Australia

    Acceptance Rates

    MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions, 26%
    Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
