DOI: 10.1145/3581783.3612206

Exploring Coarse-to-Fine Action Token Localization and Interaction for Fine-grained Video Action Recognition

Published: 27 October 2023

Abstract

Vision transformers have achieved impressive performance for video action recognition due to their strong capability of modeling long-range dependencies among spatio-temporal tokens. However, for fine-grained actions, the subtle and discriminative differences mainly exist in actor regions; directly applying vision transformers without removing irrelevant tokens compromises recognition performance and incurs high computational costs. In this paper, we propose a coarse-to-fine action token localization and interaction network, namely C2F-ALIN, which dynamically localizes the most informative tokens at a coarse granularity and then partitions these localized tokens to a fine granularity for sufficient fine-grained spatio-temporal interaction. Specifically, in the coarse stage, we devise a discriminative token localization module to accurately identify informative tokens and discard irrelevant ones, where each localized token corresponds to a large spatial region, thus effectively preserving the continuity of action regions. In the fine stage, we further partition only the localized tokens obtained in the coarse stage into a finer granularity and then characterize fine-grained token interactions in two aspects: (1) using vanilla transformers to learn compact dependencies among all discriminative tokens; and (2) proposing a global contextual interaction module which enables each fine-grained token to communicate with all the spatio-temporal tokens and to embed the global context. As a result, our coarse-to-fine strategy is able to identify more relevant tokens and integrate global context for high recognition accuracy while maintaining high efficiency. Comprehensive experimental results on four widely used action recognition benchmarks, including FineGym, Diving48, Kinetics and Something-Something, clearly demonstrate the advantages of our proposed method in comparison with other state-of-the-art ones.
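
To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the two ideas it names: coarse token localization followed by fine partitioning, and a global contextual interaction step that lets fine tokens attend to all spatio-temporal tokens. This is an illustrative reading of the abstract only, not the authors' released implementation; all module and parameter names (score_head, top_ratio, fine_factor, and so on) are assumptions.

```python
# Hypothetical sketch of coarse-to-fine token selection and global context
# interaction, based solely on the abstract's description.
import torch
import torch.nn as nn


class CoarseToFineTokenSelector(nn.Module):
    """Coarse stage: score tokens, keep the top-k most informative ones, and
    partition each kept token into finer tokens (approximated here by a
    learned projection and reshape)."""

    def __init__(self, dim: int, top_ratio: float = 0.25, fine_factor: int = 2):
        super().__init__()
        self.score_head = nn.Linear(dim, 1)      # per-token informativeness score
        self.top_ratio = top_ratio               # fraction of coarse tokens to keep
        self.fine_factor = fine_factor           # each kept token -> fine_factor**2 fine tokens
        self.fine_proj = nn.Linear(dim, dim * fine_factor * fine_factor)

    def forward(self, coarse_tokens: torch.Tensor):
        # coarse_tokens: (batch, num_coarse_tokens, dim)
        b, n, d = coarse_tokens.shape
        scores = self.score_head(coarse_tokens).squeeze(-1)            # (b, n)
        k = max(1, int(n * self.top_ratio))
        top_idx = scores.topk(k, dim=1).indices                        # (b, k)
        kept = torch.gather(coarse_tokens, 1,
                            top_idx.unsqueeze(-1).expand(-1, -1, d))   # (b, k, d)
        # Fine stage input: each kept coarse token expanded into finer tokens.
        fine = self.fine_proj(kept).reshape(b, k * self.fine_factor ** 2, d)
        return fine, top_idx


class GlobalContextInteraction(nn.Module):
    """Fine stage, aspect (2): every fine token attends to all coarse
    spatio-temporal tokens so that it can embed the global context."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fine_tokens: torch.Tensor, all_coarse_tokens: torch.Tensor):
        ctx, _ = self.cross_attn(query=fine_tokens,
                                 key=all_coarse_tokens,
                                 value=all_coarse_tokens)
        return self.norm(fine_tokens + ctx)       # residual connection + normalization


if __name__ == "__main__":
    frames_tokens = torch.randn(2, 8 * 196, 768)    # e.g. 8 frames of 14x14 coarse tokens
    selector = CoarseToFineTokenSelector(dim=768)
    context = GlobalContextInteraction(dim=768)
    fine, _ = selector(frames_tokens)               # (2, 1568, 768) with the default ratios
    fine = context(fine, frames_tokens)             # fine tokens enriched with global context
    # Aspect (1), compact dependencies among discriminative tokens, would be
    # standard transformer blocks applied to `fine` before classification.
    print(fine.shape)
```

Under these assumptions, the discarded coarse tokens never enter the fine stage, which is where the computational savings come from, while the cross-attention to all coarse tokens restores the global view that token pruning would otherwise lose.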


    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. action recognition
    2. fine-grained
    3. token localization and interaction
    4. vision transformer

    Qualifiers

    • Research-article

    Conference

    MM '23
    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
