DOI: 10.1145/3688867.3690169
research-article

Spatial and Channel Squeeze & Excitation in Adapting Vision Transformers for Temporal Action Localization

Published: 28 October 2024

Abstract

Transformer-based methods have achieved impressive performance on temporal action localization (TAL). Although this success is commonly attributed to the multi-headed self-attention (MSA) mechanism, a systematic understanding of it is still lacking. Specifically, each attention head focuses on a different combination of objects, so MSA can be interpreted as a mixture of information bottlenecks. However, recent studies show that self-attention may promote robustness through improved mid-level representations. Inspired by this finding, we propose an effective fine-tuning approach (AdaptMLP) that adapts pre-trained transformers to TAL tasks. We further propose a channel attention module (scSE) to strengthen this ability. Benefiting from AdaptMLP and scSE, we attain 36.3%/22.7% mAP on ActivityNet-1.3/EPIC-Kitchens 100, outperforming ActionFormer by 0.7%/0.8%.
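The abstract gives no implementation details, but the two named modules are well known in outline: AdaptMLP adds a small trainable bottleneck branch alongside a frozen transformer MLP, and scSE recalibrates features along both the channel and the temporal axis. The sketch below is illustrative only, assuming a 1-D (channel x time) feature map and randomly initialized weights; the function and parameter names are not from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adapt_mlp(x, w_down, w_up, scale=0.1):
    """AdaptMLP-style branch: a low-rank bottleneck MLP whose scaled output
    is added residually to the (frozen) backbone features x of shape (C, T)."""
    h = np.maximum(w_down @ x, 0.0)   # down-project channels, ReLU
    return x + scale * (w_up @ h)     # up-project and add back

def scse_1d(x, w1, w2, w_s):
    """Concurrent spatial and channel squeeze & excitation on x of shape (C, T)."""
    # Channel branch (cSE): squeeze time by average pooling, gate each channel.
    z = x.mean(axis=1)                            # (C,)
    g_c = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))   # (C,) channel gates in (0, 1)
    x_cse = x * g_c[:, None]
    # Spatial branch (sSE): squeeze channels with a 1x1 projection, gate each step.
    g_t = sigmoid(w_s @ x)                        # (T,) per-step gates in (0, 1)
    x_sse = x * g_t[None, :]
    # Merge the two recalibrated maps element-wise.
    return np.maximum(x_cse, x_sse)
```

With C channels, T time steps, and bottleneck width r, `w_down` and `w1` are (r, C), `w_up` and `w2` are (C, r), and `w_s` is (C,). Since both gates lie in (0, 1), scSE never amplifies a feature beyond its input magnitude.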



    Published In

    McGE '24: Proceedings of the 2nd International Workshop on Multimedia Content Generation and Evaluation: New Methods and Practice
    October 2024, 77 pages
    ISBN: 9798400711947
    DOI: 10.1145/3688867
    Program Chairs: Cheng Jin, Liang He, Mingli Song, Rui Wang

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. adaptmlp
    2. channel attention
    3. temporal action localization
    4. transformer


    Funding Sources

    • the National Natural Science Foundation of China
    • the Fujian Provincial Natural Science Foundation of China

    Conference

    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne, VIC, Australia
