DOI: 10.1145/3581783.3612455

MTSN: Multiscale Temporal Similarity Network for Temporal Action Localization

Published: 27 October 2023

Abstract

Temporal Action Localization (TAL) aims to predict the categories and temporal segments of all action instances in untrimmed videos, a critical and challenging task in video understanding. The performance of existing TAL methods remains unsatisfactory due to the lack of highly effective temporal modeling and refined action-proposal decoding. In this paper, we propose the Multiscale Temporal Similarity Network (MTSN), a novel one-stage method for TAL that mainly benefits from dynamic complementary modeling and temporal similarity decoding. Specifically, we first design Dynamic Complementary Context Aggregation (DCCA), a Transformer-based encoder. DCCA performs both long-range and short-range temporal modeling through attention heads with different interaction ranges at each feature-pyramid level, while higher-level semantic representations are dynamically complemented with short-range detail information. Moreover, a Temporal Similarity Mask (TSM) module is designed to generate masks through an optimized, globally aware decoding process, comprising similarity cross-modeling, region-aware optimization, and a multiscale aggregated residual, which leads to high-quality action proposals. We conduct extensive experiments on two major TAL benchmarks, THUMOS14 and ActivityNet-1.3, where our method establishes a new state of the art and significantly outperforms the previous best methods. Without bells and whistles, MTSN achieves an average mAP of 72.1% (+5.3%) on THUMOS14 and 40.7% (+3.1%) on ActivityNet-1.3, crossing the 40% average-mAP mark for the first time.
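The abstract does not include implementation details, but the core DCCA idea it describes, a single attention layer whose heads operate over different temporal interaction ranges, can be illustrated with a minimal sketch. This is a hypothetical illustration, not the authors' code: the banded-mask formulation, the function names, and the per-head window configuration are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def banded_mask(T, window):
    # True where attention is allowed: |i - j| <= window (short-range head);
    # window=None means unrestricted (long-range head).
    if window is None:
        return np.ones((T, T), dtype=bool)
    idx = np.arange(T)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def multi_range_attention(x, heads):
    """x: (T, D) sequence of frame features at one pyramid level.
    heads: one temporal window per head (int or None).
    Each head self-attends only within its own interaction range;
    head outputs are concatenated along the channel axis."""
    T, D = x.shape
    d = D // len(heads)  # channels per head
    outs = []
    for h, window in enumerate(heads):
        q = k = v = x[:, h * d:(h + 1) * d]
        scores = q @ k.T / np.sqrt(d)
        # Disallow interactions outside this head's temporal range.
        scores = np.where(banded_mask(T, window), scores, -1e9)
        outs.append(softmax(scores) @ v)
    return np.concatenate(outs, axis=1)
```

Here a head with a small `window` captures short-range detail while a `window=None` head models long-range context in the same layer; the dynamic cross-level complementation across the feature pyramid that DCCA performs is beyond this sketch.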


Cited By

  • (2025) Pseudo label refining for semi-supervised temporal action localization. PLOS ONE 20(2): e0318418, 5 Feb 2025. DOI: 10.1371/journal.pone.0318418


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. deep neural network
    2. temporal action localization
    3. video understanding

    Qualifiers

    • Research-article

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
