DOI: 10.1145/3581783.3612455

MTSN: Multiscale Temporal Similarity Network for Temporal Action Localization

Published: 27 October 2023

Abstract

Temporal Action Localization (TAL) aims to predict the categories and temporal segments of all action instances in untrimmed videos, a critical and challenging task in video understanding. The performance of existing TAL methods remains unsatisfactory due to the lack of highly effective temporal modeling and refined action-proposal decoding. In this paper, we propose the Multiscale Temporal Similarity Network (MTSN), a novel one-stage method for TAL that mainly benefits from dynamic complementary modeling and temporal similarity decoding. Specifically, we first design Dynamic Complementary Context Aggregation (DCCA), a Transformer-based encoder. DCCA performs both long-range and short-range temporal modeling through attention heads with different interaction ranges at each feature-pyramid level, while higher-level semantic representations are dynamically complemented with short-range detail information. Moreover, a Temporal Similarity Mask (TSM) module is designed to generate masks through an optimized, globally aware decoding process, comprising similarity cross-modeling, region-aware optimization, and a multiscale aggregated residual, which leads to high-quality action proposals. We conduct extensive experiments on two major TAL benchmarks, THUMOS14 and ActivityNet-1.3, where our method establishes a new state of the art and significantly outperforms the previous best methods. Without bells and whistles, MTSN achieves an average mAP of 72.1% (+5.3%) on THUMOS14 and 40.7% (+3.1%) on ActivityNet-1.3, crossing the 40% average-mAP mark for the first time.
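The abstract does not include implementation details, but the core DCCA idea it describes, a single attention layer whose heads operate over different temporal interaction ranges, can be illustrated with a minimal sketch. This is a hypothetical illustration, not the authors' code: the banded-mask formulation, the function names, and the per-head window configuration are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def banded_mask(T, window):
    # True where attention is allowed: |i - j| <= window (short-range head);
    # window=None means unrestricted (long-range head).
    if window is None:
        return np.ones((T, T), dtype=bool)
    idx = np.arange(T)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def multi_range_attention(x, heads):
    """x: (T, D) sequence of frame features at one pyramid level.
    heads: one temporal window per head (int or None).
    Each head self-attends only within its own interaction range;
    head outputs are concatenated along the channel axis."""
    T, D = x.shape
    d = D // len(heads)  # channels per head
    outs = []
    for h, window in enumerate(heads):
        q = k = v = x[:, h * d:(h + 1) * d]
        scores = q @ k.T / np.sqrt(d)
        # Disallow interactions outside this head's temporal range.
        scores = np.where(banded_mask(T, window), scores, -1e9)
        outs.append(softmax(scores) @ v)
    return np.concatenate(outs, axis=1)
```

Here a head with a small `window` captures short-range detail while a `window=None` head models long-range context in the same layer; the dynamic cross-level complementation across the feature pyramid that DCCA performs is beyond this sketch.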


Cited By

  • (2025) Pseudo label refining for semi-supervised temporal action localization. PLOS ONE 20(2): e0318418, 5 Feb 2025. DOI: 10.1371/journal.pone.0318418


Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. deep neural network
    2. temporal action localization
    3. video understanding

    Qualifiers

    • Research-article

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
