DOI: 10.1145/3581783.3612167
research-article

A Novel Temporal Channel Enhancement and Contextual Excavation Network for Temporal Action Localization

Published: 27 October 2023

Abstract

The temporal action localization (TAL) task aims to locate and classify action instances in untrimmed videos. Most previous methods apply the classifier and the locator to the same feature, so the classification and localization processes run relatively independently of each other. Consequently, when the classification and localization results are fused, the classification may be correct while the localization is wrong (or vice versa), which makes the final result inaccurate. To solve this problem, we propose a novel temporal channel enhancement and contextual excavation network (TCN) for the TAL task, which generates robust classification and localization features and refines the final localization results. Specifically, a temporal channel enhancement module is designed to enhance the temporal and channel information of the feature sequence. Then, a temporal semantic contextual excavation module is developed to establish relationships between similar frames. Finally, the features with enhanced contextual information are passed to a classifier, and executing the classification process yields powerful classification features. Most importantly, from these robust classification features the refine localization module produces the final localization features, which are used to obtain the final localization results. Extensive experiments show that TCN outperforms all the SOTA methods on the THUMOS14 dataset and achieves comparable performance on the ActivityNet1.3 dataset. Compared with ActionFormer (ECCV 2022) and BREM (MM 2022) on the THUMOS14 dataset, the proposed TCN achieves improvements of 1.8% and 5.0%, respectively.
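The mismatch described above, where a prediction carries the right class but the wrong boundaries (or the reverse), can be made concrete with a small sketch. It is not part of the proposed TCN. It only assumes the common TAL convention of scoring a predicted segment by its temporal intersection over union (tIoU) with a ground-truth segment. The `Segment` type, the 0.5 threshold, the class names, and all timestamps are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """An action instance: start/end time in seconds plus a class label."""
    start: float
    end: float
    label: str


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals (labels are ignored)."""
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(pred: Segment, truth: Segment, threshold: float = 0.5) -> bool:
    """A prediction counts only if both the label and the boundaries are right."""
    return pred.label == truth.label and temporal_iou(pred, truth) >= threshold


# Hypothetical ground truth and predictions, for illustration only.
truth = Segment(10.0, 20.0, "HighJump")
well_localized = Segment(10.5, 19.0, "HighJump")
badly_localized = Segment(18.0, 30.0, "HighJump")  # right class, wrong boundaries
misclassified = Segment(10.5, 19.0, "LongJump")    # right boundaries, wrong class

for name, pred in [
    ("well_localized", well_localized),
    ("badly_localized", badly_localized),
    ("misclassified", misclassified),
]:
    print(name, round(temporal_iou(pred, truth), 2), is_true_positive(pred, truth))
```

Under these assumptions only the first prediction is counted as correct: the second fails on localization and the third on classification, which is the kind of disagreement between the two outputs that the abstract sets out to reduce.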

Supplemental Material

MP4 File
Paper 2074 presentation

References

[1]
Humam Alwassel, Silvio Giancola, and Bernard Ghanem. 2021. TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks. In IEEE/CVF International Conference on Computer Vision Workshops, ICCVW. 3166--3176.
[2]
Yueran Bai, Yingying Wang, Yunhai Tong, Yang Yang, Qiyue Liu, and Junhui Liu. 2020. Boundary Content Graph Neural Network for Temporal Action Proposal Generation. In Computer Vision - ECCV. 121--137.
[3]
Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. 2017. Soft-NMS - Improving Object Detection with One Line of Code. In IEEE International Conference on Computer Vision, ICCV. 5562--5570.
[4]
João Carreira and Andrew Zisserman. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR. 4724--4733.
[5]
Guo Chen, Yin-Dong Zheng, Limin Wang, and Tong Lu. 2022. DCAN: Improving Temporal Action Detection via Dual Context Aggregation. In Proceedings of the AAAI Conference on Artificial Intelligence.
[6]
Feng Cheng and Gedas Bertasius. 2022. TallFormer: Temporal Action Localization with a Long-Memory Transformer. In Computer Vision - ECCV. 503--521.
[7]
Xun Deng, Wenjie Wang, Fuli Feng, Hanwang Zhang, Xiangnan He, and Yong Liao. 2023. Counterfactual Active Learning for Out-of-Distribution Generalization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 11362--11377.
[8]
Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. SlowFast Networks for Video Recognition. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV. 6201--6210.
[9]
Jiyang Gao, Zhenheng Yang, Chen Sun, Kan Chen, and Ram Nevatia. 2017. TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals. In IEEE International Conference on Computer Vision, ICCV. 3648--3656.
[10]
Zan Gao, Xinglei Cui, Tao Zhuo, Zhiyong Cheng, An-An Liu, Meng Wang, and Shengyong Chen. 2023. A Multi-temporal Scale and Spatial-Temporal Transformer Network for Temporal Action Localization. IEEE Trans. Hum. Mach. Syst., Vol. 53, 3 (2023), 569--580.
[11]
Zan Gao, Leming Guo, Tongwei Ren, An-An Liu, Zhiyong Cheng, and Shengyong Chen. 2022. Pairwise Two-Stream ConvNets for Cross-Domain Action Recognition With Small Data. IEEE Trans. Neural Networks Learn. Syst., Vol. 33, 3 (2022), 1147--1161.
[12]
Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. ActivityNet: A large-scale video benchmark for human activity understanding. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR. 961--970.
[13]
Junshan Hu, Chaoxu Guo, Liansheng Zhuang, Biao Wang, Tiezheng Ge, Yuning Jiang, and Houqiang Li. 2022. Estimation of Reliable Proposal Quality for Temporal Action Detection. In MM '22: The 30th ACM International Conference on Multimedia. 6685--6695.
[14]
Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. 2014. THUMOS Challenge: Action Recognition with a Large Number of Classes. http://crcv.ucf.edu/THUMOS14/.
[15]
Tae-Kyung Kang, Gun-Hee Lee, Kyung-Min Jin, and Seong-Whan Lee. 2023. Action-aware Masking Network with Group-based Attention for Temporal Action Localization. In IEEE/CVF Winter Conference on Applications of Computer Vision, WACV. 6047--6056.
[16]
Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR.
[17]
Chuming Lin, Jian Li, Yabiao Wang, Ying Tai, Donghao Luo, Zhipeng Cui, Chengjie Wang, Jilin Li, Feiyue Huang, and Rongrong Ji. 2020b. Fast Learning of Temporal Action Proposal via Dense Boundary Generator. In Conference on Artificial Intelligence, AAAI. 11499--11506.
[18]
Chuming Lin, Chengming Xu, Donghao Luo, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Yanwei Fu. 2021. Learning Salient Boundary Feature for Anchor-free Temporal Action Localization. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR. 3320--3329.
[19]
Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. 2020a. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. (2020), 318--327.
[20]
Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, and Shilei Wen. 2019. BMN: Boundary-Matching Network for Temporal Action Proposal Generation. In IEEE/CVF International Conference on Computer Vision, ICCV. 3888--3897.
[21]
Tianwei Lin, Xu Zhao, and Zheng Shou. 2017. Single Shot Temporal Action Detection. In Proceedings of the ACM on Multimedia Conference, MM. 988--996.
[22]
Tianwei Lin, Xu Zhao, and Haisheng Su. 2020c. Joint Learning of Local and Global Context for Temporal Action Proposal Generation. IEEE Trans. Circuits Syst. Video Technol. (2020), 4899--4912.
[23]
Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang. 2018. BSN: Boundary Sensitive Network for Temporal Action Proposal Generation. In Computer Vision - ECCV. 3--21.
[24]
Huajun Liu, Fuqiang Liu, Xinyi Fan, and Dong Huang. 2021b. Polarized Self-Attention: Towards High-quality Pixel-wise Regression. CoRR, Vol. abs/2107.00782 (2021).
[25]
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. 2016. SSD: Single Shot MultiBox Detector. In Computer Vision - ECCV. 21--37.
[26]
Xiaolong Liu, Yao Hu, Song Bai, Fei Ding, Xiang Bai, and Philip H. S. Torr. 2021a. Multi-Shot Temporal Event Localization: A Benchmark. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR. 12596--12606.
[27]
Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Song Bai, and Xiang Bai. 2021c. End-to-end Temporal Action Detection with Transformer. CoRR (2021).
[28]
Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2022. Video Swin Transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR. 3192--3201.
[29]
Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian, Jiebo Luo, and Tao Mei. 2019. Gaussian Temporal Awareness Networks for Action Localization. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR. 344--353.
[30]
Sauradip Nag, Xiatian Zhu, Yi-Zhe Song, and Tao Xiang. 2022. Proposal-Free Temporal Action Detection via Global Segmentation Mask Learning. In Computer Vision - ECCV. 645--662.
[31]
Liqiang Nie, Leigang Qu, Dai Meng, Min Zhang, Qi Tian, and Alberto Del Bimbo. 2022. Search-oriented Micro-video Captioning. In MM '22: The 30th ACM International Conference on Multimedia. 3234--3243.
[32]
Troy J. Nunnally, Penyen Chi, Kulsoom Abdullah, A. Selcuk Uluagac, John A. Copeland, and Raheem A. Beyah. 2013. P3D: A parallel 3D coordinate visualization for advanced network scans. In Proceedings of IEEE International Conference on Communications, ICC. 2052--2057.
[33]
Zhiwu Qing, Haisheng Su, Weihao Gan, Dongliang Wang, Wei Wu, Xiang Wang, Yu Qiao, Junjie Yan, Changxin Gao, and Nong Sang. 2021. Temporal Context Aggregation Network for Temporal Action Proposal Refinement. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR. 485--494.
[34]
Leigang Qu, Meng Liu, Jianlong Wu, Zan Gao, and Liqiang Nie. 2021. Dynamic Modality Interaction Modeling for Image-Text Retrieval. In SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1104--1113.
[35]
Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian D. Reid, and Silvio Savarese. 2019. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR. 658--666.
[36]
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI. 234--241.
[37]
Dingfeng Shi, Yujie Zhong, Qiong Cao, Jing Zhang, Lin Ma, Jia Li, and Dacheng Tao. 2022. ReAct: Temporal Action Detection with Relational Queries. In Computer Vision - ECCV. 105--121.
[38]
Haisheng Su, Weihao Gan, Wei Wu, Yu Qiao, and Junjie Yan. 2021. BSN++: Complementary Boundary Regressor with Scale-Balanced Relation Modeling for Temporal Action Proposal Generation. In Conference on Artificial Intelligence, AAAI. 2602--2610.
[39]
Jing Tan, Jiaqi Tang, Limin Wang, and Gangshan Wu. 2021. Relaxed Transformer Decoders for Direct Action Proposal Generation. In IEEE/CVF International Conference on Computer Vision, ICCV. 13506--13515.
[40]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, December 4-9, Long Beach, CA, USA. 5998--6008.
[41]
Qiang Wang, Yanhao Zhang, Yun Zheng, and Pan Pan. 2022b. RCL: Recurrent Continuous Localization for Temporal Action Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13566--13575.
[42]
Wenjie Wang, Xinyu Lin, Fuli Feng, Xiangnan He, Min Lin, and Tat-Seng Chua. 2022a. Causal representation learning for out-of-distribution recommendation. In Proceedings of the ACM Web Conference 2022. 3562--3571.
[43]
Kun Xia, Le Wang, Sanping Zhou, Nanning Zheng, and Wei Tang. 2022. Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR. 13874--13883.
[44]
Li Xiao, Yufan Luo, Chunlong Luo, Lianhe Zhao, Quanshui Fu, Guoqing Yang, Anpeng Huang, and Yi Zhao. 2020. PBRnet: Pyramidal Bounding Box Refinement to Improve Object Localization Accuracy. CoRR (2020).
[45]
Yuanjun Xiong, Limin Wang, Zhe Wang, Bowen Zhang, Hang Song, Wei Li, Dahua Lin, Yu Qiao, Luc Van Gool, and Xiaoou Tang. 2016. CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2016. CoRR (2016).
[46]
Huijuan Xu, Abir Das, and Kate Saenko. 2017. R-C3D: Region Convolutional 3D Network for Temporal Activity Detection. In IEEE International Conference on Computer Vision, ICCV. 5794--5803.
[47]
Mengmeng Xu, Chen Zhao, David S. Rojas, Ali K. Thabet, and Bernard Ghanem. 2020. G-TAD: Sub-Graph Localization for Temporal Action Detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR. 10153--10162.
[48]
Haosen Yang, Wenhao Wu, Lining Wang, Sheng Jin, Boyang Xia, Hongxun Yao, and Hujie Huang. 2022. Temporal Action Proposal Generation with Background Constraint. In Conference on Artificial Intelligence, AAAI. 3054--3062.
[49]
Le Yang, Houwen Peng, Dingwen Zhang, Jianlong Fu, and Junwei Han. 2020. Revisiting Anchor Mechanisms for Temporal Action Localization. IEEE Trans. Image Process. (2020), 8535--8548.
[50]
Runhao Zeng, Wenbing Huang, Chuang Gan, Mingkui Tan, Yu Rong, Peilin Zhao, and Junzhou Huang. 2019. Graph Convolutional Networks for Temporal Action Localization. In IEEE/CVF International Conference on Computer Vision, ICCV. 7093--7102.
[51]
Chen-Lin Zhang, Jianxin Wu, and Yin Li. 2022. ActionFormer: Localizing Moments of Actions with Transformers. In Computer Vision - ECCV. 492--510.
[52]
Chen Zhao, Ali K. Thabet, and Bernard Ghanem. 2021. Video Self-Stitching Graph Network for Temporal Action Localization. In IEEE/CVF International Conference on Computer Vision, ICCV. 13638--13647.
[53]
Peisen Zhao, Lingxi Xie, Chen Ju, Ya Zhang, Yanfeng Wang, and Qi Tian. 2020. Bottom-Up Temporal Action Localization with Mutual Regularization. In Computer Vision - ECCV. 539--555.
[54]
Yibo Zhao, Hua Zhang, Zan Gao, Wenjie Gao, Meng Wang, and Shengyong Chen. 2023. A Novel Action Saliency and Context-Aware Network for Weakly-Supervised Temporal Action Localization. IEEE Transactions on Multimedia (2023), 1--14. https://doi.org/10.1109/TMM.2023.3234362
[55]
Yibo Zhao, Hua Zhang, Zan Gao, Weili Guan, Jie Nie, Anan Liu, Meng Wang, and Shengyong Chen. 2022. A Temporal-Aware Relation and Attention Network for Temporal Action Localization. IEEE Trans. Image Process. (2022), 4746--4760.
[56]
Zhaohui Zheng, Ping Wang, Wei Liu, Jinze Li, Rongguang Ye, and Dongwei Ren. 2020. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. In Conference on Artificial Intelligence, AAAI. 12993--13000.
[57]
Zixin Zhu, Wei Tang, Le Wang, Nanning Zheng, and Gang Hua. 2021. Enriching Local and Global Contexts for Temporal Action Localization. In IEEE/CVF International Conference on Computer Vision, ICCV. 13496--13505.
[58]
Zixin Zhu, Le Wang, Wei Tang, Nanning Zheng, and Gang Hua. 2023. ContextLoc: A Unified Context Model for Temporal Action Localization. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).

Cited By

  • (2024) A Knowledge-Based Hierarchical Causal Inference Network for Video Action Recognition. IEEE Transactions on Multimedia 26 (9135-9149). DOI: 10.1109/TMM.2024.3386339. Online publication date: 12-Apr-2024

      Published In

      MM '23: Proceedings of the 31st ACM International Conference on Multimedia
      October 2023
      9913 pages
      ISBN:9798400701085
      DOI:10.1145/3581783

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. refine localization
      2. temporal action localization
      3. temporal channel enhancement
      4. temporal semantic contextual

      Qualifiers

      • Research-article

      Funding Sources

      • National Natural Science Foundation of China
      • Young creative team in universities of Shandong Province
      • Jinan 20 projects in universities
      • Shandong Excellent Young Scientists Fund Program
      • Shandong project towards the integration of education and industry

      Conference

      MM '23
      MM '23: The 31st ACM International Conference on Multimedia
      October 29 - November 3, 2023
      Ottawa ON, Canada

      Acceptance Rates

      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
