Abstract
Temporal action proposal generation aims to localize the temporal segments of human activities in videos. Current boundary-based proposal generation methods can produce proposals with precise boundaries but often suffer from low-quality confidence scores used for proposal retrieval. In this article, we propose an effective, end-to-end action proposal generation method, the ProposalVLAD with Proposal-Intra exploring Network (PVPI-Net). We first introduce a ProposalVLAD module that dynamically aggregates global features of the entire video, and we combine these global features with proposal-local features to obtain the final feature representations of all candidate proposals. We then design a novel Proposal-Intra Loss function (PI-Loss) to generate more reliable proposal confidence scores. Extensive experiments on large-scale, challenging datasets demonstrate the effectiveness of the proposed method: PVPI-Net achieves significant improvements on two benchmark datasets (THUMOS'14 and ActivityNet-1.3) and sets new records for the temporal action detection task.
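The abstract does not give implementation details of the ProposalVLAD module; as a rough illustration of the idea, the sketch below shows a generic NetVLAD-style soft aggregation that turns a sequence of per-snippet video features into a single global descriptor, which could then be concatenated with proposal-local features. All shapes, parameter names, and the NumPy formulation are assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def proposal_vlad(features, centers, assign_w, assign_b):
    """NetVLAD-style soft aggregation over temporal snippet features (a sketch).

    features : (T, D) per-snippet features for the whole video
    centers  : (K, D) learnable cluster centers
    assign_w : (D, K) soft-assignment weights; assign_b: (K,) biases
    Returns a (K*D,) L2-normalized global video descriptor.
    """
    # Soft-assign each snippet to the K clusters (softmax over clusters).
    logits = features @ assign_w + assign_b            # (T, K)
    a = np.exp(logits - logits.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)

    # Weighted residuals of each snippet feature to each cluster center.
    resid = features[:, None, :] - centers[None, :, :]  # (T, K, D)
    vlad = (a[:, :, None] * resid).sum(axis=0)          # (K, D)

    # Intra-normalize per cluster, flatten, then L2-normalize overall.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    v = vlad.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)

# Toy usage with random snippet features (T snippets, D dims, K clusters).
rng = np.random.default_rng(0)
T, D, K = 100, 32, 8
g = proposal_vlad(rng.normal(size=(T, D)), rng.normal(size=(K, D)),
                  rng.normal(size=(D, K)), np.zeros(K))
```

In a trainable version, `centers`, `assign_w`, and `assign_b` would be learned end-to-end together with the proposal network, so the aggregation adapts to the task rather than using fixed clusters.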