
A two-stage temporal proposal network for precise action localization in untrimmed video

  • Original Article
  • Journal: International Journal of Machine Learning and Cybernetics

Abstract

In this paper, we propose a two-stage temporal proposal algorithm for action detection in long untrimmed videos. In the first stage, we propose a novel prior-minor watershed algorithm that generates action proposals by combining a precise prior watershed proposal algorithm with a minor supplementary sliding-window algorithm; a correctness discriminator uses the sliding-window proposals to fill in proposals that the watershed algorithm may omit. In the second stage, we first propose extended context pooling (ECP), which comprises two modules, internal and context. The context module of ECP structures each proposal and enriches the features of its extended regions, and different levels of ECP model the proposal region so that its extended context region becomes more targeted and precise. We then propose a temporal context regression network that adopts a multi-task loss to train temporal coordinate regression and action/background classification simultaneously, and outputs precise temporal boundaries for the proposals. We also propose prior-minor ranking to balance the contributions of the prior watershed proposals and the minor supplementary proposals. On three large-scale benchmarks, THUMOS14, ActivityNet (v1.2 and v1.3), and Charades, our approach achieves superior performance compared with other state-of-the-art methods and runs at over 1020 frames per second (fps) on a single NVIDIA Titan X Pascal GPU, indicating that our method can efficiently improve the precision of temporal action localization.
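The abstract's second stage combines two trainable pieces: context-aware pooling of each proposal and a multi-task head that jointly classifies it and refines its boundaries. The following is a minimal PyTorch sketch of that idea, assuming mean pooling over a proposal's internal and flanking context regions as a stand-in for ECP; every name, dimension, and the context-extension ratio here is an illustrative assumption based only on the abstract, not the authors' implementation.

```python
# Illustrative sketch only: approximates the abstract's extended context
# pooling (ECP) and multi-task temporal regression; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def extended_context_pool(features, start, end, ctx_ratio=0.5):
    """Pool one proposal into [left context | internal | right context].

    features: (T, D) per-snippet features for one video.
    start, end: snippet indices of the proposal.
    ctx_ratio: assumed fraction of the proposal length extended per side.
    """
    T, D = features.shape
    ext = max(int(ctx_ratio * max(end - start, 1)), 1)
    left = features[max(start - ext, 0):start]
    inner = features[start:end]
    right = features[end:min(end + ext, T)]

    def mean_or_zero(x):
        # Empty regions (e.g. a proposal at the video border) pool to zeros.
        return x.mean(dim=0) if x.numel() else features.new_zeros(D)

    return torch.cat([mean_or_zero(left), mean_or_zero(inner),
                      mean_or_zero(right)])  # shape: (3 * D,)


class TemporalContextRegressor(nn.Module):
    """Two heads on a shared trunk: action/background and boundary offsets."""

    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(3 * feat_dim, hidden), nn.ReLU())
        self.cls_head = nn.Linear(hidden, 2)  # action vs. background
        self.reg_head = nn.Linear(hidden, 2)  # start / end offsets

    def forward(self, pooled):
        h = self.trunk(pooled)
        return self.cls_head(h), self.reg_head(h)


def multi_task_loss(cls_logits, reg_offsets, labels, reg_targets, lam=1.0):
    """Cross-entropy plus smooth-L1 boundary regression on positives only."""
    cls_loss = F.cross_entropy(cls_logits, labels)
    pos = labels == 1
    reg_loss = (F.smooth_l1_loss(reg_offsets[pos], reg_targets[pos])
                if pos.any() else cls_logits.new_zeros(()))
    return cls_loss + lam * reg_loss
```

In training, each candidate proposal (whether produced by the watershed algorithm or the supplementary sliding windows) would be pooled, labeled as action or background by its temporal overlap with ground truth, and optimized under the joint loss, so that classification and boundary refinement share one network.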



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 61973065, and in part by the Fundamental Research Funds for the Central Universities of China under Grants N172608005, N182612002, and N2026002.

Author information


Corresponding author

Correspondence to Fei Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wang, F., Wang, G., Du, Y. et al. A two-stage temporal proposal network for precise action localization in untrimmed video. Int. J. Mach. Learn. & Cyber. 12, 2199–2211 (2021). https://doi.org/10.1007/s13042-021-01301-z

