
A two-stage temporal proposal network for precise action localization in untrimmed video

  • Original Article
  • Journal: International Journal of Machine Learning and Cybernetics

Abstract

In this paper, we propose a two-stage temporal proposal algorithm for action detection in long untrimmed videos. In the first stage, we propose a novel prior-minor watershed algorithm that generates action proposals by combining a precise prior watershed proposal algorithm with a minor supplementary sliding-window algorithm; a correctness discriminator uses the sliding-window proposals to fill in proposals that the watershed algorithm may omit. In the second stage, we first propose extended context pooling (ECP), which comprises two modules, internal and context. The context module of ECP structures each proposal and enriches the features of its extended regions, and different levels of ECP model the proposal region so that its extended context region becomes more targeted and precise. We then propose a temporal context regression network that adopts a multi-task loss to train temporal coordinate regression and action/background classification simultaneously, and outputs precise temporal boundaries for the proposals. We also propose prior-minor ranking to balance the contributions of the prior watershed proposals and the minor supplementary proposals. On three large-scale benchmarks, THUMOS14, ActivityNet (v1.2 and v1.3), and Charades, our approach achieves superior performance compared with other state-of-the-art methods and runs at over 1020 frames per second (fps) on a single NVIDIA Titan X Pascal GPU, indicating that our method can efficiently improve the precision of temporal action localization.
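The abstract's second stage combines two trainable pieces: context-aware pooling of each proposal and a multi-task head that jointly classifies it and refines its boundaries. The following is a minimal PyTorch sketch of that idea, assuming mean pooling over a proposal's internal and flanking context regions as a stand-in for ECP; every name, dimension, and the context-extension ratio here is an illustrative assumption based only on the abstract, not the authors' implementation.

```python
# Illustrative sketch only: approximates the abstract's extended context
# pooling (ECP) and multi-task temporal regression; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def extended_context_pool(features, start, end, ctx_ratio=0.5):
    """Pool one proposal into [left context | internal | right context].

    features: (T, D) per-snippet features for one video.
    start, end: snippet indices of the proposal.
    ctx_ratio: assumed fraction of the proposal length extended per side.
    """
    T, D = features.shape
    ext = max(int(ctx_ratio * max(end - start, 1)), 1)
    left = features[max(start - ext, 0):start]
    inner = features[start:end]
    right = features[end:min(end + ext, T)]

    def mean_or_zero(x):
        # Empty regions (e.g. a proposal at the video border) pool to zeros.
        return x.mean(dim=0) if x.numel() else features.new_zeros(D)

    return torch.cat([mean_or_zero(left), mean_or_zero(inner),
                      mean_or_zero(right)])  # shape: (3 * D,)


class TemporalContextRegressor(nn.Module):
    """Two heads on a shared trunk: action/background and boundary offsets."""

    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(3 * feat_dim, hidden), nn.ReLU())
        self.cls_head = nn.Linear(hidden, 2)  # action vs. background
        self.reg_head = nn.Linear(hidden, 2)  # start / end offsets

    def forward(self, pooled):
        h = self.trunk(pooled)
        return self.cls_head(h), self.reg_head(h)


def multi_task_loss(cls_logits, reg_offsets, labels, reg_targets, lam=1.0):
    """Cross-entropy plus smooth-L1 boundary regression on positives only."""
    cls_loss = F.cross_entropy(cls_logits, labels)
    pos = labels == 1
    reg_loss = (F.smooth_l1_loss(reg_offsets[pos], reg_targets[pos])
                if pos.any() else cls_logits.new_zeros(()))
    return cls_loss + lam * reg_loss
```

In training, each candidate proposal (whether produced by the watershed algorithm or the supplementary sliding windows) would be pooled, labeled as action or background by its temporal overlap with ground truth, and optimized under the joint loss, so that classification and boundary refinement share one network.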



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 61973065, and in part by the Fundamental Research Funds for the Central Universities of China under Grants N172608005, N182612002, and N2026002.

Author information


Corresponding author

Correspondence to Fei Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wang, F., Wang, G., Du, Y. et al. A two-stage temporal proposal network for precise action localization in untrimmed video. Int. J. Mach. Learn. & Cyber. 12, 2199–2211 (2021). https://doi.org/10.1007/s13042-021-01301-z

