
An attention-based bidirectional GRU network for temporal action proposals generation

Published in The Journal of Supercomputing.

Abstract

Temporal action detection is an important yet challenging task in video understanding. Temporal action proposal generation is a common module in action detection pipelines, and it greatly affects detection performance. The module requires methods that not only generate proposals with accurate temporal boundaries, but also cover action instances with high recall using relatively few proposals. To address these difficulties, we propose an Actionness Score Optimization Model that improves the accuracy of generated proposals by capturing the global contextual information of untrimmed videos. First, a deconvolution layer is used to learn a nonlinear upsampling of the extracted features in both the spatial and temporal domains. Then, to capture contextual information, we introduce a bidirectional gated recurrent unit (GRU) into the network. Moreover, an attention mechanism is applied so that the network can focus on the most relevant parts of the information and produce more reliable actionness scores. Finally, we validate the effectiveness of the proposed network on three challenging benchmark datasets: ActivityNet v1.2, ActivityNet v1.3, and THUMOS'14.
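The pipeline sketched in the abstract (deconvolution upsampling, then a bidirectional GRU, then attention-weighted actionness scoring) can be illustrated with a minimal PyTorch model. This is only a structural sketch under assumed sizes: the feature dimension, hidden size, kernel/stride settings, and the additive-attention form are illustrative assumptions, not the authors' implementation or hyperparameters.

```python
import torch
import torch.nn as nn


class ActionnessScoreSketch(nn.Module):
    """Sketch of the described pipeline: deconv upsampling -> BiGRU -> attention -> actionness.
    All layer sizes here are assumptions for illustration, not the paper's settings."""

    def __init__(self, feat_dim=400, hidden=128):
        super().__init__()
        # Learned nonlinear temporal upsampling (doubles the temporal length).
        self.deconv = nn.ConvTranspose1d(feat_dim, feat_dim,
                                         kernel_size=4, stride=2, padding=1)
        # Bidirectional GRU gathers past and future context for each snippet.
        self.bigru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Additive attention weights over the GRU outputs.
        self.attn = nn.Linear(2 * hidden, 1)
        # Per-snippet actionness score head.
        self.score = nn.Linear(2 * hidden, 1)

    def forward(self, x):                                    # x: (batch, T, feat_dim)
        x = self.deconv(x.transpose(1, 2)).transpose(1, 2)   # (batch, 2T, feat_dim)
        h, _ = self.bigru(x)                                 # (batch, 2T, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)               # attention over time, sums to 1
        h = h * w * h.size(1)                                # re-weight features, keep scale
        return torch.sigmoid(self.score(h)).squeeze(-1)      # actionness in (0, 1) per snippet


model = ActionnessScoreSketch()
scores = model(torch.randn(2, 50, 400))   # 2 videos, 50 snippet features each
print(scores.shape)                       # torch.Size([2, 100]) after 2x upsampling
```

The stride-2 deconvolution maps 50 input snippets to 100 score positions, mirroring the idea of recovering a finer temporal resolution before scoring; the attention weights let distant but relevant snippets influence each position's score through the re-weighted GRU features.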


Figs. 1–4 (thumbnails only; full figures appear in the article PDF)



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61876104 and 62002061). The authors wish to thank Prof. Jinwen Yan for his suggestions on preparing the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61876104 and 62002061).

Author information

Authors and Affiliations

Authors

Contributions

XL, JY, and ZC wrote the main manuscript text, and Jian-huang Lai prepared Figures 1–4. All authors reviewed the manuscript.

Corresponding author

Correspondence to Zemin Cai.

Ethics declarations

Conflict of interest

All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or nonfinancial interest in the subject matter or materials discussed in this manuscript.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and materials

All data generated or analyzed during this study are available from the corresponding author on reasonable request.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liao, X., Yuan, J., Cai, Z. et al. An attention-based bidirectional GRU network for temporal action proposals generation. J Supercomput 79, 8322–8339 (2023). https://doi.org/10.1007/s11227-022-04973-8
