Abstract
Temporal action detection is an important yet challenging task in video understanding task. Temporal action proposals generation is a common module in action detection, and it effects the performance of action detection greatly. The module requires methods not only generating proposals with accurate temporal boundaries, but also retrieving proposals to cover action instances with high recall using relative fewer proposals. To address these difficulties, we propose an Actionness Score Optimization Model to improve the accuracy of generated proposals by capturing global contextual information of untrimmed videos. Firstly, a deconvolution layer is utilized to learn a nonlinear upsampling for the extracted features, in both spatial and temporal domains. In order to reveal the contextual information, then we introduce the bidirectional gated recurrent unit to the network. Moreover, an attention mechanism is applied to the network so that it can focus on the most relevant parts of the information to obtain more reliable actionness scores. Finally, we validate the effectiveness of our proposed network on three challenging benchmark datasets, ActivityNet v1.2, ActivityNet v1.3, and THUMOS’14.
Similar content being viewed by others
References
Yu H, Li G, Zhang W, Huang Q, Du D, Tian Q, Sebe N (2020) The unmanned aerial vehicle benchmark: object detection, tracking and baseline. Int J Comput Vis 128(5):1141–1159. https://doi.org/10.1007/s11263-019-01266-1
Vallathan G, Ayeelyan J, Thirumalai CS, Mohan S, Srivastava G, Lin C-W (2021) Suspicious activity detection using deep learning in secure assisted living IoT environments. J Supercomput 77(4):3242–3260. https://doi.org/10.1007/s11227-020-03387-8
Zhang K, Grauman K, Sha F (2018) Retrospective encoders for video summarization. In: 2018 European Conference on Computer Vision (ECCV), pp 391–408
Rochan M, Ye L, Wang Y (2018) Video summarization using fully convolutional sequence networks. In: 2018 European Conference on Computer Vision (ECCV), pp 358–374
Hussain T, Muhammad K, Ding W, Lloret J, Baik SW, de Albuquerque VHC (2021) A comprehensive survey of multi-view video summarization. Pattern Recogn 109:107567. https://doi.org/10.1016/j.patcog.2020.107567
Song J, Gao L, Liu L, Zhu X, Sebe N (2018) Quantization-based hashing: a general framework for scalable image and video retrieval. Pattern Recogn 75:175–187. https://doi.org/10.1016/j.patcog.2017.03.021
Dong J, Li X, Xu C, Yang X, Yang G, Wang X, Wang M (2021) Dual encoding for video retrieval by text. IEEE Trans Pattern Anal Mach Intell 1:21. https://doi.org/10.1109/TPAMI.2021.3059295
Gabeur V, Sun C, Alahari K, Schmid C (2020) Multi-modal transformer for video retrieval. In: 2020 European Conference on Computer Vision (ECCV), pp 214–229
Moltisanti D, Fidler S, Damen D (2019) Action recognition from single timestamp supervision in untrimmed videos. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 9907–9916. https://doi.org/10.1109/CVPR.2019.01015
Singh A, Chakraborty O, Varshney A, Panda R, Feris R, Saenko K, Das A (2021) Semi-supervised action recognition with temporal contrastive learning. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 10384–10394. https://doi.org/10.1109/CVPR46437.2021.01025
Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3d convolutional networks. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp 4489–4497. https://doi.org/10.1109/ICCV.2015.510
Cai D, Yao A, Chen Y (2021) Dynamic normalization and relay for video action recognition. In: Advances in neural information processing systems, vol 34, pp 11026–11040
Buch S, Escorcia V, Shen C, Ghanem B, Niebles JC (2017) SST: single-stream temporal action proposals. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 6373–6382. https://doi.org/10.1109/CVPR.2017.675
Heilbron FC, Niebles JC, Ghanem B (2016) Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1914–1923. https://doi.org/10.1109/CVPR.2016.211
Escorcia V, Heilbron FC, Niebles JC, Ghanem B (2016) DAPs: deep action proposals for action understanding. In: 2016 European Conference on Computer Vision (ECCV), pp 768–784
Gao J, Yang Z, Sun C, Chen K, Nevatia R (2017) TURN TAP: temporal unit regression network for temporal action proposals. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp 3648–3656. https://doi.org/10.1109/ICCV.2017.392
Gao J, Shi Z, Li J, Wang G, Yuan Y, Ge S, Zhou X (2020) Accurate temporal action proposal generation with relation-aware pyramid network. In: 2020 the AAAI Conference on Artificial Intelligence, vol 34, pp 10810–10817. https://doi.org/10.1609/aaai.v34i07.6711
Shou Z, Wang D, Chang S-F (2016) Temporal action localization in untrimmed videos via multi-stage cnns. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1049–1058. https://doi.org/10.1109/CVPR.2016.119
Zhao Y, Xiong Y, Wang L, Wu Z, Tang X, Lin D (2017) Temporal action detection with structured segment networks. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp 2933–2942. https://doi.org/10.1109/ICCV.2017.317
Schuster M, Paliwal KK (1997) Bidirectional recurrent neural networks. IEEE Trans Signal Process 45(11):2673–2681. https://doi.org/10.1109/78.650093
Heilbron FC, Escorcia V, Ghanem B, Niebles JC (2015) Activitynet: a large-scale video benchmark for human activity understanding. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 961–970. https://doi.org/10.1109/CVPR.2015.7298698
Idrees H, Zamir AR, Jiang Y, Gorban A, Laptev I, Sukthankar R, Shah M (2017) The THUMOS challenge on action recognition for videos “in the wild’’. Comput Vis Image Understand 155(4):1–23
Perš J, Sulić V, Kristan M, Perše M, Polanec K, Kovačič S (2010) Histograms of optical flow for efficient representation of body motion. Pattern Recogn Lett 31(11):1369–1376. https://doi.org/10.1016/j.patrec.2010.03.024
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol 1, pp 886–893. https://doi.org/10.1109/CVPR.2005.177
Wang H, Kläser A, Schmid C, Liu C-L (2013) Dense trajectories and motion boundary descriptors for action recognition. Int J Comput Vis 103:60–79. https://doi.org/10.1007/s11263-012-0594-8
Wang H, Schmid C (2013) Action recognition with improved trajectories. In: 2013 IEEE International Conference on Computer Vision (ICCV), pp 3551–3558. https://doi.org/10.1109/ICCV.2013.441
Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1725–1732. https://doi.org/10.1109/CVPR.2014.223
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: 2014 the 27th International Conference on Neural Information Processing Systems. NIPS’14, pp 568–576
Shi L, Zhang Y, Cheng J, Lu H (2019) Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 12018–12027. https://doi.org/10.1109/CVPR.2019.01230
Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1933–1941. https://doi.org/10.1109/CVPR.2016.213
Qiu Z, Yao T, Mei T (2017) Learning spatio-temporal representation with pseudo-3d residual networks. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp 5534–5542. https://doi.org/10.1109/ICCV.2017.590
Liu Q, Wang Z (2020) Progressive boundary refinement network for temporal action detection. In: 2020 the AAAI Conference on Artificial Intelligence, vol 34, pp 11612–11619. https://doi.org/10.1609/aaai.v34i07.6829
Shou Z, Chan J, Zareian A, Miyazawa K, Chang S-F (2017) CDC: convolutional-De-Convolutional networks for precise temporal action localization in untrimmed videos. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1417–1426. https://doi.org/10.1109/CVPR.2017.155
Jiyang Gao ZY, Nevatia R (2017) Cascaded boundary regression for temporal action detection. In: The British Machine Vision Conference (BMVC), pp 1–11. https://doi.org/10.5244/C.31.52
Liu X, Wang Q, Hu Y, Tang X, Bai S, Bai X (2021) End-to-end temporal action detection with transformer. ArXiv abs/2106.10271
Bodla N, Singh B, Chellappa R, Davis LS (2017) Soft-NMS—improving object detection with one line of code. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp 5562–5570. https://doi.org/10.1109/ICCV.2017.593
Zhang G, Rao Y, Wang C, Zhou W, Ji X (2021) A deep learning method for video-based action recognition. IET Image Proc 15(12):3498–3511. https://doi.org/10.1049/ipr2.12303
Roerdink JBTM, Meijster A (2003) The watershed transform: definitions, algorithms and parallelization strategies. Fund Inform 41(10):187–228
Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2016) Temporal segment networks: towards good practices for deep action recognition. In: 2016 European Conference on Computer Vision (ECCV), vol 9912, pp 20–36. https://doi.org/10.1007/978-3-319-46484-8_2
Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: 2015 the 32nd International Conference on Machine Learning (ICML), vol 37, pp 448–456
Lin T, Zhao X, Su H, Wang C, Yang M (2018) BSN: boundary sensitive network for temporal action proposal generation. In: Computer Vision—ECCV 2018—15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part IV. Lecture Notes in Computer Science, vol 11208, pp 3–21
Lin T, Liu X, Li X, Ding E, Wen S (2019) BMN: boundary-matching network for temporal action proposal generation. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27–November 2, 2019, pp 3888–3897
Wang W, Lin T, He D, Li F, Wen S, Wang L, Liu J (2021) Semi-supervised temporal action proposal generation via exploiting 2-d proposal map. IEEE Trans. Multim. 24:3624–3635
Shou Z, Wang D, Chang SF (2016) Temporal action localization in untrimmed videos via multi-stage cnns. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016. https://doi.org/10.1109/CVPR.2016.119
Buch S, Escorcia V, Shen C, Ghanem B, Niebles JC (2017) SST: single-stream temporal action proposals. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, pp 6373–6382
Zhang D, Dai X, Wang X, Wang YF (2018) S3d: Single shot multi-span detector via fully 3d convolutional networks. In: British Machine Vision Conference 2018, BMVC 2018, Newcastle, UK,September 3–6, 2018, p 293
Chao YW, Vijayanarasimhan S, Seybold B, Ross DA, Deng J, Sukthankar R (2018) Rethinking the faster R-CNN architecture for temporal action localization. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18–22, 2018, pp 1130–1139
Gao J, Chen K, Nevatia R (2018) CTAP: complementary temporal action proposal generation. In: Computer Vision—ECCV 2018—15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part II. Lecture Notes in Computer Science, vol 11206, pp 70–85
Lin T, Zhao X, Shou Z (2017) Temporal convolution based action proposal: submission to activitynet 2017. CVPR ActivityNet Workshop abs/1707.06750
Acknowledgements
This work was supported in part by funding from the National Natural Science Foundation of China (61876104, 62002061). Financial support for this study was provided by a grant from the National Natural Science Foundation of China. The authors wish to thank Prof. Jinwen Yan for his suggestions on preparing the manuscript.
Funding
This work was supported by the National Natural Science Foundation of China (Grant Nos. 61876104 and 62002061).
Author information
Authors and Affiliations
Contributions
XL, JY, and ZC wrote the main manuscript text, and Jian-huang Lai prepared figures 1–4. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or nonfinancial interest in the subject matter or materials discussed in this manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and materials
All data generated or analyzed during this study are available from the corresponding author on reasonable request.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liao, X., Yuan, J., Cai, Z. et al. An attention-based bidirectional GRU network for temporal action proposals generation. J Supercomput 79, 8322–8339 (2023). https://doi.org/10.1007/s11227-022-04973-8
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-022-04973-8