Abstract
Temporal action proposal generation aims to localize the temporal segments of human activities in videos. Current boundary-based proposal generation methods can produce proposals with precise boundaries but often suffer from low-quality confidence scores used for proposal retrieval. In this article, we propose an effective, end-to-end action proposal generation method, the ProposalVLAD with Proposal-Intra exploring Network (PVPI-Net). We first introduce a ProposalVLAD module that dynamically aggregates global features of the entire video, and we combine these global features with proposal-local features to obtain the final feature representations of all candidate proposals. We then design a novel Proposal-Intra Loss function (PI-Loss) to generate more reliable proposal confidence scores. Extensive experiments on large-scale, challenging datasets demonstrate the effectiveness of the proposed method: PVPI-Net achieves significant improvements on two benchmark datasets (THUMOS'14 and ActivityNet-1.3) and sets new records for the temporal action detection task.
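The abstract does not give implementation details of the ProposalVLAD module; as a rough illustration of the idea, the sketch below shows a generic NetVLAD-style soft aggregation that turns a sequence of per-snippet video features into a single global descriptor, which could then be concatenated with proposal-local features. All shapes, parameter names, and the NumPy formulation are assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def proposal_vlad(features, centers, assign_w, assign_b):
    """NetVLAD-style soft aggregation over temporal snippet features (a sketch).

    features : (T, D) per-snippet features for the whole video
    centers  : (K, D) learnable cluster centers
    assign_w : (D, K) soft-assignment weights; assign_b: (K,) biases
    Returns a (K*D,) L2-normalized global video descriptor.
    """
    # Soft-assign each snippet to the K clusters (softmax over clusters).
    logits = features @ assign_w + assign_b            # (T, K)
    a = np.exp(logits - logits.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)

    # Weighted residuals of each snippet feature to each cluster center.
    resid = features[:, None, :] - centers[None, :, :]  # (T, K, D)
    vlad = (a[:, :, None] * resid).sum(axis=0)          # (K, D)

    # Intra-normalize per cluster, flatten, then L2-normalize overall.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    v = vlad.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)

# Toy usage with random snippet features (T snippets, D dims, K clusters).
rng = np.random.default_rng(0)
T, D, K = 100, 32, 8
g = proposal_vlad(rng.normal(size=(T, D)), rng.normal(size=(K, D)),
                  rng.normal(size=(D, K)), np.zeros(K))
```

In a trainable version, `centers`, `assign_w`, and `assign_b` would be learned end-to-end together with the proposal network, so the aggregation adapts to the task rather than using fixed clusters.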