research-article

ProposalVLAD with Proposal-Intra Exploring for Temporal Action Proposal Generation

Published: 25 February 2023

Abstract

Temporal action proposal generation aims to localize temporal segments of human activities in videos. Current boundary-based proposal generation methods can generate proposals with precise boundaries but often suffer from the inferior quality of the confidence scores used for proposal retrieval. In this article, we propose an effective, end-to-end action proposal generation method, named ProposalVLAD with Proposal-Intra Exploring Network (PVPI-Net). We first propose a ProposalVLAD module that dynamically generates global features of the entire video and combines them with proposal-local features to produce the final feature representations of all candidate proposals. We then design a novel Proposal-Intra Loss function (PI-Loss) to generate more reliable proposal confidence scores. Extensive experiments on large-scale and challenging datasets demonstrate the effectiveness of the proposed method. Experimental results show that PVPI-Net achieves significant improvements on two benchmark datasets (i.e., THUMOS'14 and ActivityNet-1.3) and sets new records for the temporal action detection task.
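The abstract sketches the architecture at a high level: a VLAD-style layer aggregates snippet features into a global video descriptor, which is then fused with each candidate proposal's local feature before confidence scoring. As a minimal, hedged illustration of that idea only, the PyTorch sketch below implements a standard NetVLAD-style pooling layer; the class name ProposalVLADSketch, the cluster count, the feature dimension, and the commented fusion step are assumptions and do not reproduce the authors' exact ProposalVLAD module or the PI-Loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProposalVLADSketch(nn.Module):
    """NetVLAD-style pooling over snippet-level video features.

    A rough illustration only: the cluster count, feature dimension,
    and fusion step are assumptions, not the paper's exact design.
    """

    def __init__(self, feat_dim: int = 400, num_clusters: int = 8):
        super().__init__()
        self.num_clusters = num_clusters
        # 1x1 conv producing soft-assignment logits of each snippet to each cluster
        self.assign = nn.Conv1d(feat_dim, num_clusters, kernel_size=1)
        # learnable cluster centroids
        self.centroids = nn.Parameter(torch.randn(num_clusters, feat_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) snippet features extracted from the whole untrimmed video
        B, C, T = x.shape
        a = F.softmax(self.assign(x), dim=1)                    # (B, K, T) soft assignments
        residual = x.unsqueeze(1) - self.centroids.view(1, self.num_clusters, C, 1)  # (B, K, C, T)
        vlad = (a.unsqueeze(2) * residual).sum(dim=-1)          # (B, K, C) aggregated residuals
        vlad = F.normalize(vlad, dim=2)                         # intra-normalization per cluster
        return F.normalize(vlad.flatten(1), dim=1)              # (B, K*C) global video descriptor


# Hypothetical fusion step: broadcast the global descriptor and concatenate it with
# each of the N candidate proposals' local features before predicting confidence.
# global_feat = ProposalVLADSketch()(snippet_feats)            # (B, K*C)
# fused = torch.cat([proposal_feats,
#                    global_feat.unsqueeze(1).expand(-1, N, -1)], dim=-1)
```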


• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 3
  May 2023
  514 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3582886
  • Editor: Abdulmotaleb El Saddik

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 25 February 2023
      • Online AM: 24 November 2022
      • Accepted: 6 November 2022
      • Revised: 6 October 2022
      • Received: 11 July 2022
Published in TOMM Volume 19, Issue 3
