
Towards Completeness: A Generalizable Action Proposal Generator for Zero-Shot Temporal Action Localization

  • Conference paper
Pattern Recognition (ICPR 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15316)


Abstract

To address the zero-shot temporal action localization (ZSTAL) task, existing works develop models that generalize to detect and classify actions from unseen categories. They typically build a category-agnostic action detector and combine it with the Contrastive Language-Image Pre-training (CLIP) model to solve ZSTAL. However, these methods produce incomplete action proposals for unseen categories, since they follow a frame-level prediction paradigm and rely on hand-crafted post-processing to assemble proposals. To address this problem, we propose a novel model named Generalizable Action Proposal generator (GAP), which interfaces seamlessly with CLIP and generates action proposals in a holistic way. GAP is built on a query-based architecture and trained with a proposal-level objective, enabling it to estimate proposal completeness and eliminate hand-crafted post-processing. On top of this architecture, we propose an Action-aware Discrimination loss to enhance the category-agnostic dynamic information of actions. In addition, we introduce a Static-Dynamic Rectifying module that incorporates the generalizable static information from CLIP to refine the predicted proposals, improving proposal completeness in a generalizable manner. Experiments show that GAP achieves state-of-the-art performance on two challenging ZSTAL benchmarks, THUMOS14 and ActivityNet v1.3, with significant improvements over previous works of +3.2% and +3.4% average mAP, respectively. The code is available at https://github.com/Run542968/GAP.
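
The abstract describes a query-based, category-agnostic proposal generator whose outputs are classified against CLIP text embeddings of category names. Below is a minimal sketch of that overall pipeline, assuming a DETR-style decoder over per-frame features; the class QueryProposalGenerator, the helper zero_shot_classify, and all dimensions are illustrative assumptions and do not reproduce the paper's Action-aware Discrimination loss, Static-Dynamic Rectifying module, or training objective.

# Minimal sketch (not the authors' code) of query-based proposal generation
# combined with CLIP-style text embeddings for zero-shot classification.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryProposalGenerator(nn.Module):
    """DETR-style decoder: learnable queries attend to frame features, and each
    query regresses one (center, width) segment plus a category-agnostic score."""

    def __init__(self, feat_dim=512, num_queries=40, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.segment_head = nn.Linear(feat_dim, 2)      # normalized (center, width)
        self.actionness_head = nn.Linear(feat_dim, 1)   # category-agnostic confidence

    def forward(self, frame_feats):
        # frame_feats: (B, T, D) per-frame visual features (e.g., from CLIP's image encoder)
        B = frame_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)     # (B, Q, D)
        h = self.decoder(tgt=q, memory=frame_feats)          # (B, Q, D)
        segments = self.segment_head(h).sigmoid()            # (B, Q, 2), values in [0, 1]
        actionness = self.actionness_head(h).squeeze(-1)     # (B, Q)
        return h, segments, actionness


def zero_shot_classify(proposal_feats, text_embeds, temperature=0.07):
    """Label proposals for unseen categories via cosine similarity with CLIP text
    embeddings of category-name prompts (prompt ensembling omitted for brevity)."""
    p = F.normalize(proposal_feats, dim=-1)    # (B, Q, D)
    t = F.normalize(text_embeds, dim=-1)       # (C, D)
    return (p @ t.t()) / temperature           # (B, Q, C) classification logits


if __name__ == "__main__":
    B, T, D, C = 2, 128, 512, 10
    frame_feats = torch.randn(B, T, D)   # placeholder for CLIP frame features
    text_embeds = torch.randn(C, D)      # placeholder for CLIP text embeddings of class prompts
    model = QueryProposalGenerator(feat_dim=D)
    h, segments, actionness = model(frame_feats)
    logits = zero_shot_classify(h, text_embeds)
    print(segments.shape, actionness.shape, logits.shape)   # (2, 40, 2) (2, 40) (2, 40, 10)

In the full method, a proposal-level training objective (e.g., one-to-one matching between queries and ground-truth segments, as in DETR-style detectors) would replace the frame-level paradigm and the hand-crafted post-processing criticized in the abstract; the sketch above only shows the inference-time structure.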



Acknowledgements

This work was supported in part by NSFC (No. 62206315), the Guangdong NSF Project (No. 2023B1515040025, No. 2024A1515010101), and the Guangzhou Basic and Applied Basic Research Scheme (No. 2024A04J4067).

Author information


Corresponding author

Correspondence to Jingke Meng.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF, 191 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Du, JR., Lin, KY., Meng, J., Zheng, WS. (2025). Towards Completeness: A Generalizable Action Proposal Generator for Zero-Shot Temporal Action Localization. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15316. Springer, Cham. https://doi.org/10.1007/978-3-031-78444-6_17


  • DOI: https://doi.org/10.1007/978-3-031-78444-6_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-78443-9

  • Online ISBN: 978-3-031-78444-6

  • eBook Packages: Computer Science, Computer Science (R0)
