Fine-Grained Dynamic Network for Generic Event Boundary Detection

  • Conference paper
  • In: Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Generic event boundary detection (GEBD) aims at pinpointing event boundaries naturally perceived by humans, and plays a crucial role in understanding long-form videos. Given the diverse nature of generic boundaries, spanning different video appearances, objects, and actions, this task remains challenging. Existing methods typically detect all boundaries with the same protocol, regardless of their distinctive characteristics and detection difficulty, which leads to suboptimal performance. Intuitively, a more reasonable approach is to detect boundaries adaptively, taking their particular properties into account. In light of this, we propose a novel dynamic pipeline for generic event boundaries named DyBDet. By introducing a multi-exit network architecture, DyBDet automatically learns the subnet allocation to different video snippets, enabling fine-grained detection for various boundaries. In addition, a multi-order difference detector is proposed to ensure that generic boundaries can be effectively identified and adaptively processed. Extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets demonstrate that adopting the dynamic strategy significantly benefits GEBD, yielding clear improvements in both performance and efficiency over the current state of the art. The code is available at https://github.com/Ziwei-Zheng/DyBDet.
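Since the abstract only sketches the two mechanisms, the following is a minimal, hypothetical PyTorch illustration of what they could look like. It is not the authors' implementation (the official code lives at https://github.com/Ziwei-Zheng/DyBDet), and all class names, layer choices, and the exit threshold are assumptions made for this sketch. It shows (a) a multi-order difference module that augments per-frame features with their first- and second-order temporal differences as boundary cues, and (b) a multi-exit stack of temporal convolution blocks, where inference stops at the first exit whose per-frame boundary scores are all confidently resolved.

# Hedged sketch, inference only; names and hyperparameters are hypothetical.
import torch
import torch.nn as nn


class MultiOrderDiff(nn.Module):
    """Concatenate raw features with their 1st..K-th order temporal differences."""

    def __init__(self, max_order: int = 2):
        super().__init__()
        self.max_order = max_order

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) per-frame features
        feats, d = [x], x
        for _ in range(self.max_order):
            d = torch.diff(d, dim=1)                  # forward difference along time
            d = nn.functional.pad(d, (0, 0, 0, 1))    # zero-pad to keep length T
            feats.append(d)
        return torch.cat(feats, dim=-1)               # (B, T, D * (max_order + 1))


class MultiExitBoundaryNet(nn.Module):
    """Temporal conv blocks, each followed by a per-frame boundary-score exit."""

    def __init__(self, dim: int, num_exits: int = 3, threshold: float = 0.9):
        super().__init__()
        self.diff = MultiOrderDiff(max_order=2)
        in_dim = dim * 3  # raw + 1st-order + 2nd-order differences
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(in_dim if i == 0 else dim, dim, 3, padding=1),
                nn.ReLU(),
            )
            for i in range(num_exits)
        )
        self.exits = nn.ModuleList(nn.Conv1d(dim, 1, 1) for _ in range(num_exits))
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D); returns per-frame boundary probabilities (B, T)
        h = self.diff(x).transpose(1, 2)              # (B, C, T) for Conv1d
        prob = None
        for block, exit_head in zip(self.blocks, self.exits):
            h = block(h)
            prob = torch.sigmoid(exit_head(h)).squeeze(1)   # (B, T)
            # Early exit: stop once every frame score is confidently 0 or 1.
            confidence = torch.maximum(prob, 1.0 - prob)
            if confidence.min() > self.threshold:
                break
        return prob


if __name__ == "__main__":
    net = MultiExitBoundaryNet(dim=64).eval()
    snippet = torch.randn(2, 100, 64)  # 2 videos, 100 frames, 64-d features
    print(net(snippet).shape)          # torch.Size([2, 100])

The design point the abstract argues for is visible in the loop: snippets with obvious boundaries satisfy the confidence test at an early exit and skip the deeper subnets, while ambiguous snippets pay for the full stack.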


Acknowledgement

This work was supported in part by the National Science and Technology Major Project under Grant 2022ZD0115803, the Natural Science Basic Research Plan in Shaanxi Province of China under Grant 2023-JC-JQ-51, and the National Natural Science Foundation of China under Grant 62206215.

Author information


Corresponding author

Correspondence to Fan Li.



Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zheng, Z., He, L., Yang, L., Li, F. (2025). Fine-Grained Dynamic Network for Generic Event Boundary Detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15101. Springer, Cham. https://doi.org/10.1007/978-3-031-72775-7_7

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72775-7_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72774-0

  • Online ISBN: 978-3-031-72775-7

  • eBook Packages: Computer Science, Computer Science (R0)
