Abstract
Generic event boundary detection (GEBD) aims to pinpoint the event boundaries naturally perceived by humans and plays a crucial role in understanding long-form videos. Given the diverse nature of generic boundaries, which span different video appearances, objects, and actions, this task remains challenging. Existing methods usually detect all boundaries with the same protocol, regardless of their distinctive characteristics and detection difficulties, resulting in suboptimal performance. Intuitively, a more reasonable approach is to detect boundaries adaptively, taking their specific properties into account. In light of this, we propose DyBDet, a novel dynamic pipeline for generic event boundary detection. By introducing a multi-exit network architecture, DyBDet automatically learns to allocate subnets to different video snippets, enabling fine-grained detection of various boundaries. In addition, a multi-order difference detector is proposed to ensure that generic boundaries are effectively identified and adaptively processed. Extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets demonstrate that the dynamic strategy significantly benefits GEBD, yielding clear improvements in both performance and efficiency over the current state-of-the-art. The code is available at https://github.com/Ziwei-Zheng/DyBDet.
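To make the two mechanisms named in the abstract concrete, the sketch below illustrates how multi-order temporal differences and a multi-exit boundary head could fit together. It is a minimal, hypothetical rendering of the abstract's description, not the authors' DyBDet implementation (see the repository linked above); the module names, feature sizes, difference orders, and the confidence-threshold exit rule are all illustrative assumptions.

```python
# A minimal, hypothetical sketch of the two ideas named in the abstract:
# (1) multi-order temporal differences as boundary evidence, and
# (2) a multi-exit head that lets "easy" snippets settle at an early subnet.
# This is NOT the released DyBDet code; all names and sizes are assumptions.
import torch
import torch.nn as nn

class MultiOrderDifference(nn.Module):
    """Stack per-frame features with their differences of several temporal orders."""
    def __init__(self, orders=(1, 2, 3)):
        super().__init__()
        self.orders = orders

    def forward(self, feats):                  # feats: (B, T, C)
        diffs = [feats]
        for k in self.orders:
            # k-th order difference, zero-padded on the time axis to keep length T
            d = feats[:, k:] - feats[:, :-k]
            d = nn.functional.pad(d, (0, 0, 0, k))
            diffs.append(d)
        return torch.cat(diffs, dim=-1)        # (B, T, C * (1 + len(orders)))

class MultiExitBoundaryHead(nn.Module):
    """A chain of small temporal-conv subnets, each with its own boundary exit."""
    def __init__(self, in_dim, hid=128, num_exits=3, tau=0.9):
        super().__init__()
        self.tau = tau                         # assumed early-exit confidence threshold
        self.blocks, self.exits = nn.ModuleList(), nn.ModuleList()
        dim = in_dim
        for _ in range(num_exits):
            self.blocks.append(nn.Sequential(
                nn.Conv1d(dim, hid, 3, padding=1), nn.ReLU()))
            self.exits.append(nn.Conv1d(hid, 1, 1))   # per-frame boundary logit
            dim = hid

    def forward(self, x):                      # x: (B, T, D)
        x = x.transpose(1, 2)                  # Conv1d expects (B, D, T)
        probs, done = None, None
        for block, exit_head in zip(self.blocks, self.exits):
            x = block(x)
            p = torch.sigmoid(exit_head(x)).squeeze(1)   # (B, T)
            if probs is None:
                probs = p
                done = torch.zeros_like(p, dtype=torch.bool)
            else:
                probs = torch.where(done, probs, p)      # settled frames keep early scores
            # frames whose score is already confidently high or low stop updating
            done = done | (probs > self.tau) | (probs < 1 - self.tau)
            if done.all():                               # every snippet settled: exit early
                break
        return probs

if __name__ == "__main__":
    feats = torch.randn(2, 64, 256)            # e.g. 64 frames of 256-d backbone features
    x = MultiOrderDifference()(feats)
    head = MultiExitBoundaryHead(in_dim=x.shape[-1])
    print(head(x).shape)                       # torch.Size([2, 64]) boundary probabilities
```

Under these assumptions, the per-snippet exit rule mirrors the abstract's intuition: frames whose boundary score is already decisive settle at an early, cheap subnet, while ambiguous snippets flow on to deeper ones.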
Acknowledgement
This work was supported in part by the National Science and Technology Major Project under Grant 2022ZD0115803, the Natural Science Basic Research Plan in Shaanxi Province of China under Grant 2023-JC-JQ-51, and the National Natural Science Foundation of China under Grant 62206215.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zheng, Z., He, L., Yang, L., Li, F. (2025). Fine-Grained Dynamic Network for Generic Event Boundary Detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15101. Springer, Cham. https://doi.org/10.1007/978-3-031-72775-7_7
Print ISBN: 978-3-031-72774-0
Online ISBN: 978-3-031-72775-7
eBook Packages: Computer Science, Computer Science (R0)