Abstract
Generic event boundary detection (GEBD) aims to pinpoint the event boundaries naturally perceived by humans and plays a crucial role in understanding long-form videos. Given the diverse nature of generic boundaries, which span different video appearances, objects, and actions, this task remains challenging. Existing methods usually detect all boundaries with the same protocol, regardless of their distinctive characteristics and detection difficulties, resulting in suboptimal performance. Intuitively, a more reasonable approach is to detect boundaries adaptively, taking their specific properties into account. In light of this, we propose DyBDet, a novel dynamic pipeline for generic event boundary detection. By introducing a multi-exit network architecture, DyBDet automatically learns to allocate subnets to different video snippets, enabling fine-grained detection of various boundaries. In addition, a multi-order difference detector is proposed to ensure that generic boundaries are effectively identified and adaptively processed. Extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets demonstrate that the dynamic strategy significantly benefits GEBD, yielding clear improvements in both performance and efficiency over the current state-of-the-art. The code is available at https://github.com/Ziwei-Zheng/DyBDet.
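To make the two mechanisms named in the abstract concrete, the sketch below illustrates how multi-order temporal differences and a multi-exit boundary head could fit together. It is a minimal, hypothetical rendering of the abstract's description, not the authors' DyBDet implementation (see the repository linked above); the module names, feature sizes, difference orders, and the confidence-threshold exit rule are all illustrative assumptions.

```python
# A minimal, hypothetical sketch of the two ideas named in the abstract:
# (1) multi-order temporal differences as boundary evidence, and
# (2) a multi-exit head that lets "easy" snippets settle at an early subnet.
# This is NOT the released DyBDet code; all names and sizes are assumptions.
import torch
import torch.nn as nn

class MultiOrderDifference(nn.Module):
    """Stack per-frame features with their differences of several temporal orders."""
    def __init__(self, orders=(1, 2, 3)):
        super().__init__()
        self.orders = orders

    def forward(self, feats):                  # feats: (B, T, C)
        diffs = [feats]
        for k in self.orders:
            # k-th order difference, zero-padded on the time axis to keep length T
            d = feats[:, k:] - feats[:, :-k]
            d = nn.functional.pad(d, (0, 0, 0, k))
            diffs.append(d)
        return torch.cat(diffs, dim=-1)        # (B, T, C * (1 + len(orders)))

class MultiExitBoundaryHead(nn.Module):
    """A chain of small temporal-conv subnets, each with its own boundary exit."""
    def __init__(self, in_dim, hid=128, num_exits=3, tau=0.9):
        super().__init__()
        self.tau = tau                         # assumed early-exit confidence threshold
        self.blocks, self.exits = nn.ModuleList(), nn.ModuleList()
        dim = in_dim
        for _ in range(num_exits):
            self.blocks.append(nn.Sequential(
                nn.Conv1d(dim, hid, 3, padding=1), nn.ReLU()))
            self.exits.append(nn.Conv1d(hid, 1, 1))   # per-frame boundary logit
            dim = hid

    def forward(self, x):                      # x: (B, T, D)
        x = x.transpose(1, 2)                  # Conv1d expects (B, D, T)
        probs, done = None, None
        for block, exit_head in zip(self.blocks, self.exits):
            x = block(x)
            p = torch.sigmoid(exit_head(x)).squeeze(1)   # (B, T)
            if probs is None:
                probs = p
                done = torch.zeros_like(p, dtype=torch.bool)
            else:
                probs = torch.where(done, probs, p)      # settled frames keep early scores
            # frames whose score is already confidently high or low stop updating
            done = done | (probs > self.tau) | (probs < 1 - self.tau)
            if done.all():                               # every snippet settled: exit early
                break
        return probs

if __name__ == "__main__":
    feats = torch.randn(2, 64, 256)            # e.g. 64 frames of 256-d backbone features
    x = MultiOrderDifference()(feats)
    head = MultiExitBoundaryHead(in_dim=x.shape[-1])
    print(head(x).shape)                       # torch.Size([2, 64]) boundary probabilities
```

Under these assumptions, the per-snippet exit rule mirrors the abstract's intuition: frames whose boundary score is already decisive settle at an early, cheap subnet, while ambiguous snippets flow on to deeper ones.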
Acknowledgement
This work was supported in part by the National Science and Technology Major Project under Grant 2022ZD0115803, the Natural Science Basic Research Plan in Shaanxi Province of China under Grant 2023-JC-JQ-51, and the National Natural Science Foundation of China under Grant 62206215.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zheng, Z., He, L., Yang, L., Li, F. (2025). Fine-Grained Dynamic Network for Generic Event Boundary Detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15101. Springer, Cham. https://doi.org/10.1007/978-3-031-72775-7_7
Print ISBN: 978-3-031-72774-0
Online ISBN: 978-3-031-72775-7
eBook Packages: Computer Science, Computer Science (R0)