Abstract
Temporal sentence grounding aims to localize moments relevant to a language description. Recently, DETR-like approaches have achieved notable progress by predicting the center and length of a target moment. However, they suffer from center misalignment caused by the inherent ambiguity of moment centers, leading to inaccurate predictions. To remedy this problem, we propose a novel boundary-oriented moment formulation. In our paradigm, the model no longer needs to find the precise center; it suffices to predict any anchor point within the target interval, from which the boundaries are directly estimated. Based on this idea, we design a boundary-aligned moment detection transformer equipped with a dual-pathway decoding process. Specifically, it refines the anchor and boundaries in parallel pathways using global and boundary-focused attention, respectively. This separate design allows the model to focus on the desirable regions, enabling precise refinement of moment predictions. Further, we propose a quality-based ranking method, ensuring that proposals with high localization quality are prioritized over incomplete ones. Experiments on three benchmarks validate the effectiveness of the proposed methods. The code is available here.
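To make the boundary-oriented formulation concrete, the following is a minimal sketch of how an anchor point plus two boundary offsets can be decoded into a (start, end) moment and how proposals could be ordered by a predicted localization quality. All names (anchors, offsets, quality) and the normalized-coordinate assumption are illustrative; this is not the official BAM-DETR implementation.

```python
# Minimal sketch of the boundary-oriented moment formulation (illustrative only;
# names and conventions are assumptions, not the official BAM-DETR code).
import torch


def decode_moments(anchors: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    """Convert anchor points and boundary offsets into (start, end) moments.

    anchors: (N,) anchor points in normalized time [0, 1]; any point inside the
             target interval is a valid anchor, so no precise center is required.
    offsets: (N, 2) non-negative distances from the anchor to the start and end
             boundaries, also in normalized time.
    Returns: (N, 2) moments as (start, end), clamped to the valid range.
    """
    start = (anchors - offsets[:, 0]).clamp(min=0.0)
    end = (anchors + offsets[:, 1]).clamp(max=1.0)
    return torch.stack([start, end], dim=-1)


def rank_proposals(moments: torch.Tensor, quality: torch.Tensor) -> torch.Tensor:
    """Order proposals by a predicted localization quality (e.g., an IoU-like
    score), so well-localized moments are ranked above incomplete ones."""
    order = quality.argsort(descending=True)
    return moments[order]


# Example: two proposals anchored at different points inside the same interval;
# both decode to the moment [0.30, 0.65], illustrating that the anchor need not
# be the center.
anchors = torch.tensor([0.40, 0.55])
offsets = torch.tensor([[0.10, 0.25], [0.25, 0.10]])
quality = torch.tensor([0.8, 0.6])
print(rank_proposals(decode_moments(anchors, offsets), quality))
```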
Acknowledgements
This project was supported by the National Research Foundation of Korea grant funded by the Korea government (MSIT) (No. 2022R1A2B5B02001467; RS-2024-00346364).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lee, P., Byun, H. (2025). BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15060. Springer, Cham. https://doi.org/10.1007/978-3-031-72627-9_13
DOI: https://doi.org/10.1007/978-3-031-72627-9_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72626-2
Online ISBN: 978-3-031-72627-9
eBook Packages: Computer Science, Computer Science (R0)