Abstract
Temporal sentence grounding aims to localize moments relevant to a language description. Recently, DETR-like approaches have achieved notable progress by predicting the center and length of a target moment. However, they suffer from center misalignment caused by the inherent ambiguity of moment centers, leading to inaccurate predictions. To remedy this problem, we propose a novel boundary-oriented moment formulation. In our paradigm, the model no longer needs to find the precise center; it suffices to predict any anchor point within the target interval, from which the boundaries are directly estimated. Based on this idea, we design a boundary-aligned moment detection transformer equipped with a dual-pathway decoding process. Specifically, it refines the anchor and boundaries in parallel pathways using global and boundary-focused attention, respectively. This separate design allows the model to focus on the desirable regions, enabling precise refinement of moment predictions. Further, we propose a quality-based ranking method, ensuring that proposals with high localization quality are prioritized over incomplete ones. Experiments on three benchmarks validate the effectiveness of the proposed methods. The code is available here.
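To make the boundary-oriented formulation concrete, the following is a minimal sketch of how an anchor point plus two boundary offsets can be decoded into a (start, end) moment and how proposals could be ordered by a predicted localization quality. All names (anchors, offsets, quality) and the normalized-coordinate assumption are illustrative; this is not the official BAM-DETR implementation.

```python
# Minimal sketch of the boundary-oriented moment formulation (illustrative only;
# names and conventions are assumptions, not the official BAM-DETR code).
import torch


def decode_moments(anchors: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    """Convert anchor points and boundary offsets into (start, end) moments.

    anchors: (N,) anchor points in normalized time [0, 1]; any point inside the
             target interval is a valid anchor, so no precise center is required.
    offsets: (N, 2) non-negative distances from the anchor to the start and end
             boundaries, also in normalized time.
    Returns: (N, 2) moments as (start, end), clamped to the valid range.
    """
    start = (anchors - offsets[:, 0]).clamp(min=0.0)
    end = (anchors + offsets[:, 1]).clamp(max=1.0)
    return torch.stack([start, end], dim=-1)


def rank_proposals(moments: torch.Tensor, quality: torch.Tensor) -> torch.Tensor:
    """Order proposals by a predicted localization quality (e.g., an IoU-like
    score), so well-localized moments are ranked above incomplete ones."""
    order = quality.argsort(descending=True)
    return moments[order]


# Example: two proposals anchored at different points inside the same interval;
# both decode to the moment [0.30, 0.65], illustrating that the anchor need not
# be the center.
anchors = torch.tensor([0.40, 0.55])
offsets = torch.tensor([[0.10, 0.25], [0.25, 0.10]])
quality = torch.tensor([0.8, 0.6])
print(rank_proposals(decode_moments(anchors, offsets), quality))
```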
Acknowledgements
This project was supported by the National Research Foundation of Korea grant funded by the Korea government (MSIT) (No. 2022R1A2B5B02001467; RS-2024-00346364).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Lee, P., Byun, H. (2025). BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15060. Springer, Cham. https://doi.org/10.1007/978-3-031-72627-9_13
DOI: https://doi.org/10.1007/978-3-031-72627-9_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72626-2
Online ISBN: 978-3-031-72627-9
eBook Packages: Computer Science, Computer Science (R0)