BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15060)


Abstract

Temporal sentence grounding aims to localize moments relevant to a language description. Recently, DETR-like approaches have achieved notable progress by predicting the center and length of a target moment. However, they suffer from center misalignment caused by the inherent ambiguity of moment centers, which leads to inaccurate predictions. To remedy this problem, we propose a novel boundary-oriented moment formulation. In our paradigm, the model no longer needs to find the precise center; instead, it suffices to predict any anchor point within the target moment, from which the boundaries are directly estimated. Based on this idea, we design a boundary-aligned moment detection transformer equipped with a dual-pathway decoding process. Specifically, it refines the anchor and the boundaries in parallel pathways using global and boundary-focused attention, respectively. This separate design allows the model to focus on desirable regions, enabling precise refinement of moment predictions. Further, we propose a quality-based ranking method that ensures proposals with high localization quality are prioritized over incomplete ones. Experiments on three benchmarks validate the effectiveness of the proposed methods. The code is available here.
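To make the boundary-oriented formulation concrete, the sketch below contrasts the conventional center-and-length parameterization used by prior DETR-like grounders with the anchor-plus-boundary-offsets scheme described above. It is a minimal illustration of the idea stated in the abstract, not the authors' implementation: the function names, tensor shapes, and the use of PyTorch are our own assumptions.

import torch

def decode_center_length(center, length):
    # Conventional DETR-style parameterization: each query predicts the
    # normalized center and length of a moment, so an inaccurate center
    # shifts the entire predicted interval.
    start = center - 0.5 * length
    end = center + 0.5 * length
    return torch.stack([start, end], dim=-1).clamp(0.0, 1.0)

def decode_anchor_boundaries(anchor, dist_to_start, dist_to_end):
    # Boundary-oriented parameterization (hypothetical sketch): the anchor
    # only needs to fall somewhere inside the target moment, and the start
    # and end boundaries are regressed directly as offsets from it.
    start = anchor - dist_to_start
    end = anchor + dist_to_end
    return torch.stack([start, end], dim=-1).clamp(0.0, 1.0)

# Toy example on a video timeline normalized to [0, 1].
anchor = torch.tensor([0.42])
pred = decode_anchor_boundaries(anchor,
                                torch.tensor([0.12]),   # offset to the start
                                torch.tensor([0.23]))   # offset to the end
print(pred)  # tensor([[0.3000, 0.6500]])

Under this scheme the model need not pinpoint an ambiguous center: any point inside the moment can serve as the anchor, with the offsets carrying the boundary information, which is what allows the anchor and the two boundaries to be refined separately, as in the dual-pathway decoder described in the abstract.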



Acknowledgements

This project was supported by the National Research Foundation of Korea grant funded by the Korea government (MSIT) (No. 2022R1A2B5B02001467; RS-2024-00346364).

Author information

Corresponding author

Correspondence to Pilhyeon Lee.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 6569 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lee, P., Byun, H. (2025). BAM-DETR: Boundary-Aligned Moment Detection Transformer for Temporal Sentence Grounding in Videos. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15060. Springer, Cham. https://doi.org/10.1007/978-3-031-72627-9_13

  • DOI: https://doi.org/10.1007/978-3-031-72627-9_13

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72626-2

  • Online ISBN: 978-3-031-72627-9

  • eBook Packages: Computer Science, Computer Science (R0)
