
Time-Frequency Mutual Learning for Moment Retrieval and Highlight Detection

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15035)


Abstract

Moment Retrieval and Highlight Detection (MR/HD) aims to concurrently retrieve relevant moments and predict clip-wise saliency scores according to a given textual query. Previous MR/HD works have overlooked explicit modeling of the static-dynamic visual information described by the language query, which can lead to inaccurate predictions, especially when the queried event involves both static appearances and dynamic motions. In this work, we consider learning static interaction and dynamic reasoning from the time domain and the frequency domain, respectively, and propose a novel Time-Frequency Mutual Learning framework (TFML) consisting of a time-domain branch, a frequency-domain branch, and a time-frequency aggregation branch. The time-domain branch learns to attend to the static visual information related to the textual query. In the frequency-domain branch, we introduce the Short-Time Fourier Transform (STFT) for dynamic modeling by attending to the frequency contents within varied segments. The time-frequency aggregation branch integrates the information from these two branches. To promote the mutual complementation of time-domain and frequency-domain information, we further employ a mutual learning strategy in a concise and effective two-way loop, which enables the branches to collaboratively reason and achieve time-frequency consistent predictions. Extensive experiments on QVHighlights and TVSum demonstrate the effectiveness of our proposed framework compared with state-of-the-art methods.
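The paper's frequency-domain branch applies the STFT inside a learned network, so the exact formulation is not reproduced here. As a rough, standalone illustration of "attending to the frequency contents within varied segments", the following NumPy sketch slides a tapered window over a sequence of clip features and takes the magnitude spectrum of each segment (the function name, window length, and hop size are illustrative choices, not taken from the paper):

```python
import numpy as np

def stft_frequency_contents(feats, win=8, hop=4):
    """Magnitude spectra of sliding windows over clip-level features.

    feats: (T, D) array, one D-dim feature per video clip.
    Returns a (num_segments, win // 2 + 1, D) array: for each windowed
    segment, a Hann-tapered real FFT along the time axis, i.e. a
    short-time Fourier transform capturing per-segment frequency content.
    """
    T, D = feats.shape
    window = np.hanning(win)[:, None]             # taper each segment
    starts = range(0, T - win + 1, hop)           # overlapping segments
    segs = np.stack([feats[s:s + win] * window for s in starts])
    return np.abs(np.fft.rfft(segs, axis=1))      # spectra per segment

feats = np.random.default_rng(0).standard_normal((32, 4))
spec = stft_frequency_contents(feats)
print(spec.shape)  # (7, 5, 4): 7 segments, 5 frequency bins, 4 channels
```

Low-frequency bins of such spectra reflect slowly varying (static) appearance, while higher bins respond to rapid feature changes, which is the intuition behind using frequency content as a cue for dynamic motion.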




Acknowledgement

This work was supported partially by the NSFC (U21A20471, U22A2095, 62076260, 61772570), Guangdong Natural Science Funds Project (2023B1515040025), Guangdong NSF for Distinguished Young Scholar (2022B1515020009), and Guangzhou Science and Technology Plan Project (202201011134).

Author information

Correspondence to Jian-Fang Hu.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Zhong, Y., Liang, T., Hu, J.F. (2025). Time-Frequency Mutual Learning for Moment Retrieval and Highlight Detection. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15035. Springer, Singapore. https://doi.org/10.1007/978-981-97-8620-6_3


  • DOI: https://doi.org/10.1007/978-981-97-8620-6_3


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-8619-0

  • Online ISBN: 978-981-97-8620-6

  • eBook Packages: Computer Science, Computer Science (R0)
