Abstract
Moment Retrieval and Highlight Detection (MR/HD) aims to concurrently retrieve relevant moments and predict clip-wise saliency scores according to a given textual query. Previous MR/HD works have overlooked explicit modeling of the static and dynamic visual information described by the language query, which can lead to inaccurate predictions, especially when the queried event involves both static appearances and dynamic motions. In this work, we consider learning static interaction and dynamic reasoning from the time domain and the frequency domain, respectively, and propose a novel Time-Frequency Mutual Learning framework (TFML), which mainly consists of a time-domain branch, a frequency-domain branch, and a time-frequency aggregation branch. The time-domain branch learns to attend to the static visual information related to the textual query. In the frequency-domain branch, we introduce the Short-Time Fourier Transform (STFT) for dynamic modeling by attending to the frequency contents within varied segments. The time-frequency aggregation branch integrates the information from these two branches. To promote the mutual complementation of time-domain and frequency-domain information, we further employ a mutual learning strategy in a concise yet effective two-way loop, which enables the branches to collaboratively reason and achieve time-frequency consistent predictions. Extensive experiments on QVHighlights and TVSum demonstrate the effectiveness of our proposed framework compared with state-of-the-art methods.
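To make the frequency-domain idea concrete, the following is a minimal PyTorch sketch of an STFT-style frequency branch, not the authors' implementation: it slides a short window along the sequence of clip features, takes the magnitude spectrum of each window with a real FFT (equivalent to a rectangular-window STFT), and learns a weighting over frequency bins as a stand-in for "attending to the frequency contents within varied segments". The module name, window size, and projections are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencyBranch(nn.Module):
    """Illustrative STFT-style frequency branch (not the TFML code).

    For each time step, a short sliding window of clip features is
    transformed with a real FFT; the local magnitude spectrum acts as
    a proxy for dynamic (motion) content, and a learned weighting over
    frequency bins stands in for frequency attention.
    """

    def __init__(self, dim: int, win: int = 8):
        super().__init__()
        self.win = win
        n_freq = win // 2 + 1                  # rfft bins for a window of length `win`
        self.freq_attn = nn.Linear(n_freq, 1)  # learned weighting over frequency bins
        self.out = nn.Linear(dim, dim)         # project back to the model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) clip features
        pad = self.win // 2
        xc = x.transpose(1, 2)                               # (B, D, T)
        xc = F.pad(xc, (pad, self.win - 1 - pad), mode="replicate")
        frames = xc.unfold(2, self.win, 1)                   # (B, D, T, win): one window per step
        spec = torch.fft.rfft(frames, dim=-1).abs()          # (B, D, T, n_freq): local spectrum
        y = self.freq_attn(spec).squeeze(-1)                 # (B, D, T): weighted frequency content
        return self.out(y.transpose(1, 2))                   # (B, T, D)
```

Under this reading, the time-domain branch could be any query-conditioned attention over the raw clip features (e.g., a standard Transformer layer), with the aggregation branch fusing the two outputs.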
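Similarly, the two-way mutual learning loop can be pictured as symmetric KL-consistency terms between the branches' clip-wise saliency predictions, in the spirit of deep mutual learning (Zhang et al., CVPR 2018). This is a hedged sketch under assumed interfaces (each branch emitting saliency logits of shape (B, T)); the paper's actual loss composition may differ.

```python
import torch
import torch.nn.functional as F

def mutual_learning_loss(sal_time: torch.Tensor,
                         sal_freq: torch.Tensor,
                         sal_agg: torch.Tensor,
                         tau: float = 1.0) -> torch.Tensor:
    """Illustrative two-way mutual-learning consistency loss.

    Each pair of branches teaches the other: the "student" branch is
    pulled toward the "teacher" branch's detached prediction with a KL
    term, so the time- and frequency-domain branches converge to a
    time-frequency consistent saliency prediction. Inputs are assumed
    to be clip-wise saliency logits of shape (B, T).
    """
    def kl(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
        # KL between temperature-scaled saliency distributions over clips
        return F.kl_div(F.log_softmax(student / tau, dim=-1),
                        F.softmax(teacher.detach() / tau, dim=-1),
                        reduction="batchmean") * tau * tau

    # Two-way loop: each single-domain branch exchanges knowledge with
    # the aggregation branch in both directions.
    return (kl(sal_time, sal_agg) + kl(sal_agg, sal_time) +
            kl(sal_freq, sal_agg) + kl(sal_agg, sal_freq))
```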
Acknowledgement
This work was supported partially by the NSFC (U21A20471, U22A2095, 62076260, 61772570), Guangdong Natural Science Funds Project (2023B1515040025), Guangdong NSF for Distinguished Young Scholar (2022B1515020009), and Guangzhou Science and Technology Plan Project (202201011134).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhong, Y., Liang, T., Hu, JF. (2025). Time-Frequency Mutual Learning for Moment Retrieval and Highlight Detection. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15035. Springer, Singapore. https://doi.org/10.1007/978-981-97-8620-6_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8619-0
Online ISBN: 978-981-97-8620-6
eBook Packages: Computer Science, Computer Science (R0)