Abstract
Temporal sentence grounding in video (TSGV) aims to identify, within an untrimmed video, the temporal segment most relevant to a given natural language query, i.e., to retrieve the specific moment that best matches the query. Although existing methods have studied this problem extensively and made notable progress, they still suffer from heavy computation and imprecise grounding. Our method focuses on extracting better video and query features, performing cross-modal feature fusion more effectively, and localizing the target moment more accurately. We propose an efficient Cross-Modal Grounding Network (CMGN) that balances computational cost against localization accuracy. In the proposed architecture, a bidirectional Gated Recurrent Unit (GRU) captures local context information, from which we derive start and end boundary features for a richer video representation. A two-channel structure, divided into a start channel and an end channel, then models the temporal relationships among video segments that share common boundaries. To validate the effectiveness of our method, we conduct extensive experiments on two datasets.
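The abstract's encoding step, i.e. running a bidirectional GRU over the clip-level video features and concatenating the forward and backward hidden states into a contextual representation, can be sketched as follows. This is a hypothetical, minimal pure-Python illustration of the standard bidirectional GRU recurrence, not the authors' implementation; all dimensions, names, and initialization choices here are assumptions for the toy example.

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rand_mat(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def vadd(a, b):
    return [x + y for x, y in zip(a, b)]

class GRUCell:
    """One GRU cell: update gate z, reset gate r, candidate state h~."""
    def __init__(self, in_dim, hid_dim):
        self.hid = hid_dim
        self.Wz, self.Uz = rand_mat(hid_dim, in_dim), rand_mat(hid_dim, hid_dim)
        self.Wr, self.Ur = rand_mat(hid_dim, in_dim), rand_mat(hid_dim, hid_dim)
        self.Wh, self.Uh = rand_mat(hid_dim, in_dim), rand_mat(hid_dim, hid_dim)

    def step(self, x, h):
        z = [sigmoid(a) for a in vadd(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a) for a in vadd(matvec(self.Wr, x), matvec(self.Ur, h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        h_new = [math.tanh(a) for a in vadd(matvec(self.Wh, x), matvec(self.Uh, rh))]
        # convex combination of previous state and candidate state
        return [(1 - zi) * hi + zi * hni for zi, hi, hni in zip(z, h, h_new)]

def bigru(frames, fwd, bwd):
    """Run one GRU forward and one backward over the clip features,
    concatenating both hidden states at every time step."""
    T = len(frames)
    h, forward = [0.0] * fwd.hid, []
    for t in range(T):
        h = fwd.step(frames[t], h)
        forward.append(h)
    h, backward = [0.0] * bwd.hid, [None] * T
    for t in reversed(range(T)):
        h = bwd.step(frames[t], h)
        backward[t] = h
    return [forward[t] + backward[t] for t in range(T)]

# toy example: 6 clips, 4-d visual features, 3-d hidden state per direction
frames = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
ctx = bigru(frames, GRUCell(4, 3), GRUCell(4, 3))
print(len(ctx), len(ctx[0]))  # prints "6 6": 6 clips, 3 fwd + 3 bwd dims each
```

Because each time step sees both past (forward pass) and future (backward pass) clips, the concatenated feature carries the local context that the paper uses to derive start- and end-boundary representations.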
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grants 62072169 and 62172156, and by the Natural Science Foundation of Hunan Province under Grant 2021JJ30152.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Zhang, Q., Jiang, B., Zhang, B., Yang, C. (2024). CMGN: Cross-Modal Grounding Network for Temporal Sentence Retrieval in Video. In: Sun, Y., Lu, T., Wang, T., Fan, H., Liu, D., Du, B. (eds) Computer Supported Cooperative Work and Social Computing. ChineseCSCW 2023. Communications in Computer and Information Science, vol 2013. Springer, Singapore. https://doi.org/10.1007/978-981-99-9640-7_19
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-9639-1
Online ISBN: 978-981-99-9640-7
eBook Packages: Computer Science (R0)