
CMGN: Cross-Modal Grounding Network for Temporal Sentence Retrieval in Video

  • Conference paper
Computer Supported Cooperative Work and Social Computing (ChineseCSCW 2023)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 2013)


Abstract

Temporal sentence grounding in video (TSGV) aims to identify, within an untrimmed video, the temporal segment that best matches a given natural language query. Although existing methods have studied this task extensively and achieved notable results, they still suffer from heavy computational cost and insufficiently precise grounding. Our method focuses on obtaining better video and query features, performing cross-modal feature fusion more effectively, and localizing moments more accurately. We propose an efficient Cross-Modal Grounding Network (CMGN) that balances computational cost and localization accuracy. In our proposed structure, a bidirectional Gated Recurrent Unit (GRU) captures local context information, and start and end boundary features are extracted for a better video representation. A two-channel structure, divided into a start channel and an end channel, then captures the temporal relationships among video segments that share common boundaries. Extensive experiments on two datasets validate the effectiveness of our method.
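To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of such an architecture. It is not the authors' implementation: the class and layer names, the element-wise-product fusion, and all dimensions (1024-d clip features, 300-d word embeddings) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of a bidirectional-GRU encoder with
# separate start/end boundary features and a two-channel boundary scorer,
# assuming pre-extracted clip and word features.
import torch
import torch.nn as nn

class TwoChannelGroundingSketch(nn.Module):
    def __init__(self, clip_dim=1024, query_dim=300, hidden=256):
        super().__init__()
        # Bidirectional GRUs capture local temporal/sequential context.
        self.video_gru = nn.GRU(clip_dim, hidden, batch_first=True,
                                bidirectional=True)
        self.query_gru = nn.GRU(query_dim, hidden, batch_first=True,
                                bidirectional=True)
        self.query_proj = nn.Linear(2 * hidden, hidden)
        # Separate projections yield start- and end-boundary features.
        self.start_proj = nn.Linear(2 * hidden, hidden)
        self.end_proj = nn.Linear(2 * hidden, hidden)
        # Two channels score candidate start/end boundaries independently.
        self.start_score = nn.Linear(hidden, 1)
        self.end_score = nn.Linear(hidden, 1)

    def forward(self, clips, query):
        # clips: (B, T, clip_dim) pre-extracted video clip features
        # query: (B, L, query_dim) word embeddings of the sentence
        v, _ = self.video_gru(clips)               # (B, T, 2*hidden)
        _, h = self.query_gru(query)               # (2, B, hidden)
        q = self.query_proj(torch.cat([h[0], h[1]], dim=-1))  # (B, hidden)
        q = q.unsqueeze(1)                         # broadcast over time
        s = self.start_proj(v)                     # start-boundary features
        e = self.end_proj(v)                       # end-boundary features
        # Cross-modal fusion by element-wise product (an assumption here).
        start_logits = self.start_score(s * q).squeeze(-1)  # (B, T)
        end_logits = self.end_score(e * q).squeeze(-1)      # (B, T)
        return start_logits, end_logits

# Usage with dummy inputs, e.g. C3D/I3D clip features and GloVe embeddings:
model = TwoChannelGroundingSketch()
clips = torch.randn(2, 64, 1024)
words = torch.randn(2, 12, 300)
start_logits, end_logits = model(clips, words)    # each (2, 64)
```

Pairing start and end logits over all (start, end) positions with start < end would then rank candidate moments, letting segments that share a boundary reuse the same boundary score rather than being re-scored from scratch.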




Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 62072169 and 62172156, and by the Natural Science Foundation of Hunan Province under Grant 2021JJ30152.

Author information


Corresponding author

Correspondence to Bin Jiang.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Zhang, Q., Jiang, B., Zhang, B., Yang, C. (2024). CMGN: Cross-Modal Grounding Network for Temporal Sentence Retrieval in Video. In: Sun, Y., Lu, T., Wang, T., Fan, H., Liu, D., Du, B. (eds) Computer Supported Cooperative Work and Social Computing. ChineseCSCW 2023. Communications in Computer and Information Science, vol 2013. Springer, Singapore. https://doi.org/10.1007/978-981-99-9640-7_19


  • DOI: https://doi.org/10.1007/978-981-99-9640-7_19


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-9639-1

  • Online ISBN: 978-981-99-9640-7

  • eBook Packages: Computer Science, Computer Science (R0)
