
Spatiotemporal contrastive modeling for video moment retrieval


Abstract 

With the rapid development of social networks, video data has been growing explosively. As an important social medium, video, with its spatiotemporal characteristics, has attracted considerable attention in recommendation systems and video understanding. In this paper, we address the video moment retrieval (VMR) task, which localizes moments in a video according to a given textual query. Existing methods follow two pipelines: 1) proposal-free approaches mainly focus on modifying the multi-modal interaction strategy; 2) proposal-based methods are dedicated to designing advanced proposal generation paradigms. Recently, contrastive representation learning has been successfully applied to video understanding. From a new perspective, we propose a VMR framework, named spatiotemporal contrastive network (STCNet), that learns discriminative boundary features for video grounding via contrastive learning. Specifically, we propose a boundary matching sampling module for dense negative sample sampling. The contrastive learning refines the feature representations during training without any additional cost at inference. On three public datasets, Charades-STA, ActivityNet Captions, and TACoS, our proposed method achieves competitive performance.
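The contrastive objective is only described at a high level in the abstract. As a rough illustration, the sketch below shows one way an InfoNCE-style loss over densely sampled candidate moments could be written in PyTorch. The function name boundary_contrastive_loss, the tensor shapes, the positive mask, and the temperature value are illustrative assumptions, not the authors' released implementation.

  # Minimal sketch of an InfoNCE-style contrastive loss over candidate moment
  # features with densely sampled negatives. Shapes and names are assumptions.
  import torch
  import torch.nn.functional as F

  def boundary_contrastive_loss(query_feat, moment_feats, pos_mask, temperature=0.07):
      """Pull features of moments matching the query together and push
      densely sampled negative moments away.

      query_feat:   (B, D)    sentence-level query embeddings
      moment_feats: (B, N, D) features of N candidate moments per video
      pos_mask:     (B, N)    1 for candidates overlapping the ground truth, else 0
      """
      q = F.normalize(query_feat, dim=-1)                       # (B, D)
      m = F.normalize(moment_feats, dim=-1)                     # (B, N, D)
      logits = torch.einsum("bd,bnd->bn", q, m) / temperature   # cosine similarities

      log_prob = F.log_softmax(logits, dim=-1)                  # softmax over all candidates
      pos_count = pos_mask.sum(dim=-1).clamp(min=1)
      loss = -(log_prob * pos_mask).sum(dim=-1) / pos_count     # avg log-likelihood of positives
      return loss.mean()

  if __name__ == "__main__":
      B, N, D = 4, 128, 256
      loss = boundary_contrastive_loss(
          torch.randn(B, D), torch.randn(B, N, D),
          (torch.rand(B, N) > 0.9).float())
      print(loss.item())

Because such a loss only shapes the shared feature space during training, it can be dropped at test time without changing the grounding head, which is consistent with the abstract's claim of no additional inference cost.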





Acknowledgements

Not applicable

Funding

This research was supported by the National Natural Science Foundation of China (NSFC) under grants 61876058, 61725203, 62020106007, and U20A20183.

Author information

Contributions

Kun Li and Guoliang Chen designed the proposed method. Yi Wang and Kun Li wrote the main manuscript text. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Kun Li or Guoliang Chen.

Ethics declarations

Human and Animal Ethics

Not applicable

Competing interests

The authors declare no competing interests.

Ethics approval and consent to participate

Not applicable

Consent for Publication

Not applicable

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Spatiotemporal Data Management and Analytics for Recommendation. Guest Editors: Shuo Shang, Xiangliang Zhang and Panos Kalnis.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, Y., Li, K., Chen, G. et al. Spatiotemporal contrastive modeling for video moment retrieval. World Wide Web 26, 1525–1544 (2023). https://doi.org/10.1007/s11280-022-01105-3
