
Rethinking Weakly-Supervised Video Temporal Grounding From a Game Perspective

  • Conference paper

Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15103)

Abstract

This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally built on a moment-proposal-selection framework that scores pre-defined moment proposals via contrastive learning and reconstruction paradigms. Although they have achieved significant progress, we argue that current frameworks overlook two indispensable issues: 1) Coarse-grained cross-modal learning: previous methods only capture global video-level alignment with the query, failing to model the detailed consistency between video frames and query words that is needed to accurately ground moment boundaries. 2) Complex moment proposals: their performance relies heavily on the quality of the proposals, whose selection is also time-consuming and complicated. To this end, we make the first attempt to tackle this task from a novel game perspective, which effectively learns the uncertain relationship between each vision-language pair with diverse granularity and flexible combination for multi-level cross-modal interaction. Specifically, we model each video frame and query word as a game player within multivariate cooperative game theory to learn its contribution to the cross-modal similarity score. By quantifying the trend of frame-word cooperation within a coalition via the game-theoretic interaction, we can value all uncertain but possible correspondences between frames and words. Finally, instead of using moment proposals, we utilize the learned query-guided frame-wise scores for better moment localization. Experiments show that our method achieves superior performance on both the Charades-STA and ActivityNet Captions datasets.
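To make the game-theoretic idea concrete, the sketch below estimates a Banzhaf-style interaction between one video frame and one query word, treating the remaining frames and words as other players that may or may not join a coalition. This is a toy illustration, not the paper's method: the paper learns the similarity score end-to-end, whereas here a hypothetical `coalition_value` based on cosine similarity of mean-pooled features stands in for the learned cross-modal score, and the interaction is estimated by Monte-Carlo sampling of coalitions.

```python
import numpy as np

def coalition_value(frame_feats, word_feats, frame_set, word_set):
    """Value of a coalition: cosine similarity between the mean of the
    selected frame features and the mean of the selected word features.
    (Hypothetical stand-in for the learned cross-modal similarity score.)"""
    if not frame_set or not word_set:
        return 0.0
    v = frame_feats[list(frame_set)].mean(axis=0)
    q = word_feats[list(word_set)].mean(axis=0)
    return float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q) + 1e-8))

def banzhaf_interaction(frame_feats, word_feats, i, j, n_samples=200, seed=0):
    """Monte-Carlo estimate of the Banzhaf interaction between frame i and
    word j: E_S[ v(S + {i,j}) - v(S + {i}) - v(S + {j}) + v(S) ], where S
    is a random coalition of the remaining frame/word players, each joining
    independently with probability 1/2."""
    rng = np.random.default_rng(seed)
    other_frames = [k for k in range(len(frame_feats)) if k != i]
    other_words = [k for k in range(len(word_feats)) if k != j]

    def v(F, W):
        return coalition_value(frame_feats, word_feats, F, W)

    total = 0.0
    for _ in range(n_samples):
        Sf = {k for k in other_frames if rng.random() < 0.5}
        Sw = {k for k in other_words if rng.random() < 0.5}
        # A positive marginal effect means frame i and word j cooperate:
        # adding them together helps more than adding each alone.
        total += (v(Sf | {i}, Sw | {j}) - v(Sf | {i}, Sw)
                  - v(Sf, Sw | {j}) + v(Sf, Sw))
    return total / n_samples
```

A per-frame score for a given query could then be read off by aggregating a frame's interactions over all query words, which is the role the learned query-guided frame-wise scores play in the paper's proposal-free localization.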

X. Fang, Z. Xiong and W. Fang contributed equally to this work.



Author information


Correspondence to Pan Zhou or Daizong Liu.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Fang, X. et al. (2025). Rethinking Weakly-Supervised Video Temporal Grounding From a Game Perspective. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15103. Springer, Cham. https://doi.org/10.1007/978-3-031-72995-9_17


  • DOI: https://doi.org/10.1007/978-3-031-72995-9_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72994-2

  • Online ISBN: 978-3-031-72995-9

