Abstract
This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on the moment proposal selection framework that utilizes contrastive learning and reconstruction paradigm for scoring the pre-defined moment proposals. Although they have achieved significant progress, we argue that their current frameworks have overlooked two indispensable issues: 1) Coarse-grained cross-modal learning: previous methods solely capture the global video-level alignment with the query, failing to model the detailed consistency between video frames and query words for accurately grounding the moment boundaries. 2) Complex moment proposals: their performance severely relies on the quality of proposals, which are also time-consuming and complicated for selection. To this end, in this paper, we make the first attempt to tackle this task from a novel game perspective, which effectively learns the uncertain relationship between each vision-language pair with diverse granularity and flexible combination for multi-level cross-modal interaction. Specifically, we creatively model each video frame and query word as game players with multivariate cooperative game theory to learn their contribution to the cross-modal similarity score. By quantifying the trend of frame-word cooperation within a coalition via the game-theoretic interaction, we are able to value all uncertain but possible correspondence between frames and words. Finally, instead of using moment proposals, we utilize the learned query-guided frame-wise scores for better moment localization. Experiments show that our method achieves superior performance on both Charades-STA and ActivityNet Caption datasets.
X. Fang, Z. Xiong, W. Fang — contributed equally to this work.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Albarelli, A., Rodola, E., Torsello, A.: A game-theoretic approach to fine surface registration without initial motion estimation. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 430–437. IEEE (2010)
Anne Hendricks, L., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.: Localizing moments in video with natural language. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5803–5812 (2017)
Bachrach, Y., Markakis, E., Resnick, E., Procaccia, A.D., Rosenschein, J.S., Saberi, A.: Approximating power indices: theoretical and empirical analysis. Auton. Agent. Multi-Agent Syst. 20, 105–122 (2010)
Banzhaf, J.F., III.: Weighted voting doesn’t work: a mathematical analysis. Rutgers L. Rev. 19, 317 (1964)
Carreira, J., Zisserman, A.: Quo Vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
Chalkiadakis, G., Elkind, E., Wooldridge, M.: Computational aspects of cooperative game theory. Syn. Lect. Artif. Intell. Mach. Learn. 5(6), 1–168 (2011)
Chen, J., Luo, W., Zhang, W., Ma, L.: Explore inter-contrast between videos via composition for weakly supervised temporal sentence grounding. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 267–275 (2022)
Chen, J., Ma, L., Chen, X., Jie, Z., Luo, J.: Localizing natural language in videos. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 8175–8182 (2019)
Chen, L., et al.: Rethinking the bottom-up framework for query-based video localization. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 10551–10558 (2020)
Chen, Z., Ma, L., Luo, W., Tang, P., Wong, K.Y.K.: Look closer to ground better: weakly-supervised temporal grounding of sentence in video. arXiv preprint arXiv:2001.09308 (2020)
Datta, A., Sen, S., Zick, Y.: Algorithmic transparency via quantitative input influence: theory and experiments with learning systems. In: 2016 IEEE Symposium on Security and Privacy, pp. 598–617. IEEE (2016)
Deng, S., Wen, J., Liu, C., Yan, K., Xu, G., Xu, Y.: Projective incomplete multi-view clustering. IEEE Trans. Neural Netw. Learn. Syst. 35(8), 1–13 (2023). https://doi.org/10.1109/TNNLS.2023.3242473
Dong, J., et al.: Partially relevant video retrieval. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 246–257 (2022)
Dong, J., Li, X., Snoek, C.G.: Predicting visual features from text for image and video caption retrieval. IEEE Trans. Multimedia 20(12), 3377–3388 (2018)
Dong, J., et al.: Dual encoding for video retrieval by text. IEEE Trans. Pattern Anal. Mach. Intell. 44(8), 4065–4080 (2022)
Dong, J., et al.: From region to patch: attribute-aware foreground-background contrastive learning for fine-grained fashion retrieval. In: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1273–1282 (2023)
Dong, J., Sun, S., Liu, Z., Chen, S., Liu, B., Wang, X.: Hierarchical contrast for unsupervised skeleton-based action representation learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 525–533 (2023)
Dong, J., et al.: Reading-strategy inspired visual representation learning for text-to-video retrieval. IEEE Trans. Circuits Syst. Video Technol. 32(8), 5680–5694 (2022)
Donoser, M., Bischof, H.: Diffusion processes for retrieval revisited. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1320–1327 (2013)
Dowdall, J., Pavlidis, I.T., Tsiamyrtzis, P.: Coalitional tracking in facial infrared imaging and beyond. In: 2006 Conference on Computer Vision and Pattern Recognition Workshop, pp. 134–134. IEEE (2006)
Fang, X., Easwaran, A., Genest, B.: Uncertainty-guided appearance-motion association network for out-of-distribution action detection. In: IEEE International Conference on Multimedia Information Processing and Retrieval (2024)
Fang, X., et al.: Not all inputs are valid: towards open-set video moment retrieval using language. In: Proceedings of the 32th ACM International Conference on Multimedia (2024)
Fang, X., Hu, Y.: Double self-weighted multi-view clustering via adaptive view fusion. arXiv preprint arXiv:2011.10396 (2020)
Fang, X., Hu, Y., Zhou, P., Wu, D.: ANIMC: a soft approach for autoweighted noisy and incomplete multiview clustering. IEEE Trans. Artif. Intell. 3(2), 192–206 (2021)
Fang, X., Hu, Y., Zhou, P., Wu, D.O.: V3H: view variation and view heredity for incomplete multiview clustering. IEEE Trans. Artif. Intell. 1(3), 233–247 (2020)
Fang, X., Hu, Y., Zhou, P., Wu, D.O.: Unbalanced incomplete multi-view clustering via the scheme of view evolution: weak views are meat; strong views do eat. IEEE Trans. Emerg. Top. Comput. Intell. 6(4), 913–927 (2021)
Fang, X., et al.: Annotations are not all you need: a cross-modal knowledge transfer network for unsupervised temporal sentence grounding. In: Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 8721–8733 (2023)
Fang, X., et al.: Fewer steps, better performance: efficient cross-modal clip trimming for video moment retrieval using language. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 1735–1743 (2024)
Fang, X., Liu, D., Zhou, P., Hu, Y.: Multi-modal cross-domain alignment network for video moment retrieval. IEEE Trans. Multimedia 25, 7517–7532 (2022)
Fang, X., Liu, D., Zhou, P., Nan, G.: You can ground earlier than see: an effective and efficient pipeline for temporal sentence grounding in compressed videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2448–2460 (2023)
Fang, X., Liu, D., Zhou, P., Xu, Z., Li, R.: Hierarchical local-global transformer for temporal sentence grounding. IEEE Trans. Multimedia 26 (2023)
Gao, J., Sun, C., Yang, Z., Nevatia, R.: TALL: temporal activity localization via language query. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5267–5275 (2017)
Gao, M., Davis, L., Socher, R., Xiong, C.: WSLLN: weakly supervised natural language localization networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 1481–1487 (2019)
Grabisch, M., Roubens, M.: An axiomatic approach to the concept of interaction among players in cooperative games. Int. J. Game Theory 28, 547–565 (1999)
Guo, C., Liu, D., Zhou, P.: A hybird alignment loss for temporal moment localization with natural language. In: 2022 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2022)
Guo, D., Li, K., Hu, B., Zhang, Y., Wang, M.: Benchmarking micro-action recognition: dataset, method, and application. IEEE Trans. Circuits Syst. Video Technol. 34(7), 6238–6252 (2024)
Hendricks, L.A., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.: Localizing moments in video with temporal language. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1380–1390 (2018)
Huang, J., Liu, Y., Gong, S., Jin, H.: Cross-sentence temporal and semantic relations in video activity localisation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7199–7208 (2021)
Jiang, L., Wang, C., Ning, X., Yu, Z.: LTTPoint: a MLP-based point cloud classification method with local topology transformation module. In: 2023 7th Asian Conference on Artificial Intelligence Technology (ACAIT), pp. 783–789. IEEE (2023)
Jin, P., et al.: Video-text as game players: hierarchical Banzhaf interaction for cross-modal representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2472–2482 (2023)
Jin, S., Wang, S., Fang, F.: Game theoretical analysis on capacity configuration for microgrid based on multi-agent system. Int. J. Electr. Power Energy Syst. 125, 106485 (2021)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Krishna, R., Hata, K., Ren, F., Fei-Fei, L., Carlos Niebles, J.: Dense-captioning events in videos. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 706–715 (2017)
Leech, D.: Computation of Power Indices (2002)
Lehrer, E.: An axiomatization of the Banzhaf value. Int. J. Game Theory 17, 89–99 (1988)
Li, H., Cao, M., Cheng, X., Li, Y., Zhu, Z., Zou, Y.: G2L: semantically aligned and uniform video grounding via geodesic and game theory. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12032–12042 (2023)
Li, J., et al.: Fine-grained semantically aligned vision-language pre-training. Adv. Neural Inf. Process. Syst. 35, 7290–7303 (2022)
Lin, K.Q., et al.: UniVTG: towards unified video-language temporal grounding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2794–2804 (2023)
Lin, Z., Zhao, Z., Zhang, Z., Wang, Q., Liu, H.: Weakly-supervised video moment retrieval via semantic completion network. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 11539–11546 (2020)
Liu, C., Wen, J., Luo, X., Huang, C., Wu, Z., Xu, Y.: DICNet: deep instance-level contrastive network for double incomplete multi-view multi-label classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 8807–8815 (2023)
Liu, C., Wen, J., Luo, X., Xu, Y.: Incomplete multi-view multi-label learning via label-guided masked view and category-aware transformers. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 8816–8824 (2023)
Liu, C., Wen, J., Wu, Z., Luo, X., Huang, C., Xu, Y.: Information recovery-driven deep incomplete multiview clustering network. In: IEEE Transactions on Neural Networks and Learning Systems, pp. 1–11 (2023)
Liu, D., Fang, X., Hu, W., Zhou, P.: Exploring optical-flow-guided motion and detection-based appearance for temporal sentence grounding. IEEE Trans. Multimedia 25, 8539–8553 (2023)
Liu, D., et al.: Unsupervised domain adaptative temporal sentence localization with mutual information maximization. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 3567–3575 (2024)
Liu, D., Fang, X., Zhou, P., Di, X., Lu, W., Cheng, Y.: Hypotheses tree building for one-shot temporal sentence localization. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 1640–1648 (2023)
Liu, D., Hu, W.: Learning to focus on the foreground for temporal sentence grounding. In: Proceedings of the 29th International Conference on Computational Linguistics, pp. 5532–5541 (2022)
Liu, D., Hu, W.: Skimming, locating, then perusing: a human-like framework for natural language video localization. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 4536–4545 (2022)
Liu, D., et al.: Filling the information gap between video and query for language-driven moment retrieval. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 4190–4199 (2023)
Liu, D., et al.: Context-aware biaffine localizing network for temporal sentence grounding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11235–11244 (2021)
Liu, D., et al.: Transform-equivariant consistency learning for temporal sentence grounding. ACM Trans. Multimedia Comput. Commun. Appl. 20(4), 1–19 (2024)
Liu, D., et al.: Towards robust temporal activity localization learning with noisy labels. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 16630–16642 (2024)
Liu, D., Qu, X., Hu, W.: Reducing the vision and language bias for temporal sentence grounding. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 4092–4101 (2022)
Liu, D., Qu, X., Liu, X.Y., Dong, J., Zhou, P., Xu, Z.: Jointly cross-and self-modal graph attention network for query-based moment localization. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 4070–4078 (2020)
Liu, D., Zhou, P.: Jointly visual-and semantic-aware graph memory networks for temporal sentence localization in videos. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE (2023)
Liu, D., Zhou, P., Xu, Z., Wang, H., Li, R.: Few-shot temporal sentence grounding via memory-guided semantic learning. IEEE Trans. Circuits Syst. Video Technol. 33(5), 2491–2505 (2022)
Liu, D., et al.: Conditional video diffusion network for fine-grained temporal sentence grounding. IEEE Trans. Multimedia 26 (2023)
Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4768–4777 (2017)
Ma, M., Yoon, S., Kim, J., Lee, Y., Kang, S., Yoo, C.D.: VLANet: video-language alignment network for weakly-supervised video moment retrieval. In: Proceedings of the European Conference on Computer Vision, pp. 156–171 (2020)
Ma, W.C., Huang, D.A., Lee, N., Kitani, K.M.: Forecasting interactive dynamics of pedestrians with fictitious play. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 774–782 (2017)
Ma, Y., Liu, Y., Wang, L., Kang, W., Qiao, Y., Wang, Y.: Dual masked modeling for weakly-supervised temporal boundary discovery. IEEE Trans. Multimedia 26 (2023)
Matsui, Y., Matsui, T.: NP-completeness for calculating power indices of weighted majority games. Theoret. Comput. Sci. 263(1–2), 305–310 (2001)
Michalak, T.P., Aadithya, K.V., Szczepanski, P.L., Ravindran, B., Jennings, N.R.: Efficient computation of the Shapley value for game-theoretic network centrality. J. Artif. Intell. Res. 46, 607–650 (2013)
Mithun, N.C., Paul, S., Roy-Chowdhury, A.K.: Weakly supervised video moment retrieval from text queries. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11592–11601 (2019)
Mun, J., Cho, M., Han, B.: Local-global video-text interactions for temporal grounding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10810–10819 (2020)
Ning, E., Wang, C., Zhang, H., Ning, X., Tiwari, P.: Occluded person re-identification with deep learning: a survey and perspectives. Exp. Syst. Appl. 239, 122419 (2023)
Ning, E., Wang, Y., Wang, C., Zhang, H., Ning, X.: Enhancement, integration, expansion: activating representation of detailed features for occluded person re-identification. Neural Netw. 169, 532–541 (2024)
Ning, E., Zhang, C., Wang, C., Ning, X., Chen, H., Bai, X.: Pedestrian re-ID based on feature consistency and contrast enhancement. Displays 79, 102467 (2023)
Nowak, A.S.: On an axiomatization of the Banzhaf value without the additivity axiom. Int. J. Game Theory 26, 137–141 (1997)
Oord, A.V.D., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
Osborne, M.J., Rubinstein, A.: A Course in Game Theory. MIT Press (1994)
Patel, R., Garnelo, M., Gemp, I., Dyer, C., Bachrach, Y.: Game-theoretic vocabulary selection via the Shapley value and Banzhaf index. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2789–2798 (2021)
Pavan, M., Pelillo, M.: A new graph-theoretic approach to clustering and segmentation. In: 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003, Proceedings, vol. 1, p. I. IEEE (2003)
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543 (2014)
Rodola, E., Bronstein, A.M., Albarelli, A., Bergamasco, F., Torsello, A.: A game-theoretic approach to deformable shape matching. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 182–189. IEEE (2012)
Shapley, L.S., et al.: A Value for n-Person Games (1953)
Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Hollywood in homes: crowdsourcing data collection for activity understanding. In: European Conference on Computer Vision, pp. 510–526 (2016)
Song, Y., et al.: MARN: multi-level attentional reconstruction networks for weakly supervised video temporal grounding. Neurocomputing 554, 126625 (2023)
Song, Y., Wang, J., Ma, L., Yu, Z., Yu, J.: Weakly-supervised multi-level attentional reconstruction network for grounding textual queries in videos. arXiv preprint arXiv:2003.07048 (2020)
Tan, R., Xu, H., Saenko, K., Plummer, B.A.: LoGAN: latent graph co-attention network for weakly-supervised video moment retrieval. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 2083–2092 (2021)
Tang, K., et al.: RepPVConv: attentively fusing reparameterized voxel features for efficient 3D point cloud perception. Vis. Comput. 39(11), 5577–5588 (2023)
Tang, K., Lou, T., Peng, W., Chen, N., Shi, Y., Wang, W.: Effective single-step adversarial training with energy-based models. IEEE Trans. Emerg. Top. Comput. Intell. (2024). https://doi.org/10.1109/TETCI.2024.3378652
Tang, K., et al.: Decision fusion networks for image classification. IEEE Trans. Neural Netw. Learn. Syst. (2022). https://doi.org/10.1109/TNNLS.2022.3196129
Tang, K., et al.: Rethinking perturbation directions for imperceptible adversarial attacks on point clouds. IEEE Internet Things J. 10(6), 5158–5169 (2022)
Tang, K., et al.: Reparameterization head for efficient multi-input networks. In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6190–6194 (2024). https://doi.org/10.1109/ICASSP48485.2024.10447574
Tang, K., et al.: Reparameterization head for efficient multi-input networks. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6190–6194. IEEE (2024)
Torsello, A., Bulo, S.R., Pelillo, M.: Grouping with asymmetric affinities: a game-theoretic perspective. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 1, pp. 292–299. IEEE (2006)
Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497 (2015)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
Wang, C., Ning, X., Li, W., Bai, X., Gao, X.: 3D person re-identification based on global semantic guidance and local feature aggregation. IEEE Trans. Circuits Syst. Video Technol. 34(6) (2023)
Wang, C., Ning, X., Sun, L., Zhang, L., Li, W., Bai, X.: Learning discriminative features by covering local geometric space for point cloud analysis. IEEE Trans. Geosci. Remote Sens. 60, 1–15 (2022)
Wang, C., Wang, C., Li, W., Wang, H.: A brief survey on RGB-D semantic segmentation using deep learning. Displays 70, 102080 (2021)
Wang, C., Wang, H., Ning, X., Shengwei, T., Li, W.: 3D point cloud classification method based on dynamic coverage of local area. J. Softw. 34(4), 1962–1976 (2022)
Wang, J., Ma, L., Jiang, W.: Temporally grounding language queries in videos by contextual boundary-aware prediction. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12168–12175 (2020)
Wang, Y., Deng, J., Zhou, W., Li, H.: Weakly supervised temporal adjacent network for language grounding. IEEE Trans. Multimedia 24, 3276–3286 (2021)
Wang, Z., Chen, J., Jiang, Y.G.: Visual co-occurrence alignment learning for weakly-supervised video moment retrieval. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 1459–1468 (2021)
Wen, J., et al.: Deep double incomplete multi-view multi-label learning with incomplete labels and missing views. IEEE Trans. Neural Netw. Learn. Syst. 35(8), 1–13 (2023). https://doi.org/10.1109/TNNLS.2023.3260349
Wen, J., Zhang, Z., Li, Z.J.: A survey on incomplete multiview clustering. IEEE Trans. Syst. Man Cybern. Syst. 53(2), 1136–1149 (2023)
Winter, E.: The shapley value. In: Handbook of Game Theory with Economic Applications, vol. 3, pp. 2025–2054 (2002)
Wu, H., et al.: Atomic-action-based contrastive network for weakly supervised temporal language grounding. In: 2023 IEEE International Conference on Multimedia and Expo (ICME), pp. 1523–1528. IEEE (2023)
Xiong, Z., Liu, D., Zhou, P.: Gaussian kernel-based cross modal network for spatio-temporal video grounding. In: IEEE International Conference on Image Processing (ICIP), pp. 2481–2485 (2022)
Xiong, Z., Liu, D., Zhou, P., Zhu, J.: Tracking objects and activities with attention for temporal sentence grounding. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE (2023)
Yang, W., Zhang, T., Zhang, Y., Wu, F.: Local correspondence network for weakly supervised temporal sentence grounding. IEEE Trans. Image Process. 30, 3252–3262 (2021)
Yu, Z., Li, L., Xie, J., Wang, C., Li, W., Ning, X.: Pedestrian 3D shape understanding for person re-identification via multi-view learning. IEEE Trans. Circuits Syst. Video Technol. 34(7) (2024)
Yuan, Y., Ma, L., Wang, J., Liu, W., Zhu, W.: Semantic conditioned dynamic modulation for temporal sentence grounding in videos. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 536–546 (2019)
Zeng, R., Xu, H., Huang, W., Chen, P., Tan, M., Gan, C.: Dense regression network for video grounding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10287–10296 (2020)
Zhang, D., Dai, X., Wang, X., Wang, Y.F., Davis, L.S.: MAN: moment alignment network for natural language moment retrieval via iterative graph adjustment. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1247–1257 (2019)
Zhang, H., Sun, A., Jing, W., Zhou, J.T.: Span-based localizing network for natural language video localization. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6543–6554 (2020)
Zhang, H., Xie, Y., Zheng, L., Zhang, D., Zhang, Q.: Interpreting multivariate Shapley interactions in DNNs. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 10877–10886 (2021)
Zhang, H., et al.: Deep learning-based 3D point cloud classification: a systematic survey and outlook. Displays 79, 102456 (2023)
Zhang, H., Wang, C., Yu, L., Tian, S., Ning, X., Rodrigues, J.: PointGT: a method for point-cloud classification and segmentation based on local geometric transformation. IEEE Trans. Multimedia 26 (2024)
Zhang, S., Peng, H., Fu, J., Luo, J.: Learning 2D temporal adjacent networks for moment localization with natural language. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 12870–12877 (2020)
Zhang, Z., Lin, Z., Zhao, Z., Xiao, Z.: Cross-modal interaction networks for query-based moment retrieval in videos. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 655–664 (2019)
Zhang, Z., Zhao, Z., Lin, Z., He, X., et al.: Counterfactual contrastive learning for weakly-supervised vision-language grounding. Adv. Neural. Inf. Process. Syst. 33, 18123–18134 (2020)
Zheng, M., Huang, Y., Chen, Q., Liu, Y.: Weakly supervised video moment localization with contrastive negative sample mining. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 3517–3525 (2022)
Zheng, M., Huang, Y., Chen, Q., Peng, Y., Liu, Y.: Weakly supervised temporal sentence grounding with Gaussian-based contrastive proposal learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15555–15564 (2022)
Zheng, Q., et al.: Progressive localization networks for language-based moment localization. ACM Trans. Multimedia Comput. Commun. Appl. 19(2), 1–21 (2023)
Zhu, J., et al.: Rethinking the video sampling and reasoning strategies for temporal sentence grounding. arXiv preprint arXiv:2301.00514 (2023)
Author information
Authors and Affiliations
Corresponding authors
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fang, X. et al. (2025). Rethinking Weakly-Supervised Video Temporal Grounding From a Game Perspective. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15103. Springer, Cham. https://doi.org/10.1007/978-3-031-72995-9_17
Download citation
DOI: https://doi.org/10.1007/978-3-031-72995-9_17
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72994-2
Online ISBN: 978-3-031-72995-9
eBook Packages: Computer ScienceComputer Science (R0)