
Denoised Dual-Level Contrastive Network for Weakly-Supervised Temporal Sentence Grounding

  • Conference paper
Computational Visual Media (CVM 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14593)


Abstract

Temporal sentence grounding aims to localize the moment in a video that corresponds to a given natural language query. Because labeling temporal boundaries is costly, weakly-supervised methods have drawn increasing attention. Most weakly-supervised methods rely heavily on aligning the visual and textual modalities, neglecting to model the confusing snippets within a video and the non-discriminative snippets across different videos. Moreover, the errors caused by the sparsity of video-level labels are not well explored; they introduce noisy activations and undermine robustness in real-world applications. In this paper, we present a novel Denoised Dual-level Contrastive Network, DDCNet, to overcome these limitations. In particular, DDCNet is equipped with a dual-level contrastive loss that explicitly addresses incomplete predictions by simultaneously minimizing an intra-video and an inter-video loss. Moreover, a ranking weight strategy is presented to select high-quality positive and negative pairs during training. Finally, an effective pseudo-label denoising process is introduced to alleviate the noisy activations caused by video-level annotations, leading to more accurate predictions. Comprehensive experiments on two widely used benchmarks, Charades-STA and ActivityNet Captions, demonstrate the superiority of our method over existing weakly-supervised methods.
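The abstract describes the dual-level objective only at a high level. As a rough illustration, a minimal PyTorch sketch of an InfoNCE-style loss with both intra-video negatives (confusing snippets from the same video) and inter-video negatives (snippets sampled from other videos) might look as follows. All function names, shapes, and the weighting scheme here are assumptions for illustration, not the authors' implementation; the paper's ranking weight strategy and pseudo-label denoising step are omitted.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: pull the anchor toward its positive, push it from negatives.

    anchor:    (D,)   query-conditioned representation (hypothetical)
    positive:  (D,)   pooled feature of the predicted query-relevant snippets
    negatives: (K, D) snippet features used as negatives
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_logit = (anchor * positive).sum(-1, keepdim=True) / temperature  # (1,)
    neg_logits = negatives @ anchor / temperature                        # (K,)
    logits = torch.cat([pos_logit, neg_logits]).unsqueeze(0)             # (1, 1+K)
    # The positive pair sits at class index 0.
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))

def dual_level_loss(query_feat, pos_snippets, intra_negs, inter_negs, lam=1.0):
    """Hypothetical dual-level objective: the intra-video term discriminates
    against confusing snippets from the same video, the inter-video term
    against non-discriminative snippets from other videos in the batch."""
    l_intra = info_nce(query_feat, pos_snippets, intra_negs)
    l_inter = info_nce(query_feat, pos_snippets, inter_negs)
    return l_intra + lam * l_inter

# Toy usage with random 256-d features: 8 intra-, 32 inter-video negatives.
q, pos = torch.randn(256), torch.randn(256)
loss = dual_level_loss(q, pos, torch.randn(8, 256), torch.randn(32, 256))
print(loss.item())
```

In the paper, the positive and negative pairs feeding such terms would additionally be filtered by the ranking weight strategy before the loss is computed.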



Acknowledgement

This work was supported by the National Natural Science Foundation of China (NSFC) (Grant 62376265).

Author information

Corresponding author

Correspondence to Xiao-Yu Zhang.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Zhang, Y., Zhang, XY., Shi, H. (2024). Denoised Dual-Level Contrastive Network for Weakly-Supervised Temporal Sentence Grounding. In: Zhang, FL., Sharf, A. (eds) Computational Visual Media. CVM 2024. Lecture Notes in Computer Science, vol 14593. Springer, Singapore. https://doi.org/10.1007/978-981-97-2092-7_14


  • DOI: https://doi.org/10.1007/978-981-97-2092-7_14

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-2091-0

  • Online ISBN: 978-981-97-2092-7

  • eBook Packages: Computer Science, Computer Science (R0)
