Abstract
Temporal sentence grounding aims to localize the moment in a video that corresponds to a given natural language query. Because annotating temporal boundaries is labor-intensive, weakly-supervised methods have drawn increasing attention. Most weakly-supervised methods rely heavily on aligning the visual and textual modalities, while neglecting to model confusing snippets within a video and non-discriminative snippets across different videos. Moreover, the errors caused by the sparsity of video-level labels are not well explored; they introduce noisy activations and limit robustness in real-world applications. In this paper, we present a novel Denoised Dual-level Contrastive Network (DDCNet) to overcome these limitations. Specifically, DDCNet is equipped with a dual-level contrastive loss that explicitly addresses incomplete predictions by simultaneously minimizing an intra-video loss and an inter-video loss. In addition, a ranking weight strategy is introduced to select high-quality positive and negative pairs during training. Finally, an effective pseudo-label denoising process is introduced to alleviate the noisy activations caused by video-level annotations, leading to more accurate predictions. Comprehensive experiments on two widely used benchmarks, Charades-STA and ActivityNet Captions, demonstrate the superiority of our method over existing weakly-supervised methods.
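The paper's exact formulation is not reproduced on this page, but the dual-level objective sketched in the abstract can be illustrated with a minimal InfoNCE-style example. Everything below is a hypothetical sketch, not the authors' implementation: the function names (`info_nce`, `dual_level_loss`), the `pair_weights` tensor standing in for the ranking weight strategy, and all tensor shapes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Per-sample InfoNCE loss.
    anchor, positive: (B, D); negatives: (B, N, D). All L2-normalized."""
    pos_sim = torch.sum(anchor * positive, dim=-1, keepdim=True)   # (B, 1)
    neg_sim = torch.einsum('bd,bnd->bn', anchor, negatives)        # (B, N)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature    # (B, 1+N)
    # The positive pair always sits at index 0 of the logits.
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels, reduction='none')       # (B,)

def dual_level_loss(query, pos, intra_negs, inter_negs,
                    pair_weights=None, temperature=0.1):
    """Sum of an intra-video term (query vs. confusing snippets of the
    same video) and an inter-video term (query vs. snippets drawn from
    other videos). `pair_weights` optionally re-weights samples, as a
    stand-in for a ranking-based pair-selection strategy."""
    query = F.normalize(query, dim=-1)
    pos = F.normalize(pos, dim=-1)
    intra_negs = F.normalize(intra_negs, dim=-1)
    inter_negs = F.normalize(inter_negs, dim=-1)
    per_sample = (info_nce(query, pos, intra_negs, temperature)
                  + info_nce(query, pos, inter_negs, temperature))
    if pair_weights is not None:   # down-weight low-quality pairs
        per_sample = pair_weights * per_sample
    return per_sample.mean()
```

A toy invocation, with randomly generated features in place of real sentence and snippet encodings:

```python
B, D, N = 8, 256, 16
q = torch.randn(B, D)            # sentence (query) features
p = torch.randn(B, D)            # pooled features of the predicted moment
neg_in = torch.randn(B, N, D)    # confusing snippets from the same video
neg_out = torch.randn(B, N, D)   # snippets sampled from other videos
w = torch.rand(B)                # hypothetical ranking-based weights
loss = dual_level_loss(q, p, neg_in, neg_out, pair_weights=w)
```

The pseudo-label denoising step could likewise amount to something as simple as filtering low-confidence snippet activations before they supervise training, though the paper's actual procedure may differ.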
Acknowledgement
This work was supported by the National Natural Science Foundation of China (NSFC) (Grant 62376265).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhang, Y., Zhang, XY., Shi, H. (2024). Denoised Dual-Level Contrastive Network for Weakly-Supervised Temporal Sentence Grounding. In: Zhang, FL., Sharf, A. (eds) Computational Visual Media. CVM 2024. Lecture Notes in Computer Science, vol 14593. Springer, Singapore. https://doi.org/10.1007/978-981-97-2092-7_14
DOI: https://doi.org/10.1007/978-981-97-2092-7_14
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2091-0
Online ISBN: 978-981-97-2092-7
eBook Packages: Computer Science, Computer Science (R0)