
Video–text retrieval via multi-modal masked transformer and adaptive attribute-aware graph convolutional network

  • Regular Paper
  • Published in Multimedia Systems (2024)

Abstract

Despite significant advances in deep learning-based video–text retrieval, three challenges persist: aligning fine-grained semantic information between text and video, ensuring that the learned textual and video representations capture the primary semantics while remaining discriminative, and measuring the semantic similarity between different instances. To tackle these issues, we introduce an end-to-end video–text retrieval framework that exploits a Multi-Modal Masked Transformer and an Adaptive Attribute-Aware Graph Convolutional Network (M\(^3\)Trans-A\(^3\)GCN). Specifically, the features extracted from videos and texts are fed into M\(^3\)Trans, which jointly integrates the multi-modal content and masks irrelevant multi-modal context. Subsequently, a novel GCN with an adaptive correlation matrix (i.e., A\(^3\)GCN) is constructed to obtain discriminative video representations for video–text retrieval. To better measure the semantic similarity between video–text pairs during training, we propose a novel Text-semantic-guided Multi-Modal Cross-Entropy (TMCE) loss, in which the similarity between different video–text pairs within a batch is computed from the features of the corresponding texts rather than their instance labels. Comprehensive experiments on three benchmark datasets, MSR-VTT, MSVD and LSMDC, demonstrate the superiority of M\(^3\)Trans-A\(^3\)GCN over state-of-the-art video–text retrieval methods.
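To make the two ideas above concrete, the following Python/PyTorch sketch is a hypothetical illustration, not the authors' implementation: (a) a graph-convolution layer whose correlation matrix is predicted adaptively from the node features themselves, in the spirit of A\(^3\)GCN, and (b) a cross-entropy loss whose soft targets come from text–text similarities within a batch instead of one-hot instance labels, in the spirit of TMCE. All class names, dimensions, and the temperature value are assumptions.

```python
# Hypothetical sketch only: an adaptive-adjacency GCN layer and a
# text-semantic-guided cross-entropy loss, assumed shapes and names.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveGCNLayer(nn.Module):
    """Graph convolution whose adjacency (correlation matrix) is predicted
    from the node features rather than fixed in advance."""

    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Linear(dim, dim, bias=False)
        self.query = nn.Linear(dim, dim, bias=False)
        self.key = nn.Linear(dim, dim, bias=False)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (batch, n_nodes, dim)
        # Adaptive correlation matrix from pairwise node affinities.
        logits = self.query(nodes) @ self.key(nodes).transpose(-1, -2)
        adj = torch.softmax(logits / nodes.size(-1) ** 0.5, dim=-1)
        # Propagate features over the predicted graph.
        return F.leaky_relu(adj @ self.weight(nodes))


def tmce_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
              temperature: float = 0.05) -> torch.Tensor:
    """Text-semantic-guided cross-entropy: the target distribution for each
    pair is the softmax of text-text similarities within the batch,
    not a one-hot instance label."""
    v = F.normalize(video_emb, dim=-1)            # (batch, dim)
    t = F.normalize(text_emb, dim=-1)             # (batch, dim)
    logits_v2t = v @ t.t() / temperature          # video-to-text similarities
    logits_t2v = t @ v.t() / temperature          # text-to-video similarities
    with torch.no_grad():
        # Soft targets derived from text semantics, no gradient through them.
        targets = torch.softmax(t @ t.t() / temperature, dim=-1)
    loss_v2t = torch.sum(-targets * F.log_softmax(logits_v2t, dim=-1), dim=-1).mean()
    loss_t2v = torch.sum(-targets * F.log_softmax(logits_t2v, dim=-1), dim=-1).mean()
    return 0.5 * (loss_v2t + loss_t2v)
```

Replacing one-hot targets with text-derived soft targets lets semantically close captions share probability mass, so the loss no longer treats every non-matching pair in the batch as an equally hard negative.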


Data availability

The MSR-VTT dataset is available at https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-videodescription-dataset-for-bridging-video-and-language/. The MSVD dataset is available at https://www.cs.utexas.edu/users/ml/clamp/videoDescription/. The LSMDC dataset is available at https://github.com/yj-yu/lsmdc.


Acknowledgements

This work was supported in part by the University Synergy Innovation Program of Anhui Province (No. GXXT-2022-043, No. GXXT-2022-037), Anhui Provincial Key Research and Development Program (No. 2022a05020042), and National Natural Science Foundation of China (No. 61902104).

Author information

Authors and Affiliations

Authors

Contributions

GL: wrote the main manuscript text. YS: conceptualization and methodology. FN: experiments. All authors reviewed the manuscript.

Corresponding author

Correspondence to Yining Sun.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Communicated by B. Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Lv, G., Sun, Y. & Nian, F. Video–text retrieval via multi-modal masked transformer and adaptive attribute-aware graph convolutional network. Multimedia Systems 30, 35 (2024). https://doi.org/10.1007/s00530-023-01205-8

