
Video–text retrieval via multi-modal masked transformer and adaptive attribute-aware graph convolutional network

  • Regular Paper
  • Published in Multimedia Systems (2024)

Abstract

Despite significant advances in deep learning-based video–text retrieval, three challenges persist: aligning fine-grained semantic information between text and video, ensuring that the learned textual and video representations capture the primary semantics while remaining discriminative, and measuring the semantic similarity between different instances. To tackle these issues, we introduce an end-to-end video–text retrieval framework that exploits a Multi-Modal Masked Transformer and an Adaptive Attribute-Aware Graph Convolutional Network (M\(^3\)Trans-A\(^3\)GCN). Specifically, the features extracted from videos and texts are fed into M\(^3\)Trans, which jointly integrates the multi-modal content and masks irrelevant multi-modal context. Subsequently, a novel GCN with an adaptive correlation matrix (i.e., A\(^3\)GCN) is constructed to obtain discriminative video representations for video–text retrieval. To better measure the semantic similarity between video–text pairs during training, we propose a novel Text-semantic-guided Multi-Modal Cross-Entropy (TMCE) loss, in which the similarity between different video–text pairs within a batch is computed from the features of the corresponding texts rather than their instance labels. Comprehensive experiments on three benchmark datasets, MSR-VTT, MSVD and LSMDC, demonstrate the superiority of M\(^3\)Trans-A\(^3\)GCN over state-of-the-art video–text retrieval methods.
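To make the two ideas above concrete, the following Python/PyTorch sketch is a hypothetical illustration, not the authors' implementation: (a) a graph-convolution layer whose correlation matrix is predicted adaptively from the node features themselves, in the spirit of A\(^3\)GCN, and (b) a cross-entropy loss whose soft targets come from text–text similarities within a batch instead of one-hot instance labels, in the spirit of TMCE. All class names, dimensions, and the temperature value are assumptions.

```python
# Hypothetical sketch only: an adaptive-adjacency GCN layer and a
# text-semantic-guided cross-entropy loss, assumed shapes and names.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveGCNLayer(nn.Module):
    """Graph convolution whose adjacency (correlation matrix) is predicted
    from the node features rather than fixed in advance."""

    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Linear(dim, dim, bias=False)
        self.query = nn.Linear(dim, dim, bias=False)
        self.key = nn.Linear(dim, dim, bias=False)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (batch, n_nodes, dim)
        # Adaptive correlation matrix from pairwise node affinities.
        logits = self.query(nodes) @ self.key(nodes).transpose(-1, -2)
        adj = torch.softmax(logits / nodes.size(-1) ** 0.5, dim=-1)
        # Propagate features over the predicted graph.
        return F.leaky_relu(adj @ self.weight(nodes))


def tmce_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
              temperature: float = 0.05) -> torch.Tensor:
    """Text-semantic-guided cross-entropy: the target distribution for each
    pair is the softmax of text-text similarities within the batch,
    not a one-hot instance label."""
    v = F.normalize(video_emb, dim=-1)            # (batch, dim)
    t = F.normalize(text_emb, dim=-1)             # (batch, dim)
    logits_v2t = v @ t.t() / temperature          # video-to-text similarities
    logits_t2v = t @ v.t() / temperature          # text-to-video similarities
    with torch.no_grad():
        # Soft targets derived from text semantics, no gradient through them.
        targets = torch.softmax(t @ t.t() / temperature, dim=-1)
    loss_v2t = torch.sum(-targets * F.log_softmax(logits_v2t, dim=-1), dim=-1).mean()
    loss_t2v = torch.sum(-targets * F.log_softmax(logits_t2v, dim=-1), dim=-1).mean()
    return 0.5 * (loss_v2t + loss_t2v)
```

Replacing one-hot targets with text-derived soft targets lets semantically close captions share probability mass, so the loss no longer treats every non-matching pair in the batch as an equally hard negative.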


Data availability

The MSR-VTT dataset is available at https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-videodescription-dataset-for-bridging-video-and-language/. The MSVD dataset is available at https://www.cs.utexas.edu/users/ml/clamp/videoDescription/. The LSMDC dataset is available at https://github.com/yj-yu/lsmdc.


Acknowledgements

This work was supported in part by the University Synergy Innovation Program of Anhui Province (No. GXXT-2022-043, No. GXXT-2022-037), Anhui Provincial Key Research and Development Program (No. 2022a05020042), and National Natural Science Foundation of China (No. 61902104).

Author information

Authors and Affiliations

Authors

Contributions

GL: wrote the main manuscript text. YS: conceptualization and methodology. FN: experiments. All authors reviewed the manuscript.

Corresponding author

Correspondence to Yining Sun.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Communicated by B. Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Lv, G., Sun, Y. & Nian, F. Video–text retrieval via multi-modal masked transformer and adaptive attribute-aware graph convolutional network. Multimedia Systems 30, 35 (2024). https://doi.org/10.1007/s00530-023-01205-8

