Abstract
With the explosive growth of online video, video-text retrieval is receiving increasing attention. Most existing approaches map videos and texts into a shared latent vector space and measure their similarities there. However, when encoding videos, most methods ignore the interactions among frames. In addition, many works extract features of various aspects but lack a proper module to fuse them: they rely on simple concatenation, gating units, or average pooling, which may not fully exploit the interactions among different features. To address these problems, we propose the Multi-Interaction Model (MIM). Concretely, we design a multi-scale interaction module that exploits interactions among frames, and a fusion module that combines representations from different branches by encoding them into multiple subspaces and capturing the interactions among them. Furthermore, to learn more discriminative representations, we propose an improved loss function together with a new mining strategy that selectively retains informative pairs. Extensive experiments on the MSR-VTT, TGIF, and VATEX datasets demonstrate the effectiveness of the proposed video-text retrieval model.
This work was supported in part by the National Innovation 2030 Major S&T Project of China under Grant 2020AAA0104203, and in part by the Natural Science Foundation of China under Grant 62006007.
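For context on the improved loss and mining strategy mentioned in the abstract: retrieval models of this kind are typically trained with a bidirectional max-margin ranking loss over a batch of matched video-text pairs, with hardest-negative mining as in VSE++ [15]. The PyTorch sketch below shows that standard baseline, which MIM's loss builds upon; the function name, margin value, and batch-wise hardest-negative mining are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def bidirectional_hard_negative_loss(video_emb, text_emb, margin=0.2):
    """Bidirectional max-margin ranking loss with hardest-negative mining.

    Illustrative VSE++-style baseline [15]; the paper's improved loss and
    selective mining strategy refine this formulation.
    """
    # Cosine similarities between every video and every text in the batch;
    # the diagonal holds the matched (positive) pairs.
    v = F.normalize(video_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    sim = v @ t.T                 # (B, B) similarity matrix
    pos = sim.diag().view(-1, 1)  # (B, 1) positive-pair similarities

    # Margin violations: rows = video-to-text retrieval,
    # columns = text-to-video retrieval.
    cost_v2t = (margin + sim - pos).clamp(min=0)
    cost_t2v = (margin + sim - pos.T).clamp(min=0)

    # Exclude the positive pairs themselves.
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_v2t = cost_v2t.masked_fill(mask, 0.0)
    cost_t2v = cost_t2v.masked_fill(mask, 0.0)

    # Keep only the hardest (most violating) negative per query.
    return cost_v2t.max(dim=1).values.mean() + cost_t2v.max(dim=0).values.mean()

# Toy usage: 32 matched video/text embeddings of dimension 512.
loss = bidirectional_hard_negative_loss(torch.randn(32, 512), torch.randn(32, 512))

Under this baseline, pairs that do not violate the margin contribute no gradient, so mining already concentrates training on the most informative pairs in each batch, at the cost of sensitivity to noisy labels.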
References
Chen, S., Zhao, Y., Jin, Q., et al.: Fine-grained video-text retrieval with hierarchical graph reasoning. In: CVPR, pp. 10638–10647 (2020)
Song, Y., Soleymani, M.: Polysemous visual-semantic embedding for cross-modal retrieval. arXiv preprint arXiv:1906.04402 (2019)
Vaswani, A., et al.: Attention is all you need. In: NIPS, pp. 5998–6008 (2017)
Yu, Y., Kim, J., Kim, G.: A joint sequence fusion model for video question answering and retrieval. In: ECCV, pp. 471–487 (2018)
Liu, Y., et al.: Use what you have: video retrieval using representations from collaborative experts. In: BMVC (2019)
Miech, A., Zhukov, D., et al.: HowTo100M: learning a text-video embedding by watching hundred million narrated video clips. In: ICCV, pp. 2630–2640 (2019)
Lokoč, J., et al.: A W2VV++ case study with automated and interactive text-to-video retrieval. In: MM, pp. 2553–2561 (2020)
Faghri, F., et al.: VSE++: improving visual-semantic embeddings with hard negatives. In: BMVC (2018)
Dong, J., et al.: Dual encoding for video retrieval by text. TPAMI (2021)
Wei, J., et al.: Universal weighting metric learning for cross-modal matching. In: CVPR, pp. 13005–13014 (2020)
Devlin, J., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT, pp. 4171–4186 (2019)
Sun, Y., et al.: Circle loss: a unified perspective of pair similarity optimization. In: CVPR, pp. 6398–6407 (2020)
Feng, F., et al.: Cross-modal retrieval with correspondence autoencoder. In: MM, pp. 7–16 (2014)
Wu, D., et al.: Multi-dimensional attentive hierarchical graph pooling network for video-text retrieval. In: ICME (2021)
Xu, J., et al.: MSR-VTT: a large video description dataset for bridging video and language. In: CVPR, pp. 5288–5296 (2016)
Wang, X., et al.: VaTeX: a large-scale, high-quality multilingual dataset for video-and-language research. In: CVPR, pp. 4581–4591 (2019)
Li, Y., et al.: TGIF: a new dataset and benchmark on animated GIF description. In: CVPR, pp. 4641–4650 (2016)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: CVPR, pp. 6299–6308 (2017)
Xie, S., et al.: Aggregated residual transformations for deep neural networks. In: CVPR, pp. 1492–1500 (2017)
He, K., et al.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Li, J., Wu, D., Zhu, Y., Bai, Z. (2021). A Multi-interaction Model with Cross-Branch Feature Fusion for Video-Text Retrieval. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds) Neural Information Processing. ICONIP 2021. Communications in Computer and Information Science, vol 1517. Springer, Cham. https://doi.org/10.1007/978-3-030-92310-5_55
DOI: https://doi.org/10.1007/978-3-030-92310-5_55
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-92309-9
Online ISBN: 978-3-030-92310-5