
YFormer: A New Transformer Architecture for Video-Query Based Video Moment Retrieval

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2022)

Abstract

Video-query based video moment retrieval (VQ-VMR) aims to localize the segment in a long reference video that semantically corresponds to a short query video. This task faces the problem of matching features between long and short videos, which requires capturing long-term sequential dependencies. To address this problem, we develop a new transformer architecture, termed YFormer, for the VQ-VMR task. Specifically, a Spatio-temporal Feature Extractor based on self-attention is proposed to build a fine-grained semantic embedding for each frame, and a Semantic Relevance Matcher based on cross-attention is proposed to extract the cross-correlation between the query and reference videos. A token-based prediction head and a pooling-based prediction head are developed to localize the start and end boundaries of the retrieved moment. These prediction heads enable a complete end-to-end retrieval process. We reorganize the videos in the ActivityNet dataset to build a video moment retrieval benchmark and conduct extensive experiments on it. Our model achieves favorable performance compared with state-of-the-art methods.
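
The abstract only names the main components. As a rough illustration of how such a pipeline fits together, the sketch below wires a self-attention frame encoder, a cross-attention matcher, and a boundary prediction head in PyTorch. All module names, dimensions, layer counts, and the pooling-style head are assumptions made for this sketch; it does not reproduce the authors' published implementation.

```python
# Minimal VQ-VMR sketch in the spirit of the paper's description (not the authors' code).
import torch
import torch.nn as nn


class SpatioTemporalFeatureExtractor(nn.Module):
    """Self-attention over per-frame features to build contextualized frame embeddings."""

    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frames):            # frames: (B, T, dim)
        return self.encoder(frames)


class SemanticRelevanceMatcher(nn.Module):
    """Cross-attention: each reference frame attends to the query-video frames."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, reference, query):
        matched, _ = self.cross_attn(reference, query, query)
        return matched                    # (B, T_ref, dim)


class BoundaryHead(nn.Module):
    """Pooling-style head predicting per-frame start and end scores."""

    def __init__(self, dim=512):
        super().__init__()
        self.start = nn.Linear(dim, 1)
        self.end = nn.Linear(dim, 1)

    def forward(self, matched):
        return self.start(matched).squeeze(-1), self.end(matched).squeeze(-1)


if __name__ == "__main__":
    B, T_ref, T_q, D = 2, 128, 16, 512
    extractor = SpatioTemporalFeatureExtractor(D)
    matcher = SemanticRelevanceMatcher(D)
    head = BoundaryHead(D)

    ref = extractor(torch.randn(B, T_ref, D))   # long reference video features
    qry = extractor(torch.randn(B, T_q, D))     # short query video features
    start_logits, end_logits = head(matcher(ref, qry))
    print(start_logits.shape, end_logits.shape)  # torch.Size([2, 128]) twice
```

The cross-attention step is the key design choice implied by the abstract: every frame of the long reference video can directly query the entire short clip, so the predicted start and end scores reflect clip-level semantics rather than isolated frame similarity.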



Author information

Corresponding author: Yuan Zhou


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Huo, S., Zhou, Y., Wang, H. (2022). YFormer: A New Transformer Architecture for Video-Query Based Video Moment Retrieval. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13536. Springer, Cham. https://doi.org/10.1007/978-3-031-18913-5_49


  • DOI: https://doi.org/10.1007/978-3-031-18913-5_49


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-18912-8

  • Online ISBN: 978-3-031-18913-5

  • eBook Packages: Computer Science, Computer Science (R0)
