
YFormer: A New Transformer Architecture for Video-Query Based Video Moment Retrieval

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2022)

Abstract

Video-query based video moment retrieval (VQ-VMR) aims to localize the segment in a long reference video that semantically corresponds to a short query video. This task faces the problem of matching features between long and short videos, which requires capturing long-term sequential dependencies. To address this problem, we develop a new transformer architecture, termed YFormer, for the VQ-VMR task. Specifically, a Spatio-temporal Feature Extractor based on self-attention is proposed to build a fine-grained semantic embedding for each frame, and a Semantic Relevance Matcher based on cross-attention is proposed to extract the cross-correlation between the query and reference videos. A token-based prediction head and a pooling-based prediction head are developed to localize the start and end boundaries of the retrieved moment. These prediction heads enable a complete end-to-end retrieval process. We reorganize the videos in the ActivityNet dataset to build a video moment retrieval benchmark and conduct extensive experiments on it. Our model achieves favorable performance compared with state-of-the-art methods.
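
The abstract only names the main components. As a rough illustration of how such a pipeline fits together, the sketch below wires a self-attention frame encoder, a cross-attention matcher, and a boundary prediction head in PyTorch. All module names, dimensions, layer counts, and the pooling-style head are assumptions made for this sketch; it does not reproduce the authors' published implementation.

```python
# Minimal VQ-VMR sketch in the spirit of the paper's description (not the authors' code).
import torch
import torch.nn as nn


class SpatioTemporalFeatureExtractor(nn.Module):
    """Self-attention over per-frame features to build contextualized frame embeddings."""

    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, frames):            # frames: (B, T, dim)
        return self.encoder(frames)


class SemanticRelevanceMatcher(nn.Module):
    """Cross-attention: each reference frame attends to the query-video frames."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, reference, query):
        matched, _ = self.cross_attn(reference, query, query)
        return matched                    # (B, T_ref, dim)


class BoundaryHead(nn.Module):
    """Pooling-style head predicting per-frame start and end scores."""

    def __init__(self, dim=512):
        super().__init__()
        self.start = nn.Linear(dim, 1)
        self.end = nn.Linear(dim, 1)

    def forward(self, matched):
        return self.start(matched).squeeze(-1), self.end(matched).squeeze(-1)


if __name__ == "__main__":
    B, T_ref, T_q, D = 2, 128, 16, 512
    extractor = SpatioTemporalFeatureExtractor(D)
    matcher = SemanticRelevanceMatcher(D)
    head = BoundaryHead(D)

    ref = extractor(torch.randn(B, T_ref, D))   # long reference video features
    qry = extractor(torch.randn(B, T_q, D))     # short query video features
    start_logits, end_logits = head(matcher(ref, qry))
    print(start_logits.shape, end_logits.shape)  # torch.Size([2, 128]) twice
```

The cross-attention step is the key design choice implied by the abstract: every frame of the long reference video can directly query the entire short clip, so the predicted start and end scores reflect clip-level semantics rather than isolated frame similarity.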



Author information

Corresponding author: Yuan Zhou


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Huo, S., Zhou, Y., Wang, H. (2022). YFormer: A New Transformer Architecture for Video-Query Based Video Moment Retrieval. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13536. Springer, Cham. https://doi.org/10.1007/978-3-031-18913-5_49


  • DOI: https://doi.org/10.1007/978-3-031-18913-5_49


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-18912-8

  • Online ISBN: 978-3-031-18913-5

  • eBook Packages: Computer Science, Computer Science (R0)
