Abstract
Video-query based video moment retrieval (VQ-VMR) aims to localize the segment in a long reference video that semantically corresponds to a short query video. The task requires matching features between a long video and a short one, which in turn demands modeling long-range temporal dependencies. To address this problem, we develop a new transformer architecture, termed YFormer, for the VQ-VMR task. Specifically, a Spatio-temporal Feature Extractor based on self-attention builds a fine-grained semantic embedding for each frame, and a Semantic Relevance Matcher based on cross-attention extracts the cross-correlation between the query and reference videos. A token-based prediction head and a pool-based prediction head are developed to localize the start and end boundaries of the retrieved moment, enabling a fully end-to-end retrieval process. We reorganize the videos in the ActivityNet dataset to build a video moment retrieval benchmark and conduct extensive experiments on it. Our model achieves favorable performance compared with state-of-the-art methods.
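The cross-attention matching described above can be illustrated with a minimal sketch: reference-video frame embeddings attend over query-video frame embeddings, and a per-frame relevance score is derived from the attended features. The function name, the single-head formulation, and the dot-product relevance scoring here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cross_attention(query_feats, ref_feats):
    """Scaled dot-product cross-attention sketch.

    query_feats: (Lq, d) frame embeddings of the short query video.
    ref_feats:   (Lr, d) frame embeddings of the long reference video.
    Each reference frame attends over all query frames.
    """
    d = query_feats.shape[1]
    # Reference frames act as attention queries; the query video supplies keys/values.
    scores = ref_feats @ query_feats.T / np.sqrt(d)          # (Lr, Lq)
    # Row-wise softmax (numerically stabilized).
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    attended = weights @ query_feats                          # (Lr, d)
    # One simple relevance signal: agreement between each reference frame
    # and its query-conditioned summary (illustrative choice).
    relevance = (ref_feats * attended).sum(axis=1)            # (Lr,)
    return attended, relevance

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 64))    # 8 query frames, 64-d embeddings
r = rng.standard_normal((50, 64))   # 50 reference frames
attended, relevance = cross_attention(q, r)
print(attended.shape, relevance.shape)   # (50, 64) (50,)
```

In a full model, the relevance scores (or the attended features themselves) would feed the prediction heads that regress the start and end boundaries of the matching segment.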
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Huo, S., Zhou, Y., Wang, H. (2022). YFormer: A New Transformer Architecture for Video-Query Based Video Moment Retrieval. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13536. Springer, Cham. https://doi.org/10.1007/978-3-031-18913-5_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18912-8
Online ISBN: 978-3-031-18913-5
eBook Packages: Computer Science, Computer Science (R0)