Abstract
Video grounding aims to temporally localize, in an untrimmed video, the action referred to by a natural-language query, and plays an important role in fine-grained video understanding. Given temporal proposals of limited granularity, the task is challenging: it requires effectively fusing multi-modal features from queries and videos, and accurately localizing the referred action. For multi-modal feature fusion, we present an Intra- and Inter-modal Multilinear pooling (IIM) model that combines multi-modal features while accounting for both intra- and inter-modal feature interactions. Compared with existing multi-modal fusion models, IIM captures high-order interactions and is better suited to modeling the temporal information in videos. For action localization, we propose a simple yet effective multi-task learning framework that simultaneously predicts the action label, alignment score, and refined location in an end-to-end manner. Experimental results on the real-world TACoS and Charades-STA datasets demonstrate the superiority of the proposed approach over existing state-of-the-art methods.
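To make the abstract's two components concrete, the following is a minimal, illustrative sketch (not the authors' released code) of low-rank multilinear fusion with both intra-modal and inter-modal Hadamard-product interactions, feeding three multi-task heads for the action label, alignment score, and refined location. All layer sizes, activations, and the concatenation-based fusion form are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class IIMFusionSketch(nn.Module):
    """Hypothetical sketch of intra-/inter-modal multilinear fusion with
    multi-task heads; dimensions and fusion form are illustrative."""

    def __init__(self, d_video=500, d_query=1024, d_joint=512, n_actions=100):
        super().__init__()
        # Project each modality into a shared low-rank space.
        self.proj_v = nn.Linear(d_video, d_joint)
        self.proj_q = nn.Linear(d_query, d_joint)
        # Multi-task heads over the fused representation.
        self.action_head = nn.Linear(3 * d_joint, n_actions)  # action label
        self.align_head = nn.Linear(3 * d_joint, 1)           # alignment score
        self.loc_head = nn.Linear(3 * d_joint, 2)             # (start, end) offsets

    def forward(self, video_feat, query_feat):
        v = torch.tanh(self.proj_v(video_feat))
        q = torch.tanh(self.proj_q(query_feat))
        inter = v * q    # inter-modal (video-query) Hadamard interaction
        intra_v = v * v  # intra-modal interaction within the video
        intra_q = q * q  # intra-modal interaction within the query
        fused = torch.cat([inter, intra_v, intra_q], dim=-1)
        return (self.action_head(fused),
                self.align_head(fused).squeeze(-1),
                self.loc_head(fused))

# Usage with a batch of 4 proposal features and sentence embeddings:
logits, scores, offsets = IIMFusionSketch()(torch.randn(4, 500),
                                            torch.randn(4, 1024))
```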
Notes
The strategy we used here is as follows: if a sentence is detected to have more than one verb, we prefer the verb tagged as 'VBZ' (i.e., a verb in the 3rd person singular present), since the subjects of the queries are usually third-person singular (e.g., 'the person', 'she', or 'he'). If no 'VBZ' is detected, we choose the last verb as the representative verb, since the earlier verbs are likely to describe the subject rather than the action.
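A minimal sketch of this heuristic, assuming NLTK's off-the-shelf POS tagger (the paper's own tagging tool may differ, and the example sentence is hypothetical):

```python
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are installed

def pick_representative_verb(query):
    """Return the representative action verb of a query, or None."""
    tagged = nltk.pos_tag(nltk.word_tokenize(query))
    verbs = [(word, tag) for word, tag in tagged if tag.startswith('VB')]
    if not verbs:
        return None
    # Prefer a 'VBZ' verb (3rd person singular present), since the
    # query subject is usually 'the person', 'she', or 'he'.
    for word, tag in verbs:
        if tag == 'VBZ':
            return word
    # Otherwise fall back to the last verb: earlier verbs tend to
    # describe the subject rather than the action.
    return verbs[-1][0]

print(pick_representative_verb('The person sitting on the couch opens the door.'))
# -> 'opens' (tagged VBZ), not the subject-modifying 'sitting'
```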
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under the Young Scientists Fund (Grant 61702143) and the Key Programme (Grant 61836002).
Cite this article
Yu, Z., Song, Y., Yu, J. et al. Intra- and Inter-modal Multilinear Pooling with Multitask Learning for Video Grounding. Neural Process Lett 52, 1863–1879 (2020). https://doi.org/10.1007/s11063-020-10205-y