Abstract
Video grounding aims to temporally localize, in an untrimmed video, the action referred to by a natural-language query, and plays an important role in fine-grained video understanding. Given temporal proposals of limited granularity, the task is challenging: it requires effectively fusing multi-modal features from queries and videos, and accurately localizing the referred action. For multi-modal feature fusion, we present an Intra- and Inter-modal Multilinear pooling (IIM) model that combines multi-modal features while accounting for both intra- and inter-modal feature interactions. Compared with existing multi-modal fusion models, IIM captures high-order interactions and is better suited to modeling the temporal information in videos. For action localization, we propose a simple yet effective multi-task learning framework that simultaneously predicts the action label, alignment score, and refined location in an end-to-end manner. Experimental results on the real-world TACoS and Charades-STA datasets demonstrate the superiority of the proposed approach over existing state-of-the-art methods.
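To make the abstract's two components concrete, the following is a minimal, illustrative sketch (not the authors' released code) of low-rank multilinear fusion with both intra-modal and inter-modal Hadamard-product interactions, feeding three multi-task heads for the action label, alignment score, and refined location. All layer sizes, activations, and the concatenation-based fusion form are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class IIMFusionSketch(nn.Module):
    """Hypothetical sketch of intra-/inter-modal multilinear fusion with
    multi-task heads; dimensions and fusion form are illustrative."""

    def __init__(self, d_video=500, d_query=1024, d_joint=512, n_actions=100):
        super().__init__()
        # Project each modality into a shared low-rank space.
        self.proj_v = nn.Linear(d_video, d_joint)
        self.proj_q = nn.Linear(d_query, d_joint)
        # Multi-task heads over the fused representation.
        self.action_head = nn.Linear(3 * d_joint, n_actions)  # action label
        self.align_head = nn.Linear(3 * d_joint, 1)           # alignment score
        self.loc_head = nn.Linear(3 * d_joint, 2)             # (start, end) offsets

    def forward(self, video_feat, query_feat):
        v = torch.tanh(self.proj_v(video_feat))
        q = torch.tanh(self.proj_q(query_feat))
        inter = v * q    # inter-modal (video-query) Hadamard interaction
        intra_v = v * v  # intra-modal interaction within the video
        intra_q = q * q  # intra-modal interaction within the query
        fused = torch.cat([inter, intra_v, intra_q], dim=-1)
        return (self.action_head(fused),
                self.align_head(fused).squeeze(-1),
                self.loc_head(fused))

# Usage with a batch of 4 proposal features and sentence embeddings:
logits, scores, offsets = IIMFusionSketch()(torch.randn(4, 500),
                                            torch.randn(4, 1024))
```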
Notes
The strategy we used here is as follows: if a sentence is detected to have more than one verb, we prefer the verb tagged as 'VBZ' (i.e., a verb in the 3rd person singular present), since the subjects of the queries are usually third-person singular (e.g., 'the person', 'she', or 'he'). If no 'VBZ' is detected, we choose the last verb as the representative verb, since the earlier verbs are likely to describe the subject rather than the action.
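A minimal sketch of this heuristic, assuming NLTK's off-the-shelf POS tagger (the paper's own tagging tool may differ, and the example sentence is hypothetical):

```python
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data are installed

def pick_representative_verb(query):
    """Return the representative action verb of a query, or None."""
    tagged = nltk.pos_tag(nltk.word_tokenize(query))
    verbs = [(word, tag) for word, tag in tagged if tag.startswith('VB')]
    if not verbs:
        return None
    # Prefer a 'VBZ' verb (3rd person singular present), since the
    # query subject is usually 'the person', 'she', or 'he'.
    for word, tag in verbs:
        if tag == 'VBZ':
            return word
    # Otherwise fall back to the last verb: earlier verbs tend to
    # describe the subject rather than the action.
    return verbs[-1][0]

print(pick_representative_verb('The person sitting on the couch opens the door.'))
# -> 'opens' (tagged VBZ), not the subject-modifying 'sitting'
```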
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under the Young Scientists Fund (Grant 61702143) and the Key Programme (Grant 61836002).
Cite this article
Yu, Z., Song, Y., Yu, J. et al. Intra- and Inter-modal Multilinear Pooling with Multitask Learning for Video Grounding. Neural Process Lett 52, 1863–1879 (2020). https://doi.org/10.1007/s11063-020-10205-y