
Intra- and Inter-modal Multilinear Pooling with Multitask Learning for Video Grounding


Abstract

Video grounding aims to temporally localize an action in an untrimmed video referred to by a natural-language query, and it plays an important role in fine-grained video understanding. Given temporal proposals of limited granularity, the task is challenging in that it requires fusing multi-modal features from queries and videos effectively and localizing the referred action accurately. For multimodal feature fusion, we present an Intra- and Inter-modal Multilinear pooling (IIM) model that combines the multi-modal features effectively while considering both intra- and inter-modal feature interactions. Compared to existing multimodal fusion models, IIM can capture high-order interactions and is better suited to modeling the temporal information of videos. For action localization, we propose a simple yet effective multi-task learning framework that simultaneously predicts the action label, alignment score, and refined location in an end-to-end manner. Experimental results on the real-world TACoS and Charades-STA datasets demonstrate the superiority of the proposed approach over existing state-of-the-art methods.
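To make the fusion idea concrete, the following is a minimal sketch of low-rank multilinear pooling that fuses a clip-level video feature with a sentence-level query feature while forming both inter-modal (video-query) and intra-modal (video-video, query-query) interaction terms. It is an illustrative assumption of how such a module could look, written in PyTorch; the class name, feature dimensions, and factorization details are ours and are not taken from the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultilinearFusion(nn.Module):
    # Hypothetical intra-/inter-modal low-rank multilinear pooling (MFB-style).
    def __init__(self, vid_dim=500, qry_dim=1000, rank=5, out_dim=1000):
        super().__init__()
        hidden = rank * out_dim
        # Project both modalities into a shared low-rank factor space.
        self.proj_v = nn.Linear(vid_dim, hidden)
        self.proj_q = nn.Linear(qry_dim, hidden)
        self.rank, self.out_dim = rank, out_dim

    def _pool(self, x):
        # Sum-pool over the rank dimension, then apply signed square-root
        # and l2 normalization (a common stabilization for bilinear pooling).
        x = x.view(-1, self.rank, self.out_dim).sum(dim=1)
        x = torch.sign(x) * torch.sqrt(torch.abs(x) + 1e-8)
        return F.normalize(x, dim=-1)

    def forward(self, video_feat, query_feat):
        v = self.proj_v(video_feat)   # (batch, rank * out_dim)
        q = self.proj_q(query_feat)   # (batch, rank * out_dim)
        inter   = self._pool(v * q)   # inter-modal: video x query
        intra_v = self._pool(v * v)   # intra-modal: video x video
        intra_q = self._pool(q * q)   # intra-modal: query x query
        return torch.cat([inter, intra_v, intra_q], dim=-1)

# Example: fuse a batch of two video/query feature pairs.
fusion = MultilinearFusion()
fused = fusion(torch.randn(2, 500), torch.randn(2, 1000))
print(fused.shape)  # torch.Size([2, 3000])

In a multi-task setup such as the one described above, a fused vector of this kind could then feed separate heads for action classification, alignment scoring, and location regression; those heads are omitted from the sketch.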


Notes

  1. The strategy we used here is as follows: if a sentence is detected to have more than one verb, we prefer the verb tagged as 'VBZ' (i.e., a verb in the 3rd person singular present), since the queries are usually phrased in the 3rd person singular present (e.g., the subject is 'the person', 'she', or 'he'). If no 'VBZ' is detected, we choose the last verb as the representative verb, since the remaining verbs are likely to describe the subject rather than the action. A sketch of this heuristic is given below.
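For illustration, the footnote's heuristic can be sketched as follows using NLTK's Penn Treebank POS tagger; the function name and the choice of tagger are assumptions for this example and may differ from the parser actually used in the paper.

# Requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
import nltk

def representative_verb(sentence):
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)  # e.g. [('person', 'NN'), ('opens', 'VBZ'), ...]
    verbs = [(tok, tag) for tok, tag in tagged if tag.startswith('VB')]
    if not verbs:
        return None
    # Prefer a 3rd-person-singular-present verb ('VBZ'); otherwise fall back to the
    # last verb, since earlier verbs often describe the subject rather than the action.
    for tok, tag in verbs:
        if tag == 'VBZ':
            return tok
    return verbs[-1][0]

print(representative_verb("The person sitting on the couch opens the laptop."))  # 'opens'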


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Young Scientists Fund and Key Programme) under Grants 61702143 and 61836002.

Author information


Corresponding author

Correspondence to Jun Yu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Yu, Z., Song, Y., Yu, J. et al. Intra- and Inter-modal Multilinear Pooling with Multitask Learning for Video Grounding. Neural Process Lett 52, 1863–1879 (2020). https://doi.org/10.1007/s11063-020-10205-y

