
Video Question Answering via Hierarchical Dual-Level Attention Network Learning

Published: 19 October 2017

Abstract

Video question answering is a challenging task in visual information retrieval: it aims to provide the correct answer from the referenced video content according to a given question. However, existing visual question answering approaches mainly tackle static image question answering and may be ineffective when applied directly to videos, because they insufficiently model video temporal dynamics. In this paper, we study the problem of video question answering from the viewpoint of hierarchical dual-level attention network learning. We obtain object appearance and movement information from the video using both frame-level and segment-level feature representations. We then develop hierarchical dual-level attention networks that learn question-aware video representations with word-level and question-level attention mechanisms. We further devise a question-level fusion attention mechanism for the proposed networks that learns a question-aware joint video representation for video question answering. We construct two large-scale video question answering datasets, and extensive experiments validate the effectiveness of our method.
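The abstract outlines a two-stage attention scheme: word-level attention weights the video features against each question word, and question-level attention then weights the resulting word contexts against the whole-question encoding, with one such network per stream (frame-level appearance, segment-level motion) before fusion. Below is a minimal PyTorch sketch of that dual-level step; the module names, dimensions, and the additive scoring form are illustrative assumptions, not the authors' exact architecture.

# Minimal sketch of dual-level (word- and question-level) attention over one
# video stream, following the abstract's description. All names and dimensions
# are assumptions for illustration, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualLevelAttention(nn.Module):
    def __init__(self, vid_dim, q_dim, hid_dim):
        super().__init__()
        self.vid_proj = nn.Linear(vid_dim, hid_dim)
        self.word_proj = nn.Linear(q_dim, hid_dim)
        self.q_proj = nn.Linear(q_dim, hid_dim)
        self.word_score = nn.Linear(hid_dim, 1)
        self.q_score = nn.Linear(hid_dim, 1)

    def forward(self, vid_feats, word_feats, q_feat):
        # vid_feats:  (T, vid_dim) frame- or segment-level video features
        # word_feats: (W, q_dim)   per-word question encodings
        # q_feat:     (q_dim,)     whole-question encoding
        v = self.vid_proj(vid_feats)                          # (T, H)
        # Word-level attention: each question word attends over video positions.
        w = self.word_proj(word_feats)                        # (W, H)
        s = self.word_score(torch.tanh(v.unsqueeze(0) + w.unsqueeze(1)))
        alpha = F.softmax(s.squeeze(-1), dim=1)               # (W, T)
        word_ctx = alpha @ vid_feats                          # (W, vid_dim)
        # Question-level attention: the whole-question encoding weights
        # the per-word video contexts.
        c = self.vid_proj(word_ctx)                           # (W, H)
        beta = F.softmax(
            self.q_score(torch.tanh(c + self.q_proj(q_feat))).squeeze(-1), dim=0)
        return beta @ word_ctx   # (vid_dim,) question-aware video representation

# Usage: one module per stream; a question-level fusion attention would then
# weight the two resulting stream representations (not shown here).
appearance_att = DualLevelAttention(vid_dim=2048, q_dim=512, hid_dim=256)
rep = appearance_att(torch.randn(40, 2048), torch.randn(8, 512), torch.randn(512))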



Published In

cover image ACM Conferences
MM '17: Proceedings of the 25th ACM international conference on Multimedia
October 2017
2028 pages
ISBN:9781450349062
DOI:10.1145/3123266
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 October 2017


Author Tags

  1. hierarchical attention network
  2. video question answering

Qualifiers

  • Research-article

Conference

MM '17: ACM Multimedia Conference
October 23-27, 2017
Mountain View, California, USA

Acceptance Rates

MM '17 paper acceptance rate: 189 of 684 submissions, 28%.
Overall acceptance rate: 2,145 of 8,556 submissions, 25%.


Cited By
  • (2024) Multi-factor adaptive vision selection for egocentric video question answering. Proceedings of the 41st International Conference on Machine Learning, 59310-59328. DOI: 10.5555/3692070.3694520. Online publication date: 21-Jul-2024.
  • (2024) Multi-Granularity Contrastive Cross-Modal Collaborative Generation for End-to-End Long-Term Video Question Answering. IEEE Transactions on Image Processing, 33, 3115-3129. DOI: 10.1109/TIP.2024.3390984.
  • (2024) Appearance-Motion Dual-Stream Heterogeneous Network for VideoQA. MultiMedia Modeling, 212-227. DOI: 10.1007/978-3-031-53311-2_16. Online publication date: 28-Jan-2024.
  • (2023) Video Question Answering with Overcoming Spatial and Temporal Redundancy in Feature Extraction. Journal of Broadcast Engineering, 28(7), 849-858. DOI: 10.5909/JBE.2023.28.7.849. Online publication date: 31-Dec-2023.
  • (2023) Advancing Video Question Answering with a Multi-modal and Multi-layer Question Enhancement Network. Proceedings of the 31st ACM International Conference on Multimedia, 3985-3993. DOI: 10.1145/3581783.3612239. Online publication date: 26-Oct-2023.
  • (2023) Text-Guided Object Detector for Multi-modal Video Question Answering. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 1032-1042. DOI: 10.1109/WACV56688.2023.00109. Online publication date: Jan-2023.
  • (2023) Object-based Appearance-Motion Heterogeneous Network for Video Question Answering. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 112-119. DOI: 10.1109/ICPADS60453.2023.00025. Online publication date: 17-Dec-2023.
  • (2023) IntentQA: Context-aware Video Intent Reasoning. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 11929-11940. DOI: 10.1109/ICCV51070.2023.01099. Online publication date: 1-Oct-2023.
  • (2023) ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 23191-23200. DOI: 10.1109/CVPR52729.2023.02221. Online publication date: Jun-2023.
  • (2023) Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 19027-19036. DOI: 10.1109/CVPR52729.2023.01824. Online publication date: Jun-2023.
