
Video Question Answering via Hierarchical Dual-Level Attention Network Learning

Published: 19 October 2017

Abstract

Video question answering is a challenging task in visual information retrieval: it aims to provide the correct answer from the referenced video content according to a given question. However, existing visual question answering approaches mainly tackle static image question answering and may be ineffective when applied directly to videos, because they insufficiently model video temporal dynamics. In this paper, we study the problem of video question answering from the viewpoint of hierarchical dual-level attention network learning. We obtain object appearance and movement information from the video using both frame-level and segment-level feature representations. We then develop hierarchical dual-level attention networks that learn question-aware video representations with word-level and question-level attention mechanisms. We further devise a question-level fusion attention mechanism for the proposed networks that learns a question-aware joint video representation for video question answering. We construct two large-scale video question answering datasets, and extensive experiments validate the effectiveness of our method.
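The abstract outlines a two-stage attention scheme: word-level attention weights the video features against each question word, and question-level attention then weights the resulting word contexts against the whole-question encoding, with one such network per stream (frame-level appearance, segment-level motion) before fusion. Below is a minimal PyTorch sketch of that dual-level step; the module names, dimensions, and the additive scoring form are illustrative assumptions, not the authors' exact architecture.

# Minimal sketch of dual-level (word- and question-level) attention over one
# video stream, following the abstract's description. All names and dimensions
# are assumptions for illustration, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualLevelAttention(nn.Module):
    def __init__(self, vid_dim, q_dim, hid_dim):
        super().__init__()
        self.vid_proj = nn.Linear(vid_dim, hid_dim)
        self.word_proj = nn.Linear(q_dim, hid_dim)
        self.q_proj = nn.Linear(q_dim, hid_dim)
        self.word_score = nn.Linear(hid_dim, 1)
        self.q_score = nn.Linear(hid_dim, 1)

    def forward(self, vid_feats, word_feats, q_feat):
        # vid_feats:  (T, vid_dim) frame- or segment-level video features
        # word_feats: (W, q_dim)   per-word question encodings
        # q_feat:     (q_dim,)     whole-question encoding
        v = self.vid_proj(vid_feats)                          # (T, H)
        # Word-level attention: each question word attends over video positions.
        w = self.word_proj(word_feats)                        # (W, H)
        s = self.word_score(torch.tanh(v.unsqueeze(0) + w.unsqueeze(1)))
        alpha = F.softmax(s.squeeze(-1), dim=1)               # (W, T)
        word_ctx = alpha @ vid_feats                          # (W, vid_dim)
        # Question-level attention: the whole-question encoding weights
        # the per-word video contexts.
        c = self.vid_proj(word_ctx)                           # (W, H)
        beta = F.softmax(
            self.q_score(torch.tanh(c + self.q_proj(q_feat))).squeeze(-1), dim=0)
        return beta @ word_ctx   # (vid_dim,) question-aware video representation

# Usage: one module per stream; a question-level fusion attention would then
# weight the two resulting stream representations (not shown here).
appearance_att = DualLevelAttention(vid_dim=2048, q_dim=512, hid_dim=256)
rep = appearance_att(torch.randn(40, 2048), torch.randn(8, 512), torch.randn(512))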



Published In

cover image ACM Conferences
MM '17: Proceedings of the 25th ACM international conference on Multimedia
October 2017
2028 pages
ISBN:9781450349062
DOI:10.1145/3123266
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 October 2017


Author Tags

  1. hierarchical attention network
  2. video question answering

Qualifiers

  • Research-article

Conference

MM '17: ACM Multimedia Conference
October 23-27, 2017
Mountain View, California, USA

Acceptance Rates

MM '17 paper acceptance rate: 189 of 684 submissions, 28%.
Overall acceptance rate: 2,145 of 8,556 submissions, 25%.


Cited By
  • (2024) Multi-factor adaptive vision selection for egocentric video question answering. Proceedings of the 41st International Conference on Machine Learning, 59310-59328. DOI: 10.5555/3692070.3694520. Online publication date: 21-Jul-2024.
  • (2024) Multi-Granularity Contrastive Cross-Modal Collaborative Generation for End-to-End Long-Term Video Question Answering. IEEE Transactions on Image Processing, 33, 3115-3129. DOI: 10.1109/TIP.2024.3390984.
  • (2024) Appearance-Motion Dual-Stream Heterogeneous Network for VideoQA. MultiMedia Modeling, 212-227. DOI: 10.1007/978-3-031-53311-2_16. Online publication date: 28-Jan-2024.
  • (2023) Video Question Answering with Overcoming Spatial and Temporal Redundancy in Feature Extraction. Journal of Broadcast Engineering, 28(7), 849-858. DOI: 10.5909/JBE.2023.28.7.849. Online publication date: 31-Dec-2023.
  • (2023) Advancing Video Question Answering with a Multi-modal and Multi-layer Question Enhancement Network. Proceedings of the 31st ACM International Conference on Multimedia, 3985-3993. DOI: 10.1145/3581783.3612239. Online publication date: 26-Oct-2023.
  • (2023) Text-Guided Object Detector for Multi-modal Video Question Answering. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 1032-1042. DOI: 10.1109/WACV56688.2023.00109. Online publication date: Jan-2023.
  • (2023) Object-based Appearance-Motion Heterogeneous Network for Video Question Answering. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 112-119. DOI: 10.1109/ICPADS60453.2023.00025. Online publication date: 17-Dec-2023.
  • (2023) IntentQA: Context-aware Video Intent Reasoning. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 11929-11940. DOI: 10.1109/ICCV51070.2023.01099. Online publication date: 1-Oct-2023.
  • (2023) ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 23191-23200. DOI: 10.1109/CVPR52729.2023.02221. Online publication date: Jun-2023.
  • (2023) Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 19027-19036. DOI: 10.1109/CVPR52729.2023.01824. Online publication date: Jun-2023.
