ABSTRACT
Video story question answering (video story QA) is a challenging problem, as it requires a joint understanding of diverse data sources (i.e., video, subtitle, question, and answer choices). Existing approaches for video story QA share several common defects: (1) reliance on a single temporal scale; (2) static and coarse multimodal interaction; and (3) insufficient (or shallow) exploitation of both question and answer choices. In this paper, we propose a novel framework named Dual Hierarchical Temporal Convolutional Network (DHTCN) to address these defects together. The proposed DHTCN explores multiple temporal scales by building a hierarchical temporal convolutional network. In each temporal convolutional layer, two key components, namely AttLSTM and QA-Aware Dynamic Normalization, are introduced to capture temporal dependencies and multimodal interactions in a dynamic and fine-grained manner. To enable sufficient exploitation of both question and answer choices, we increase the depth of QA pairs with a stack of non-linear layers, and exploit QA pairs in each layer of the network. Extensive experiments are conducted on two widely used datasets, TVQA and MovieQA, demonstrating the effectiveness of DHTCN. Our model obtains state-of-the-art results on both datasets.
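The abstract does not give the exact formulation of QA-Aware Dynamic Normalization, but the general idea of conditioning a normalization layer on a query embedding can be sketched as feature-wise modulation (in the spirit of conditional normalization): normalize the video/subtitle features, then scale and shift them with parameters predicted from the QA-pair embedding. The function and weight names below (`qa_aware_dynamic_norm`, `w_gamma`, `w_beta`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def qa_aware_dynamic_norm(features, qa_embedding, w_gamma, w_beta, eps=1e-5):
    """Layer-normalize per-time-step features, then modulate the normalized
    activations with a scale (gamma) and shift (beta) predicted from the
    QA-pair embedding -- a hypothetical sketch of QA-conditioned normalization."""
    # Normalize each time step's feature vector (layer-norm style).
    mean = features.mean(axis=-1, keepdims=True)
    std = features.std(axis=-1, keepdims=True)
    normalized = (features - mean) / (std + eps)
    # Predict per-channel modulation parameters from the QA embedding.
    gamma = qa_embedding @ w_gamma  # shape: (d,)
    beta = qa_embedding @ w_beta    # shape: (d,)
    return gamma * normalized + beta

# Toy example: 4 time steps with 8-d features, conditioned on a 6-d QA embedding.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))
qa = rng.standard_normal(6)
w_g = rng.standard_normal((6, 8))
w_b = rng.standard_normal((6, 8))
out = qa_aware_dynamic_norm(feats, qa, w_g, w_b)
print(out.shape)  # (4, 8)
```

Because gamma and beta depend on the QA pair, the same video features are re-normalized differently for each candidate answer, which is one plausible reading of "dynamic and fine-grained" multimodal interaction.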
Supplemental Material
A compressed file is available for download; unzip it to find a PDF containing additional experiments and visualizations.