Research Article · DOI: 10.1145/3394171.3413649 · ACM International Conference on Multimedia (MM '20)

Dual Hierarchical Temporal Convolutional Network with QA-Aware Dynamic Normalization for Video Story Question Answering

Published: 12 October 2020

ABSTRACT

Video story question answering (video story QA) is a challenging problem, as it requires a joint understanding of diverse data sources (i.e., video, subtitle, question, and answer choices). Existing approaches for video story QA share several common defects: (1) a single temporal scale; (2) static and coarse multimodal interaction; and (3) insufficient (or shallow) exploitation of both the question and the answer choices. In this paper, we propose a novel framework named Dual Hierarchical Temporal Convolutional Network (DHTCN) to address these defects together. DHTCN explores multiple temporal scales by building a hierarchical temporal convolutional network. In each temporal convolutional layer, two key components, namely AttLSTM and QA-Aware Dynamic Normalization, are introduced to capture temporal dependencies and multimodal interactions in a dynamic and fine-grained manner. To enable sufficient exploitation of both the question and the answer choices, we deepen the QA-pair representation with a stack of non-linear layers and exploit the QA pairs in every layer of the network. Extensive experiments are conducted on two widely used datasets, TVQA and MovieQA, demonstrating the effectiveness of DHTCN. Our model obtains state-of-the-art results on both datasets.
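To make the two architectural ideas named in the abstract more concrete, the sketch below (PyTorch) shows (1) a hierarchical stack of temporal convolutions that shortens the sequence at each level, yielding progressively coarser temporal scales, and (2) a "QA-aware dynamic normalization" layer whose scale and shift are generated from a pooled QA-pair embedding, in the spirit of conditional normalization. This is a minimal illustration under those assumptions; the module names, dimensions, and exact conditioning form are not taken from the paper.

```python
# Illustrative sketch only: hierarchical temporal convolution with a
# QA-conditioned normalization, assumed to follow the conditional-normalization
# pattern the abstract describes. Not the authors' implementation.
import torch
import torch.nn as nn


class QAAwareDynamicNorm(nn.Module):
    """LayerNorm-style normalization whose affine parameters are predicted
    from a QA-pair embedding (FiLM-like conditioning; assumed form)."""

    def __init__(self, dim, qa_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_gamma = nn.Linear(qa_dim, dim)   # per-channel scale from QA
        self.to_beta = nn.Linear(qa_dim, dim)    # per-channel shift from QA

    def forward(self, x, qa):
        # x: (batch, time, dim) context features; qa: (batch, qa_dim)
        gamma = self.to_gamma(qa).unsqueeze(1)   # (batch, 1, dim)
        beta = self.to_beta(qa).unsqueeze(1)
        return (1.0 + gamma) * self.norm(x) + beta


class HierarchicalTemporalConv(nn.Module):
    """Stack of 1-D temporal convolutions; each level halves the temporal
    length, so deeper levels operate on coarser temporal scales."""

    def __init__(self, dim, qa_dim, num_levels=3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
            for _ in range(num_levels)
        )
        self.norms = nn.ModuleList(
            QAAwareDynamicNorm(dim, qa_dim) for _ in range(num_levels)
        )

    def forward(self, x, qa):
        # x: (batch, time, dim); returns one QA-conditioned feature per scale.
        outputs = []
        for conv, norm in zip(self.convs, self.norms):
            x = conv(x.transpose(1, 2)).transpose(1, 2)  # convolve over time
            x = torch.relu(norm(x, qa))
            outputs.append(x)
        return outputs


if __name__ == "__main__":
    video_feats = torch.randn(2, 64, 256)   # (batch, frames, feature dim)
    qa_embed = torch.randn(2, 256)          # pooled question + answer choice
    model = HierarchicalTemporalConv(dim=256, qa_dim=256)
    for level, feats in enumerate(model(video_feats, qa_embed)):
        print(level, feats.shape)           # temporal length halves per level
```

Because the QA embedding modulates the normalization at every level, each temporal scale of the video/subtitle stream is re-weighted with respect to the question and candidate answer, rather than being fused with them only once at the end.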


Supplemental Material

3394171.3413649.mp4 (mp4, 31.5 MB)

