
Multichannel Attention Refinement for Video Question Answering

Published: 12 March 2020

Abstract

Video Question Answering (VideoQA) extends image question answering (ImageQA) to the video domain: given a video and a question about it, a method must produce the correct answer. Compared to ImageQA, the most distinctive difference is the media type. Both tasks require understanding visual media, but VideoQA is considerably more challenging, mainly because of the complexity and diversity of videos. In particular, working with video requires modeling its inherent temporal structure and analyzing the diverse information it contains. In this article, we propose to tackle the task from a multichannel perspective. Appearance, motion, and audio features are extracted from the video, and question-guided attentions over them are refined to generate the expressive clues that support the correct answer. We also incorporate relevant text information acquired from Wikipedia as an attempt to extend the capability of the method. Experiments on the TGIF-QA and ActivityNet-QA datasets show the advantages of our method over existing methods. We also demonstrate the effectiveness and interpretability of our method by analyzing the refined attention weights during the question-answering procedure.
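
To make the multichannel idea concrete, the sketch below shows question-guided temporal attention applied separately to appearance, motion, and audio features, with the attended channel summaries fused to score candidate answers. This is an illustrative reconstruction only, not the authors' implementation: the module names (ChannelAttention, MultichannelQA), hidden sizes, and feature dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Question-guided temporal attention over one feature channel."""

    def __init__(self, feat_dim, q_dim, hidden=256):
        super().__init__()
        self.proj_f = nn.Linear(feat_dim, hidden)
        self.proj_q = nn.Linear(q_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, feats, q):
        # feats: (B, T, feat_dim) per-segment features; q: (B, q_dim) question vector
        h = torch.tanh(self.proj_f(feats) + self.proj_q(q).unsqueeze(1))
        alpha = F.softmax(self.score(h).squeeze(-1), dim=1)      # (B, T) attention weights
        clue = torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)   # (B, feat_dim) attended summary
        return clue, alpha


class MultichannelQA(nn.Module):
    """Fuse question-attended appearance, motion, and audio clues to score answers."""

    def __init__(self, channel_dims, q_dim, num_answers, hidden=256):
        super().__init__()
        self.attend = nn.ModuleList(
            [ChannelAttention(d, q_dim, hidden) for d in channel_dims]
        )
        self.fuse = nn.Linear(sum(channel_dims) + q_dim, hidden)
        self.classify = nn.Linear(hidden, num_answers)

    def forward(self, channels, q):
        # channels: list of (B, T_i, d_i) tensors, one per modality
        clues = [attend(f, q)[0] for attend, f in zip(self.attend, channels)]
        fused = torch.relu(self.fuse(torch.cat(clues + [q], dim=-1)))
        return self.classify(fused)                               # (B, num_answers)


# Toy usage with hypothetical feature sizes: ResNet-style appearance (2048-d),
# C3D-style motion (4096-d), VGGish-style audio (128-d), 300-d question encoding.
appearance = torch.randn(2, 20, 2048)
motion = torch.randn(2, 20, 4096)
audio = torch.randn(2, 20, 128)
question = torch.randn(2, 300)
model = MultichannelQA([2048, 4096, 128], q_dim=300, num_answers=1000)
logits = model([appearance, motion, audio], question)
print(logits.shape)  # torch.Size([2, 1000])
```

In this sketch each channel keeps its own attention parameters, so the question can emphasize different time steps in different modalities; the answer-scoring head then sees one attended clue per channel plus the question itself.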


• Published in

  ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 1s
  Special Issue on Multimodal Machine Learning for Human Behavior Analysis and Special Issue on Computational Intelligence for Biomedical Data and Imaging
  January 2020, 376 pages
  ISSN: 1551-6857
  EISSN: 1551-6865
  DOI: 10.1145/3388236

          Copyright © 2020 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 12 March 2020
          • Revised: 1 October 2019
          • Accepted: 1 October 2019
          • Received: 1 April 2019
          Published in TOMM Volume 16, Issue 1s

