Pattern Recognition, Volume 132, December 2022, 108959

Dynamic self-attention with vision synchronization networks for video question answering

https://doi.org/10.1016/j.patcog.2022.108959

Highlights

  • A novel token selection mechanism based on the dynamic self-attention network is proposed to automatically extract important video features.

  • A vision synchronization network is proposed to align appearance and motion features at the time slice level.

  • Extensive experiments and analysis confirm the superiority of the proposed model DSAVS.

Abstract

Video Question Answering (VideoQA) has gained increasing attention as an important task in understanding rich spatio-temporal content, i.e., the appearance and motion in a video. However, existing approaches mainly use the question to learn attention over all the sampled appearance and motion features separately, which neglects two properties of VideoQA: (1) the answer to a question is often reflected in only a few frames and video clips, while most video content is superfluous; (2) appearance and motion features are usually concomitant and complementary to each other in time series. In this paper, we propose a novel VideoQA model, i.e., Dynamic Self-Attention with Vision Synchronization Networks (DSAVS), to address these problems. Specifically, a gated token selection mechanism is proposed to dynamically select the important tokens from the appearance and motion sequences. The selected tokens are fed into a self-attention mechanism that models their internal dependencies for more effective representation learning. To capture the correlation between appearance and motion features, a vision synchronization block is proposed to synchronize the two types of vision features at the time slice level. As a result, visual objects can be correlated with their corresponding activities, and performance is further improved. Extensive experiments conducted on three public VideoQA data sets confirm the effectiveness and superiority of our model compared with state-of-the-art methods.

Introduction

With the rapid development of computer vision and natural language processing, tasks involving both vision and language have attracted considerable interest. A new problem named Visual Question Answering (VQA), including Image Question Answering (ImageQA) [1], [2], [3], [4] and Video Question Answering (VideoQA) [5], [6], [7], [8], has emerged as a promising but intractable research topic. It requires an agent to provide the correct answer to a question about the content of a given image or video. Compared with ImageQA, where an image contains only appearance features, VideoQA is more challenging for three reasons: (1) it deals with a long sequence of images that contain not only appearance but also motion features; (2) substantial redundant information exists in the video; (3) different questions require different parts of the video to infer the answer. Nevertheless, progress on VideoQA is beneficial for various real-life applications, including tourist assistance, automatic customer service, and human-machine interaction [9].

A number of methods have been proposed in recent years to tackle the difficulties of VideoQA. These methods can be broadly divided into two categories depending on how visual features are processed. The first type of method [9], [10], [11], [12] uses only the appearance features of the video. The work presented in [13] introduces a learnable aggregating network with only appearance features. It mainly learns cross-modal correlations using diversity learning schemes, which neglects the motion information of the video and thus struggles to capture fine-grained features in time series. The second type of method [14], [15], [16], [17] takes both appearance and motion features as input. The model proposed in [15] gradually refines its attention over both appearance and motion features using the question as guidance. However, since the appearance and motion features extracted from a video contain redundant and interfering information, directly inferring the answer from all of them negatively impacts performance. Besides, this group of methods handles the appearance and motion features separately, which neglects the synchrony relation between the two types of features.

Meanwhile, the unique characteristics of visual appearance and motion features in videos provide important clues for VideoQA. First, the answer to a question is usually reflected in a few key frames or video clips, and most of the video information is superfluous. Inferring the answer directly from all the appearance and motion information is not only computationally expensive but also introduces a lot of noise. Consider the examples shown in Fig. 1(a). In the left instance, the question "what color is the phone in the hand of the woman?" can be answered from the frames that contain a phone in the hand of a woman. However, the object phone does not appear in most frames of the video, which means that most of the information in the video is useless for this question. In the right instance, a little boy is making a phone call in the video. The action "press the phone" only exists in a short video clip. Therefore, the information outside this short clip is useless for answering the question "How many times did the boy press the phone?" Second, the visual objects have their own activities in a video, and thus the appearance and motion features are often concomitant and complementary to each other at the time slice level. Handling the appearance and motion features independently may therefore lead to mistakes. Take the example shown in Fig. 1(b) as an instance. The video mainly contains two parts of information, i.e., a boy is jumping and a dog is running. The appearance features are extracted as a sequence "boy-boy-...-dog-dog", while the motion features are extracted as "jump-jump-...-running-running". Most existing approaches concentrate on the appearance of the "boy" for the question "what is the boy in the white vest doing?" with an attention mechanism. They also prefer "running" under the guidance of "boy" from the video, because the motion "running" has a higher probability than "jumping" when combined with "boy", i.e., p(boy, running) > p(boy, jumping) in the training set. A wrong answer, "running", is then inferred. The reason is that existing methods apply the attention mechanism to appearance and motion features separately, which neglects the synchrony relation between the two types of visual features. Obviously, the appearance and motion features play different roles in visual reasoning, and they are concomitant and complementary in time series. Therefore, these latent correlations between the appearance and motion features should be exploited, so that the action of the boy can be matched to jumping and the action of the dog to running.

However, it is quite difficult to capture the answer-related key information and the correlation between appearance and motion features. First, there is no explicit knowledge about which video frames and clips are relevant to the answer. Second, the appearance and motion features are represented in heterogeneous spaces, and it is difficult to find the motion features for a given visual object. To tackle these challenges, we propose to take advantage of the latent correlation between the different types of video features and the answer for VideoQA. In particular, we investigate: (1) how to mine the answer-related frames and clips in the video; (2) how to learn the correlation between the appearance and motion features. Our solutions to these questions result in a novel VideoQA model, i.e., Dynamic Self-Attention with Vision Synchronization Networks (DSAVS). To automatically find the key video frames and clips used for answer inference, a gated token selection mechanism with a dynamic self-attention block is proposed to select the supportive tokens from the input sequence (an illustrative sketch of this selection step is given after the contribution list below). To effectively learn the correlation between appearance and motion features, we synchronize them at the time slice level with a vision synchronization block, where the two types of features are fused under the guidance of the question sentence. Both the dynamic self-attention and vision synchronization blocks are then integrated into an end-to-end framework to infer the answer. The main contributions are summarized as follows:

  • We propose a dynamic self-attention method that automatically selects important video information and learns its internal dependencies, avoiding a large amount of redundant and noisy information.

  • Unlike existing VideoQA methods, we introduce a vision synchronization block to align the appearance and motion features, which avoids the misalignment of appearance and motion in time series.

  • Extensive experiments conducted on three public VideoQA data sets confirm the effectiveness and superiority of the proposed DSAVS compared with state-of-the-art methods.
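
To make the token selection idea above concrete, the following is a minimal PyTorch-style sketch of a question-conditioned gate that keeps only the top-k visual tokens before applying self-attention. The module names, the top-k gate, and the use of nn.MultiheadAttention are illustrative assumptions on our part; the paper's actual DynSA block may be formulated differently.

```python
import torch
import torch.nn as nn


class GatedTokenSelection(nn.Module):
    """Score each visual token with a question-conditioned gate and keep the top-k.

    Hypothetical sketch: the gating form used in the paper may differ.
    """

    def __init__(self, dim, k):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(2 * dim, 1)  # scores a token given the question

    def forward(self, tokens, question):
        # tokens:   (B, T, D) appearance or motion features
        # question: (B, D)    sentence-level question embedding
        q = question.unsqueeze(1).expand(-1, tokens.size(1), -1)
        scores = torch.sigmoid(self.gate(torch.cat([tokens, q], dim=-1))).squeeze(-1)  # (B, T)
        idx = scores.topk(self.k, dim=-1).indices                                      # (B, k)
        return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))     # (B, k, D)


class DynSABlock(nn.Module):
    """Self-attention applied only to the selected (supportive) tokens."""

    def __init__(self, dim, k, heads=4):
        super().__init__()
        self.select = GatedTokenSelection(dim, k)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, question):
        x = self.select(tokens, question)
        out, _ = self.attn(x, x, x)  # model internal dependencies of the kept tokens
        return out
```

In this reading, the same block would be applied separately to the appearance and the motion sequence, so each modality retains only its answer-relevant tokens before the synchronization step.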

The remainder of the paper is organized as follows. We first survey the related work. Then, the problem statement and the details of our model are presented. Next, we describe the experiments and analyze the results. Finally, the paper is concluded.

Related work

VideoQA has been proposed as a challenging task in understanding the rich spatio-temporal information of videos in recent years. It requires an agent to automatically infer the answer to a free-form, open-ended question about the content of a video. Compared with image-based multimedia tasks, e.g., image question answering [18], [19], [20], [21], image captioning [22], [23], [24], [25], and text-to-image retrieval [26], [27], [28], [29], VideoQA is more challenging due to the complex temporal…

Problem statement

In this section, we first present the definition of the VideoQA problem we study and then introduce the feature representation method used in our work.

Given a question sentence Q and the corresponding video V, Video Question Answering (VideoQA) aims to generate the correct answer A. Like previous methods [15], [42], the VideoQA task is treated as a maximum likelihood estimation problem. It can be solved by computing the likelihood probability distribution pvideoQA. For each answer A in the…
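
The maximum likelihood treatment referred to above is conventionally written as the objective below; the symbols for the candidate answer set and the model parameters are our own shorthand rather than notation taken from the paper.

```latex
% VideoQA as maximum likelihood estimation (standard formulation):
% \mathcal{A} denotes the candidate answer set, \theta the model parameters.
A^{*} = \arg\max_{A \in \mathcal{A}} \; p_{\mathrm{videoQA}}\left(A \mid Q, V; \theta\right)
```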

Proposed model

Figure 2 illustrates the framework of our proposed DSAVS. Specifically, the framework mainly contains two salient components, i.e., the Dynamic Self-Attention (DynSA) block and the Vision Synchronization (VS) block. The DynSA block is proposed to identify the question-related key video frames and clips. Meanwhile, the internal dependencies within the key frames and clips are modeled respectively to learn more effective visual representations. The VS block is used to synchronize appearance…
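
As a rough illustration of the synchronization idea, the sketch below fuses appearance and motion features per time slice with a question-conditioned gate. This is an assumption-laden sketch in the same PyTorch style as before, not the paper's actual VS block, whose fusion function may differ.

```python
import torch
import torch.nn as nn


class VisionSyncBlock(nn.Module):
    """Question-guided fusion of appearance and motion features per time slice.

    Illustrative sketch only; the fusion used in the paper may differ.
    """

    def __init__(self, dim):
        super().__init__()
        self.proj_a = nn.Linear(dim, dim)    # appearance projection
        self.proj_m = nn.Linear(dim, dim)    # motion projection
        self.gate = nn.Linear(3 * dim, dim)  # question-conditioned fusion gate

    def forward(self, app, mot, question):
        # app, mot: (B, T, D) appearance/motion features aligned per time slice
        # question: (B, D)    question embedding
        q = question.unsqueeze(1).expand_as(app)
        g = torch.sigmoid(self.gate(torch.cat([app, mot, q], dim=-1)))  # per-slice gate in (0, 1)
        return g * self.proj_a(app) + (1.0 - g) * self.proj_m(mot)      # synchronized feature
```

Because the gate is computed per time slice, each visual object's appearance is weighed against the motion observed in the same slice, which is the alignment property the VS block is designed to exploit.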

Datasets

Among the recently proposed VideoQA data sets, MSVD-QA [15], MSRVTT-QA [15] and YouTube2Text-QA [11] are three challenging ones:

MSVD-QA is constructed based on the Microsoft Research Video Description Corpus [54], which is widely employed in the video captioning task. This data set consists of 50,505 question-answer pairs and 1,970 videos. It is divided into three splits: training (61%), validation (13%), and testing (26%). Table 1 shows the statistics of the MSVD-QA data set.

MSRVTT-QA is a…

Conclusion and future work

In this work, we aim to tackle two problems in the Video Question Answering (VideoQA) task: (1) the answer to a question is often reflected in only a few frames and video clips, while most video information is superfluous; (2) the appearance and motion features are usually concomitant and complementary to each other at the time slice level. We propose a novel model, i.e., Dynamic Self-Attention with Vision Synchronization Networks (DSAVS), for VideoQA. It dynamically selects the important tokens from…

Declaration of Competing Interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work. There is no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.

Acknowledgments

This work was supported by the Zunyi Technology and Big Data Bureau and Moutai Institute Joint Science and Technology Research and Development Project (ZSKHHZ[2022] No.167, No.170, No.160, and No.165), by the Youth Science and Technology Talents Development Project of Guizhou Education Department (Qian Jiaohe KY Zi [2020] 225), and by the Program of Basic Research in Guizhou Province (Science and Technology Foundation of Guizhou Province) under Grant ZK[2022]YB539.

References (58)

  • Y. Jang et al., TGIF-QA: toward spatio-temporal reasoning in visual question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • M. Tapaswi et al., MovieQA: understanding stories in movies through question-answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • K. Zeng et al., Leveraging video descriptions to learn video question answering, Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (2017)
  • X. Song et al., Explore multi-step reasoning in video question answering, Proceedings of the ACM Multimedia Conference (2018)
  • L. Gao et al., Structured two-stream attention network for video question answering, Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (2019)
  • Y. Yu et al., End-to-end concept word detection for video captioning, retrieval, and question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • Y. Ye et al., Video question answering via attribute-augmented attention network learning, Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (2017)
  • X. Li et al., Beyond RNNs: positional self-attention with co-attention for video question answering, Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (2019)
  • X. Li et al., Learnable aggregating net with diversity learning for video question answering, Proceedings of the 27th ACM International Conference on Multimedia (2019)
  • Z. Zhao et al., Video question answering via hierarchical spatio-temporal attention networks, Proceedings of the International Joint Conference on Artificial Intelligence (2017)
  • D. Xu et al., Video question answering via gradually refined attention over appearance and motion, Proceedings of the ACM Multimedia Conference (2017)
  • J. Gao et al., Motion-appearance co-memory networks for video question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • C. Fan et al., Heterogeneous memory enhanced multimodal attention model for video question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
  • J. Lu et al., Hierarchical question-image co-attention for visual question answering, Proceedings of Advances in Neural Information Processing Systems (2016)
  • P. Anderson et al., Bottom-up and top-down attention for image captioning and visual question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • Y. Liu et al., Adversarial learning with multi-modal attention for visual question answering, IEEE Transactions on Neural Networks and Learning Systems (2020)
  • K. Xu et al., Show, attend and tell: neural image caption generation with visual attention, Proceedings of the International Conference on Machine Learning (2015)
  • A. Karpathy et al., Deep visual-semantic alignments for generating image descriptions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
  • O. Vinyals et al., Show and tell: a neural image caption generator, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)

    Yun Liu received the B.Sc. degree in computer science and technology from Sichuan University, Chengdu, China, in 2015, and the Ph.D. degree in computer science and technology from Beihang University, Beijing, China, in 2022. He is currently an associate professor with the department of automation, Moutai Institute, Renhuai, China. His research interests include social media analysis, multimodal data analysis, and data mining.

    Xiaoming Zhang received the B.Sc. and the M.Sc. degrees from the National University of Defence Technology, Changsha, China, in 2003 and 2007, respectively, and the Ph.D. degree in computer science from Beihang University, Beijing, China, in 2012. He is currently with the School of Cyber Science and Technology, Beihang University, where he has been an associate professor. He has published over 40 papers, such as TOIS, TMM, TIP, TCYB, WWWJ, Signal Processing, ACM MM, AAAI, IJCAI, CIKM, ICMR, SDM, and EMNLP. His current research interests include social media analysis and text mining.

    Feiran Huang received the B.Sc. degree from Central South University, Changsha, China, in 2011, and the Ph.D. degree in the School of Computer Science and Engineering, Beihang University, Beijing, China, in 2018. He is currently with the College of Cyber Security/College of Information Science and Technology, Jinan University, where he has been a lecturer since 2018. He has published over 20 papers, such as TIP, TCYB, TOMM, TII, ACM MM, ICMR, and CIKM. His research interests include social media analysis, multimodal data analysis, and data mining.

    Zhoujun Li received the M.Sc. and Ph.D. degrees in computer science from the National University of Defence Technology, Changsha, China, in 1984 and 1999, respectively. He is currently with the School of Computer Science and Engineering, Beihang University, Beijing, China, where he has been a professor since 2001. He has published over 150 papers on international journals such as TKDE, TIP, TMM, TCYB, TOIS, WWWJ, and Information Science, and international conferences such as SIGKDD, ACL, SIGIR, AAAI, IJCAI, MM, CIKM, EMNLP, SDM, and WSDM. His current research interests include data mining, information retrieval, and database. Dr. Li was a PC Member of several international conferences, such as SDM 2015, CIKM 2013, WAIM 2012, and PRICAI 2012.
