Dynamic self-attention with vision synchronization networks for video question answering
Introduction
With the rapid development of computer vision and natural language processing, tasks involving both vision and language have attracted considerable interest. A new problem named Visual Question Answering (VQA), including Image Question Answering (ImageQA) [1], [2], [3], [4] and Video Question Answering (VideoQA) [5], [6], [7], [8], has emerged as a promising but intractable research topic. It requires an agent to provide the correct answer to a question about the content of a given image or video. Compared with ImageQA, where an image contains only appearance features, VideoQA is more challenging for three reasons: (1) it deals with a long sequence of images that contains not only appearance but also motion features; (2) substantial redundant information exists in the video; (3) different questions require different parts of the video to infer the answer. Nevertheless, progress on VideoQA benefits various real-life applications, including tourist assistance, automatic customer service, and human-machine interaction [9].
A number of methods have emerged in recent years to tackle the difficulties of VideoQA. These methods can be broadly divided into two categories according to the way visual features are processed. The first category [9], [10], [11], [12] uses only the appearance features of the video. The work presented in [13] introduces a learnable aggregating network based solely on appearance features. It mainly learns cross-modal correlations with diversity learning schemes, which neglects the motion information of the video and thus struggles to capture fine-grained temporal features. The second category [14], [15], [16], [17] incorporates both appearance and motion features as input. The model proposed in [15] gradually refines its attention over both appearance and motion features using the question as guidance. However, since the appearance and motion features extracted from a video contain redundant and interfering information, inferring the answer directly from them degrades performance. Besides, this group of methods processes the appearance and motion features separately, neglecting the synchrony relation between the two types of features.
Meanwhile, the unique characteristics of visual appearance and motion features in videos provide important clues for VideoQA. First, the answer to a question is usually reflected in a few key frames or video clips, and most video information is superfluous. Inferring the answer directly from all the appearance and motion information is not only computationally expensive but also introduces a lot of noise. Consider the examples shown in Fig. 1(a). In the left instance, the question “what color is the phone in the hand of the woman?” can be answered from the frames that contain a phone in the hand of a woman. However, the phone does not appear in most frames of the video, which means that most of the information in the video is useless for this question. In the right instance, a little boy is making a phone call in the video. The action “press the phone” exists only in a short video clip, so the information outside this short clip is useless for answering the question “How many times did the boy press the phone?” Second, the visual objects have their own activities in a video, and thus the appearance and motion features are often concomitant and complementary to each other at the time slice level. Processing the appearance and motion features independently may lead to mistakes. Take the example shown in Fig. 1(b). The video mainly contains two parts of information, i.e., a boy is jumping and a dog is running. The appearance features are extracted as the sequence “boy-boy-…-dog-dog”, while the motion features are extracted as “jump-jump-…-running-running”. Most existing approaches use an attention mechanism to concentrate on the appearance of the “boy” for the question “what is the boy in the white vest doing?” They then prefer “running” under the guidance of “boy”, since “running” co-occurs with “boy” more often than “jumping” does in the training set.
As a result, the wrong answer “running” is inferred. The reason is that existing methods apply the attention mechanism to appearance and motion features separately, neglecting the synchrony relation between the two types of visual features. Clearly, the appearance and motion features play different roles in visual reasoning, and they are concomitant and complementary in time series. Therefore, the latent correlations between the appearance and motion features should be exploited, so that the action of the boy can be matched to jumping and the action of the dog to running.
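To make the mismatch concrete, here is a toy sketch in plain Python (the labels and the co-occurrence prior are hypothetical, ours rather than the paper's): attending to motion independently picks the globally more likely action, while pairing appearance and motion at the same time slice recovers the correct one.

```python
# Toy illustration (hypothetical labels and prior, not the paper's model).
appearance = ["boy", "boy", "dog", "dog"]    # per-time-slice appearance labels
motion = ["jump", "jump", "run", "run"]      # per-time-slice motion labels

# Hypothetical bias from a training set where "boy" co-occurs with "run"
# more often than with "jump".
motion_prior = {"jump": 0.3, "run": 0.7}

# Independent attention: pick the motion with the highest prior, ignoring time.
independent_answer = max(set(motion), key=lambda m: motion_prior[m])

# Synchronized attention: find the time slices whose appearance is "boy",
# then read the motion from those same slices.
boy_slices = [t for t, a in enumerate(appearance) if a == "boy"]
synchronized_answer = motion[boy_slices[0]]

print(independent_answer)   # the biased answer: "run"
print(synchronized_answer)  # the correct answer: "jump"
```

The point of the toy example is only that the pairing must happen per time slice; the paper's model learns this alignment rather than using hard label matching.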
However, it is quite difficult to capture the answer-related key information and the correlation between appearance and motion features. First, there is no explicit knowledge about which video frames and clips are relevant to the answer. Second, the appearance and motion features are represented in heterogeneous spaces, and it is difficult to find the motion features for a given visual object. To tackle these challenges, we propose to exploit the latent correlation between different types of video features and the answer for VideoQA. In particular, we investigate: (1) how to mine the answer-related frames and clips in the video; (2) how to learn the correlation between the appearance and motion features. Our solutions to these questions result in a novel VideoQA model, i.e., Dynamic Self-Attention with Vision Synchronization Networks (DSAVS). To automatically identify the key video frames and clips used for answer inference, a gated token selection mechanism with a dynamic self-attention block is proposed to select the supportive tokens from the input sequence. To effectively learn the correlation between appearance and motion features, we synchronize them at the time slice level with a vision synchronization block, where the two types of features are fused under the guidance of the question sentence. Then, both the dynamic self-attention and vision synchronization blocks are integrated into an end-to-end framework to infer the answer. The main contributions are summarized as follows:
- We propose a dynamic self-attention method to automatically select important video information to learn internal dependencies, avoiding a lot of redundant and noisy information.
- Unlike the existing VideoQA methods, we introduce a vision synchronization model to synchronize the appearance and motion features, which can avoid the misalignment of appearance and motion in time series.
- Extensive experiments conducted on three public VideoQA data sets confirm the effectiveness and superiority of the proposed DSAVS compared with state-of-the-art methods.
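As a rough illustration of the gated token-selection idea, the following sketch (random weights and illustrative names; not the paper's implementation) scores frame tokens with self-attention, gates them with a sigmoid, and keeps the top-k tokens in temporal order.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_token_selection(tokens, w_q, w_k, w_gate, top_k):
    """Score tokens with self-attention, gate them, and keep the top_k.

    tokens: (n, d) array of frame/clip features.
    Returns the selected (top_k, d) tokens, ordered by time.
    """
    q, k = tokens @ w_q, tokens @ w_k
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # (n, n) self-attention map
    importance = attn.mean(axis=0)                   # how much each token is attended
    gate = 1.0 / (1.0 + np.exp(-(tokens @ w_gate)))  # sigmoid gate per token
    scores = importance * gate.squeeze(-1)
    keep = np.sort(np.argsort(scores)[-top_k:])      # top_k indices, in time order
    return tokens[keep]

rng = np.random.default_rng(0)
n, d = 16, 8                                         # 16 frames, 8-dim features
frames = rng.normal(size=(n, d))
w_q, w_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
w_gate = rng.normal(size=(d, 1))
selected = gated_token_selection(frames, w_q, w_k, w_gate, top_k=4)
print(selected.shape)  # (4, 8)
```

In the actual model the projections are learned end-to-end and the selection is applied to both frame-level and clip-level sequences; the sketch only shows the score-gate-select flow.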
The remainder of the paper is organized as follows. We first survey the related work. Then, the problem statement and the details of our model are presented. Next, we describe the experiments and analyze the results. Finally, we conclude the paper.
Section snippets
Related work
In recent years, VideoQA has emerged as a challenging task that requires understanding the rich spatio-temporal information of videos. It requires the agent to automatically infer the answer to a free-form, open-ended question about the content of a video. Compared with image-based multimedia tasks, e.g., image question answering [18], [19], [20], [21], image captioning [22], [23], [24], [25], and text-to-image retrieval [26], [27], [28], [29], VideoQA is more challenging due to the complex temporal
Problem statement
In this section, we first present the definition of the VideoQA problem we study and then introduce the feature representation method used in our work.
Given a question sentence q and the corresponding video v, Video Question Answering (VideoQA) aims to generate the correct answer a. Like previous methods [15], [42], the VideoQA task is treated as a maximum likelihood estimation problem, which can be solved by computing the likelihood probability distribution p(a | q, v). For each answer in the
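Under this formulation, answering reduces to scoring every candidate answer and returning the one that maximizes p(a | q, v). A minimal sketch with hypothetical, unnormalized answer scores (the scores and answer set are illustrative only):

```python
import math

def infer_answer(scores):
    """Pick the answer maximizing p(a | q, v), given unnormalized scores.

    scores: dict mapping candidate answers to model scores (logits).
    Returns the argmax answer and the softmax-normalized distribution.
    """
    z = sum(math.exp(s) for s in scores.values())
    probs = {a: math.exp(s) / z for a, s in scores.items()}
    return max(probs, key=probs.get), probs

# Hypothetical scores for one question; not from the paper.
answer, probs = infer_answer({"jump": 2.1, "run": 0.4, "sit": -1.0})
print(answer)  # "jump"
```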
Proposed model
Figure 2 illustrates the framework of our proposed DSAVS. Specifically, it mainly contains two salient components, i.e., the Dynamic Self-Attention (DynSA) block and the Vision Synchronization (VS) block. The DynSA block is proposed to identify the question-related key video frames and clips. Meanwhile, the internal dependencies within the key frames and clips are modeled to learn more effective visual representations. The VS block is used to synchronize appearance
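One possible reading of the time-slice-level synchronization, sketched with random features (the weighting scheme and all names here are our assumptions, not the paper's exact design): the question embedding decides, at each time slice, how to mix the appearance and motion features.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def synchronize(appearance, motion, question, w_a, w_m):
    """Fuse appearance and motion features per time slice, guided by the question.

    appearance, motion: (T, d) feature sequences on a common time axis.
    question: (d,) sentence embedding used to weight the two modalities.
    Returns a (T, d) synchronized visual representation.
    """
    a_score = (appearance @ w_a) @ question   # (T,) appearance relevance per slice
    m_score = (motion @ w_m) @ question       # (T,) motion relevance per slice
    fused = []
    for t in range(appearance.shape[0]):
        w = softmax(np.array([a_score[t], m_score[t]]))
        fused.append(w[0] * appearance[t] + w[1] * motion[t])
    return np.stack(fused)

rng = np.random.default_rng(1)
T, d = 8, 16
app, mot, q = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=d)
w_a, w_m = rng.normal(size=(d, d)), rng.normal(size=(d, d))
out = synchronize(app, mot, q, w_a, w_m)
print(out.shape)  # (8, 16)
```

Because the mixing weights are recomputed per time slice, the motion attended to for a given object always comes from the same slices as that object's appearance, which is the synchrony property the VS block is designed to enforce.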
Datasets
Among the recently proposed VideoQA data sets, MSVD-QA [15], MSRVTT-QA [15] and YouTube2Text-QA [11] are three challenging ones:
MSVD-QA is constructed based on the Microsoft Research Video Description Corpus [54], which is widely employed in the video captioning task. This data set consists of 50,505 question-answer pairs and 1,970 videos. It is divided into three splits: training (61%), validation (13%), and testing (26%). Table 1 shows the statistics of the MSVD-QA data set.
MSRVTT-QA is a
Conclusion and future work
In this work, we aim to tackle two problems in the Video Question Answering (VideoQA) task: (1) the answer to a question is often reflected in a few frames and video clips, while most video information is superfluous; (2) the appearance and motion features are usually concomitant and complementary to each other at the time slice level. We propose a novel model, i.e., Dynamic Self-Attention with Vision Synchronization Networks (DSAVS), for VideoQA. It dynamically selects the important tokens from
Declaration of Competing Interest
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work. There is no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled.
Acknowledgments
This work was supported by the Zunyi Technology and Big data Bureau, Moutai Institute Joint Science and Technology Research and Development Project (ZSKHHZ[2022] No.167, No.170, No.160, and No.165), supported by the Youth Science and Technology Talents Development Project of Guizhou Education Department (Qian Jiaohe KY Zi [2020] 225), and supported by the Program of Basic Research in Guizhou Province (Science and Technology Foundation of Guizhou Province) under Grant ZK[2022]YB539.
References (58)
- et al., Text-instance graph: exploring the relational semantics for text-based visual question answering, Pattern Recognit (2022)
- et al., Accuracy vs. complexity: a trade-off in visual question answering models, Pattern Recognit (2021)
- et al., End-to-end supermask pruning: learning to prune image captioning models, Pattern Recognit (2022)
- et al., Multi-task framework based on feature separation and reconstruction for cross-modal retrieval, Pattern Recognit (2022)
- et al., Generalized pyramid co-attention with learnable aggregation net for video question answering, Pattern Recognit (2021)
- et al., Long video question answering: a matching-guided attention model, Pattern Recognit (2020)
- et al., Dual self-attention with co-attention networks for visual question answering, Pattern Recognit (2021)
- et al., VQA: visual question answering, Proceedings of the IEEE International Conference on Computer Vision (2015)
- et al., Stacked attention networks for image question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
- et al., Adversarial learning of answer-related representation for visual question answering, Proceedings of the 27th ACM International Conference on Information and Knowledge Management (2018)
- TGIF-QA: toward spatio-temporal reasoning in visual question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- MovieQA: understanding stories in movies through question-answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Leveraging video descriptions to learn video question answering, Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence
- Explore multi-step reasoning in video question answering, Proceedings of the ACM Multimedia Conference
- Structured two-stream attention network for video question answering, Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence
- End-to-end concept word detection for video captioning, retrieval, and question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Video question answering via attribute-augmented attention network learning, Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
- Beyond RNNs: positional self-attention with co-attention for video question answering, Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence
- Learnable aggregating net with diversity learning for video question answering, Proceedings of the 27th ACM International Conference on Multimedia
- Video question answering via hierarchical spatio-temporal attention networks, IJCAI
- Video question answering via gradually refined attention over appearance and motion, Proceedings of the ACM Multimedia Conference
- Motion-appearance co-memory networks for video question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Heterogeneous memory enhanced multimodal attention model for video question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Hierarchical question-image co-attention for visual question answering, Proceedings of Advances in Neural Information Processing Systems
- Bottom-up and top-down attention for image captioning and visual question answering, IEEE Conference on Computer Vision and Pattern Recognition
- Adversarial learning with multi-modal attention for visual question answering, IEEE Trans Neural Netw Learn Syst
- Show, attend and tell: neural image caption generation with visual attention, Proceedings of the International Conference on Machine Learning
- Deep visual-semantic alignments for generating image descriptions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Show and tell: a neural image caption generator, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Cited by (7)
- Multi-agent dueling Q-learning with mean field and value decomposition, 2023, Pattern Recognition
- Video question answering via traffic knowledge database and question classification, 2024, Multimedia Systems
- Appearance-Motion Dual-Stream Heterogeneous Network for VideoQA, 2024, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
- Hierarchical Synergy-Enhanced Multimodal Relational Network for Video Question Answering, 2023, ACM Transactions on Multimedia Computing, Communications and Applications
Yun Liu received the B.Sc. degree in computer science and technology from Sichuan University, Chengdu, China, in 2015, and the Ph.D. degree in computer science and technology from Beihang University, Beijing, China, in 2022. He is currently an associate professor with the department of automation, Moutai Institute, Renhuai, China. His research interests include social media analysis, multimodal data analysis, and data mining.
Xiaoming Zhang received the B.Sc. and the M.Sc. degrees from the National University of Defence Technology, Changsha, China, in 2003 and 2007, respectively, and the Ph.D. degree in computer science from Beihang University, Beijing, China, in 2012. He is currently with the School of Cyber Science and Technology, Beihang University, where he has been an associate professor. He has published over 40 papers, such as TOIS, TMM, TIP, TCYB, WWWJ, Signal Processing, ACM MM, AAAI, IJCAI, CIKM, ICMR, SDM, and EMNLP. His current research interests include social media analysis and text mining.
Feiran Huang received the B.Sc. degree from Central South University, Changsha, China, in 2011, and the Ph.D. degree in the School of Computer Science and Engineering, Beihang University, Beijing, China, in 2018. He is currently with the College of Cyber Security/College of Information Science and Technology, Jinan University, where he has been a lecturer since 2018. He has published over 20 papers, such as TIP, TCYB, TOMM, TII, ACM MM, ICMR, and CIKM. His research interests include social media analysis, multimodal data analysis, and data mining.
Zhoujun Li received the M.Sc. and Ph.D. degrees in computer science from the National University of Defence Technology, Changsha, China, in 1984 and 1999, respectively. He is currently with the School of Computer Science and Engineering, Beihang University, Beijing, China, where he has been a professor since 2001. He has published over 150 papers on international journals such as TKDE, TIP, TMM, TCYB, TOIS, WWWJ, and Information Science, and international conferences such as SIGKDD, ACL, SIGIR, AAAI, IJCAI, MM, CIKM, EMNLP, SDM, and WSDM. His current research interests include data mining, information retrieval, and database. Dr. Li was a PC Member of several international conferences, such as SDM 2015, CIKM 2013, WAIM 2012, and PRICAI 2012.