Pattern Recognition, Volume 132, December 2022, 108959

Dynamic self-attention with vision synchronization networks for video question answering

https://doi.org/10.1016/j.patcog.2022.108959

Highlights

  • A novel token selection mechanism based on the dynamic self-attention network is proposed to automatically extract important video features.

  • A vision synchronization network is proposed to align appearance and motion features at the time slice level.

  • Extensive experiments and analysis confirm the superiority of the proposed model DSAVS.

Abstract

Video Question Answering (VideoQA) has gained increasing attention as an important task in understanding rich spatio-temporal content, i.e., the appearance and motion in a video. However, existing approaches mainly use the question to learn attention over all the sampled appearance and motion features separately, which neglects two properties of VideoQA: (1) the answer to a question is often reflected in only a few frames and video clips, while most video content is superfluous; (2) appearance and motion features are usually concomitant and complementary to each other in time series. In this paper, we propose a novel VideoQA model, i.e., Dynamic Self-Attention with Vision Synchronization Networks (DSAVS), to address these problems. Specifically, a gated token selection mechanism is proposed to dynamically select the important tokens from the appearance and motion sequences. The selected tokens are fed into a self-attention mechanism that models their internal dependencies for more effective representation learning. To capture the correlation between appearance and motion features, a vision synchronization block is proposed to synchronize the two types of vision features at the time slice level. As a result, visual objects can be correlated with their corresponding activities, and performance is further improved. Extensive experiments conducted on three public VideoQA data sets confirm the effectiveness and superiority of our model compared with state-of-the-art methods.

Introduction

With the rapid development of computer vision and natural language processing, tasks involving both vision and language have attracted considerable interest. A new problem named Visual Question Answering (VQA), including Image Question Answering (ImageQA) [1], [2], [3], [4] and Video Question Answering (VideoQA) [5], [6], [7], [8], has emerged as a promising but intractable research topic. It requires an agent to provide the correct answer to a question about the content of a given image or video. Compared with ImageQA, where an image contains only appearance features, VideoQA is more challenging for three reasons: (1) it deals with a long sequence of images that contain not only appearance but also motion features; (2) substantial redundant information exists in the video; (3) different questions require different parts of the video to infer the answer. Nevertheless, progress on VideoQA is beneficial for various real-life applications, including tourist assistance, automatic customer service, and human-machine interaction [9].

A number of methods have been proposed in recent years to tackle the difficulties of VideoQA. These methods can be broadly divided into two categories depending on how visual features are processed. The first type of method [9], [10], [11], [12] uses only the appearance features of the video. The work presented in [13] introduces a learnable aggregating network with only appearance features. It mainly learns cross-modal correlations using diversity learning schemes, which neglects the motion information of the video and thus struggles to capture fine-grained features in time series. The second type of method [14], [15], [16], [17] takes both appearance and motion features as input. The model proposed in [15] gradually refines its attention over both appearance and motion features using the question as guidance. However, since the appearance and motion features extracted from a video contain redundant and interfering information, directly inferring the answer from all of them negatively impacts performance. Besides, this group of methods handles the appearance and motion features separately, which neglects the synchrony relation between the two types of features.

Meanwhile, the unique characteristics of visual appearance and motion features in videos provide important clues for VideoQA. First, the answer to a question is usually reflected in a few key frames or video clips, and most of the video information is superfluous. Inferring the answer directly from all the appearance and motion information is not only computationally expensive but also introduces a lot of noise. Consider the examples shown in Fig. 1(a). In the left instance, the question "what color is the phone in the hand of the woman?" can be answered from the frames that contain a phone in the hand of a woman. However, the object phone does not appear in most frames of the video, which means that most of the information in the video is useless for this question. In the right instance, a little boy is making a phone call in the video. The action "press the phone" only exists in a short video clip. Therefore, the information outside this short clip is useless for answering the question "How many times did the boy press the phone?" Second, the visual objects have their own activities in a video, and thus the appearance and motion features are often concomitant and complementary to each other at the time slice level. Handling the appearance and motion features independently may therefore lead to mistakes. Take the example shown in Fig. 1(b) as an instance. The video mainly contains two parts of information, i.e., a boy is jumping and a dog is running. The appearance features are extracted as a sequence "boy-boy-...-dog-dog", while the motion features are extracted as "jump-jump-...-running-running". Most existing approaches concentrate on the appearance of the "boy" for the question "what is the boy in the white vest doing?" with an attention mechanism. They also prefer "running" under the guidance of "boy" from the video, because the motion "running" has a higher probability than "jumping" when combined with "boy", i.e., p(boy, running) > p(boy, jumping) in the training set. A wrong answer, "running", is then inferred. The reason is that existing methods apply the attention mechanism to appearance and motion features separately, which neglects the synchrony relation between the two types of visual features. Obviously, the appearance and motion features play different roles in visual reasoning, and they are concomitant and complementary in time series. Therefore, these latent correlations between the appearance and motion features should be exploited, so that the action of the boy can be matched to jumping and the action of the dog to running.

However, it is quite difficult to capture the answer-related key information and the correlation between appearance and motion features. First, there is no explicit knowledge about which video frames and clips are relevant to the answer. Second, the appearance and motion features are represented in heterogeneous spaces, and it is difficult to find the motion features for a given visual object. To tackle these challenges, we propose to take advantage of the latent correlation between the different types of video features and the answer for VideoQA. In particular, we investigate: (1) how to mine the answer-related frames and clips in the video; (2) how to learn the correlation between the appearance and motion features. Our solutions to these questions result in a novel VideoQA model, i.e., Dynamic Self-Attention with Vision Synchronization Networks (DSAVS). To automatically find the key video frames and clips used for answer inference, a gated token selection mechanism with a dynamic self-attention block is proposed to select the supportive tokens from the input sequence (an illustrative sketch of this selection step is given after the contribution list below). To effectively learn the correlation between appearance and motion features, we synchronize them at the time slice level with a vision synchronization block, where the two types of features are fused under the guidance of the question sentence. Both the dynamic self-attention and vision synchronization blocks are then integrated into an end-to-end framework to infer the answer. The main contributions are summarized as follows:

  • We propose a dynamic self-attention method that automatically selects important video information and learns its internal dependencies, avoiding a large amount of redundant and noisy information.

  • Unlike existing VideoQA methods, we introduce a vision synchronization block to align the appearance and motion features, which avoids the misalignment of appearance and motion in time series.

  • Extensive experiments conducted on three public VideoQA data sets confirm the effectiveness and superiority of the proposed DSAVS compared with state-of-the-art methods.
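
To make the token selection idea above concrete, the following is a minimal PyTorch-style sketch of a question-conditioned gate that keeps only the top-k visual tokens before applying self-attention. The module names, the top-k gate, and the use of nn.MultiheadAttention are illustrative assumptions on our part; the paper's actual DynSA block may be formulated differently.

```python
import torch
import torch.nn as nn


class GatedTokenSelection(nn.Module):
    """Score each visual token with a question-conditioned gate and keep the top-k.

    Hypothetical sketch: the gating form used in the paper may differ.
    """

    def __init__(self, dim, k):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(2 * dim, 1)  # scores a token given the question

    def forward(self, tokens, question):
        # tokens:   (B, T, D) appearance or motion features
        # question: (B, D)    sentence-level question embedding
        q = question.unsqueeze(1).expand(-1, tokens.size(1), -1)
        scores = torch.sigmoid(self.gate(torch.cat([tokens, q], dim=-1))).squeeze(-1)  # (B, T)
        idx = scores.topk(self.k, dim=-1).indices                                      # (B, k)
        return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))     # (B, k, D)


class DynSABlock(nn.Module):
    """Self-attention applied only to the selected (supportive) tokens."""

    def __init__(self, dim, k, heads=4):
        super().__init__()
        self.select = GatedTokenSelection(dim, k)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens, question):
        x = self.select(tokens, question)
        out, _ = self.attn(x, x, x)  # model internal dependencies of the kept tokens
        return out
```

In this reading, the same block would be applied separately to the appearance and the motion sequence, so each modality retains only its answer-relevant tokens before the synchronization step.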

The remainder of the paper is organized as follows. We first survey the related work. Then, the problem statement and the details of our model are presented. Next, we describe the experiments and analyze the results. Finally, the paper is concluded.

Related work

VideoQA has been proposed as a challenging task in understanding the rich spatio-temporal information of videos in recent years. It requires an agent to automatically infer the answer to a free-form, open-ended question about the content of a video. Compared with image-based multimedia tasks, e.g., image question answering [18], [19], [20], [21], image captioning [22], [23], [24], [25], and text-to-image retrieval [26], [27], [28], [29], VideoQA is more challenging due to the complex temporal…

Problem statement

In this section, we first present the definition of the VideoQA problem we study and then introduce the feature representation method used in our work.

Given a question sentence Q and the corresponding video V, Video Question Answering (VideoQA) aims to generate the correct answer A. Like previous methods [15], [42], the VideoQA task is treated as a maximum likelihood estimation problem. It can be solved by computing the likelihood probability distribution pvideoQA. For each answer A in the…
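
The maximum likelihood treatment referred to above is conventionally written as the objective below; the symbols for the candidate answer set and the model parameters are our own shorthand rather than notation taken from the paper.

```latex
% VideoQA as maximum likelihood estimation (standard formulation):
% \mathcal{A} denotes the candidate answer set, \theta the model parameters.
A^{*} = \arg\max_{A \in \mathcal{A}} \; p_{\mathrm{videoQA}}\left(A \mid Q, V; \theta\right)
```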

Proposed model

Figure 2 illustrates the framework of our proposed DSAVS. Specifically, the framework mainly contains two salient components, i.e., the Dynamic Self-Attention (DynSA) block and the Vision Synchronization (VS) block. The DynSA block is proposed to identify the question-related key video frames and clips. Meanwhile, the internal dependencies within the key frames and clips are modeled respectively to learn more effective visual representations. The VS block is used to synchronize appearance…
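
As a rough illustration of the synchronization idea, the sketch below fuses appearance and motion features per time slice with a question-conditioned gate. This is an assumption-laden sketch in the same PyTorch style as before, not the paper's actual VS block, whose fusion function may differ.

```python
import torch
import torch.nn as nn


class VisionSyncBlock(nn.Module):
    """Question-guided fusion of appearance and motion features per time slice.

    Illustrative sketch only; the fusion used in the paper may differ.
    """

    def __init__(self, dim):
        super().__init__()
        self.proj_a = nn.Linear(dim, dim)    # appearance projection
        self.proj_m = nn.Linear(dim, dim)    # motion projection
        self.gate = nn.Linear(3 * dim, dim)  # question-conditioned fusion gate

    def forward(self, app, mot, question):
        # app, mot: (B, T, D) appearance/motion features aligned per time slice
        # question: (B, D)    question embedding
        q = question.unsqueeze(1).expand_as(app)
        g = torch.sigmoid(self.gate(torch.cat([app, mot, q], dim=-1)))  # per-slice gate in (0, 1)
        return g * self.proj_a(app) + (1.0 - g) * self.proj_m(mot)      # synchronized feature
```

Because the gate is computed per time slice, each visual object's appearance is weighed against the motion observed in the same slice, which is the alignment property the VS block is designed to exploit.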

Datasets

Among the recently proposed VideoQA data sets, MSVD-QA [15], MSRVTT-QA [15] and YouTube2Text-QA [11] are three challenging ones:

MSVD-QA is constructed based on the Microsoft Research Video Description Corpus [54], which is widely employed in the video captioning task. This data set consists of 50,505 question-answer pairs and 1,970 videos. It is divided into three splits: training (61%), validation (13%), and testing (26%). Table 1 shows the statistics of the MSVD-QA data set.

MSRVTT-QA is a…

Conclusion and future work

In this work, we aim to tackle two problems in the Video Question Answering (VideoQA) task: (1) the answer to a question is often reflected in only a few frames and video clips, while most video information is superfluous; (2) the appearance and motion features are usually concomitant and complementary to each other at the time slice level. We propose a novel model, i.e., Dynamic Self-Attention with Vision Synchronization Networks (DSAVS), for VideoQA. It dynamically selects the important tokens from…

Declaration of Competing Interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work. There is no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.

Acknowledgments

This work was supported by the Zunyi Technology and Big Data Bureau and Moutai Institute Joint Science and Technology Research and Development Project (ZSKHHZ[2022] No.167, No.170, No.160, and No.165), by the Youth Science and Technology Talents Development Project of Guizhou Education Department (Qian Jiaohe KY Zi [2020] 225), and by the Program of Basic Research in Guizhou Province (Science and Technology Foundation of Guizhou Province) under Grant ZK[2022]YB539.

References (58)

  • Y. Jang et al., TGIF-QA: toward spatio-temporal reasoning in visual question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • M. Tapaswi et al., MovieQA: understanding stories in movies through question-answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
  • K. Zeng et al., Leveraging video descriptions to learn video question answering, Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (2017)
  • X. Song et al., Explore multi-step reasoning in video question answering, Proceedings of the ACM Multimedia Conference (2018)
  • L. Gao et al., Structured two-stream attention network for video question answering, Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (2019)
  • Y. Yu et al., End-to-end concept word detection for video captioning, retrieval, and question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • Y. Ye et al., Video question answering via attribute-augmented attention network learning, Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (2017)
  • X. Li et al., Beyond RNNs: positional self-attention with co-attention for video question answering, Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (2019)
  • X. Li et al., Learnable aggregating net with diversity learning for video question answering, Proceedings of the 27th ACM International Conference on Multimedia (2019)
  • Z. Zhao et al., Video question answering via hierarchical spatio-temporal attention networks, Proceedings of the International Joint Conference on Artificial Intelligence (2017)
  • D. Xu et al., Video question answering via gradually refined attention over appearance and motion, Proceedings of the ACM Multimedia Conference (2017)
  • J. Gao et al., Motion-appearance co-memory networks for video question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • C. Fan et al., Heterogeneous memory enhanced multimodal attention model for video question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
  • J. Lu et al., Hierarchical question-image co-attention for visual question answering, Proceedings of Advances in Neural Information Processing Systems (2016)
  • P. Anderson et al., Bottom-up and top-down attention for image captioning and visual question answering, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
  • Y. Liu et al., Adversarial learning with multi-modal attention for visual question answering, IEEE Transactions on Neural Networks and Learning Systems (2020)
  • K. Xu et al., Show, attend and tell: neural image caption generation with visual attention, Proceedings of the International Conference on Machine Learning (2015)
  • A. Karpathy et al., Deep visual-semantic alignments for generating image descriptions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
  • O. Vinyals et al., Show and tell: a neural image caption generator, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)

    Yun Liu received the B.Sc. degree in computer science and technology from Sichuan University, Chengdu, China, in 2015, and the Ph.D. degree in computer science and technology from Beihang University, Beijing, China, in 2022. He is currently an associate professor with the department of automation, Moutai Institute, Renhuai, China. His research interests include social media analysis, multimodal data analysis, and data mining.

    Xiaoming Zhang received the B.Sc. and the M.Sc. degrees from the National University of Defence Technology, Changsha, China, in 2003 and 2007, respectively, and the Ph.D. degree in computer science from Beihang University, Beijing, China, in 2012. He is currently with the School of Cyber Science and Technology, Beihang University, where he has been an associate professor. He has published over 40 papers, such as TOIS, TMM, TIP, TCYB, WWWJ, Signal Processing, ACM MM, AAAI, IJCAI, CIKM, ICMR, SDM, and EMNLP. His current research interests include social media analysis and text mining.

    Feiran Huang received the B.Sc. degree from Central South University, Changsha, China, in 2011, and the Ph.D. degree in the School of Computer Science and Engineering, Beihang University, Beijing, China, in 2018. He is currently with the College of Cyber Security/College of Information Science and Technology, Jinan University, where he has been a lecturer since 2018. He has published over 20 papers, such as TIP, TCYB, TOMM, TII, ACM MM, ICMR, and CIKM. His research interests include social media analysis, multimodal data analysis, and data mining.

    Zhoujun Li received the M.Sc. and Ph.D. degrees in computer science from the National University of Defence Technology, Changsha, China, in 1984 and 1999, respectively. He is currently with the School of Computer Science and Engineering, Beihang University, Beijing, China, where he has been a professor since 2001. He has published over 150 papers on international journals such as TKDE, TIP, TMM, TCYB, TOIS, WWWJ, and Information Science, and international conferences such as SIGKDD, ACL, SIGIR, AAAI, IJCAI, MM, CIKM, EMNLP, SDM, and WSDM. His current research interests include data mining, information retrieval, and database. Dr. Li was a PC Member of several international conferences, such as SDM 2015, CIKM 2013, WAIM 2012, and PRICAI 2012.
