Video abstraction based on fMRI-driven visual attention model☆
Introduction
Video abstraction or video summarization aims to provide users with a succinct and informative overview of the content of a full-length video by identifying the most important information and removing redundant segments. Its core issue is how to assign importance levels to different video segments. Most pioneering approaches have focused on describing video shots with low-level visual features such as color, texture, and motion, and then selecting key frames or segments according to criteria defined for each shot. Despite their simplicity and wide usage, this class of methods tends to neglect human perception and thus has difficulty bridging the semantic gap between low-level features and humans' perceived responses. Accordingly, over roughly the past decade, research interest has shifted from "computer-centric" to "human-centric" methods. For instance, a milestone work [35] fused a number of human attention-related features, including contrast, motion, face, sound energy, and text, into an overall attention curve along the temporal axis. This approach mimics the human attention mechanism and achieved promising results. However, because the human attention mechanism has not been explicitly integrated into the loop, and, more importantly, because there is still no quantitative, reliable, and faithful representation of human attention that can be used to construct, optimize, integrate, and evaluate different attention cues, the work in [35] and its follow-up variants [9], [37], [61] had to fall back on heuristic and suboptimal fusion schemes, which limits their capability.
Essentially, the human brain is the ultimate end-evaluator of multimedia content. The brain's response to a video stream varies with the video content, so it is natural to connect human brain responses with the attractiveness of a video segment. In other words, quantitative modeling and analysis of human brain signals during video watching can provide meaningful and informative guidelines for estimating a segment's attractiveness. Recent advances in fMRI brain imaging technology enable us to acquire reliable, quantitative signals reflecting the full-length dynamics of the brain during perception and cognition of video streams. It is worth noting that the traditional way of studying user experience might be useful for understanding such brain cognition to some extent, but it is qualitative, suboptimal, and subjective. In particular, user experience modeling has fundamental limitations in capturing the full-length dynamics of the brain's response. As a powerful tool, fMRI can probe and monitor the human brain's cognition [20], [32], [41]. For example, the milestone study in [20] discovered that the contents of a movie clip are highly correlated with temporal fMRI signals in relevant brain regions of interest (ROIs). Experiments on different human subjects revealed that when they watched the same movie stimulus, their brains' responses, measured by fMRI signals in relevant brain regions, were similar. These observations offer strong evidence that fMRI time series data can be used to model the dynamics and interactions between the human brain and multimedia streams, which is the underlying premise of the proposed work.
In this paper, we combine the two fields of brain imaging and bottom-up visual attention modeling to build an effective video abstraction framework via fMRI brain imaging under the natural stimulus of video watching. As illustrated in Fig. 1, the proposed framework consists of four components: fMRI-derived attention prediction, bottom-up visual attention computation, supervised model optimization, and key-frame extraction based on the optimized model. In the first component, we present an experimental paradigm that applies state-of-the-art task-based fMRI (T-fMRI) brain imaging techniques to identify the major brain regions involved in visual information perception and cognition. In this experiment, human subjects' brains are scanned by an fMRI scanner while they watch video sequences [53]. Strict synchronization between media playback and fMRI scanning is achieved using the E-prime software (http://www.pstnet.com/), so that the time series fMRI signals and video features are temporally aligned. We then identify the relevant brain regions via T-fMRI, which is considered the benchmark approach for functional localization in the human brain [10], [32]. In our work, for each of the 4 subjects, 30 ROIs involved in the vision, working memory, and motor networks are identified. The fMRI time series signals extracted from these ROIs represent the human brain's response to the video stimuli. Afterwards, PFS derived from spectral graph theory is applied to quantify synchronization on the functional network constructed from these 30 ROIs, since network synchronization is a good indicator of the functional interaction among brain networks and thus of the human brain's attention engagement. The PFS is used to generate benchmark attention curves, given a number of training video sequences and their fMRI temporal signals.
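The PFS computation itself is detailed later in the paper, but the general spectral-graph idea behind synchronization measures can be sketched briefly. A common synchronizability index of a network is the eigenvalue ratio λ2/λN of its graph Laplacian. The sketch below is a minimal illustration of that general idea, assuming the functional network is built from thresholded pairwise correlations of ROI time series in a sliding window; the window length, correlation threshold, and the exact PFS definition used in this paper are assumptions for illustration only.

```python
import numpy as np

def laplacian_sync(adj):
    """Synchronizability of a graph: the ratio lambda_2 / lambda_N of its
    Laplacian eigenvalues. Values closer to 1 indicate a network whose
    nodes synchronize more readily."""
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj
    eig = np.sort(np.linalg.eigvalsh(lap))   # ascending eigenvalues
    return eig[1] / eig[-1] if eig[-1] > 0 else 0.0

def attention_curve(roi_signals, win=10, thresh=0.3):
    """Sliding-window synchronization curve from ROI fMRI time series.

    roi_signals: (n_rois, n_timepoints) array of BOLD signals.
    Returns one synchronization value per window position.
    """
    n_rois, n_t = roi_signals.shape
    curve = []
    for start in range(n_t - win + 1):
        window = roi_signals[:, start:start + win]
        corr = np.corrcoef(window)                   # functional connectivity
        adj = (np.abs(corr) > thresh).astype(float)  # threshold to adjacency
        np.fill_diagonal(adj, 0.0)                   # no self-loops
        curve.append(laplacian_sync(adj))
    return np.array(curve)
```

Applied to the 30-ROI signals, such a curve rises when the vision, working memory, and motor networks fire in a coordinated fashion, which is the intuition behind using network synchronization as an attention indicator.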
In the second component of our framework, a state-of-the-art approach, the Bayesian surprise model [3], is exploited to compute bottom-up visual attention features. The Rényi entropy, which characterizes the spatial distribution of visual saliency, is employed to generate the attention curve.
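As a concrete illustration of this step, the snippet below computes the Rényi entropy of a saliency map after normalizing it to a probability distribution: a low entropy means saliency is concentrated at a few locations (a strong attentional focus), while a high entropy means it is spread out. This is a minimal sketch; the order α, the normalization, and the toy maps are assumptions, and in practice the saliency map would come from the Bayesian surprise model.

```python
import numpy as np

def renyi_entropy(saliency, alpha=2.0, eps=1e-12):
    """Renyi entropy of order alpha of a saliency map, treated as a
    probability distribution over spatial locations."""
    p = saliency.astype(float).ravel()
    p = p / max(p.sum(), eps)                 # normalize to a distribution
    if abs(alpha - 1.0) < 1e-9:               # limit alpha -> 1 is Shannon entropy
        return float(-np.sum(p * np.log(p + eps)))
    return float(np.log(np.sum(p ** alpha) + eps) / (1.0 - alpha))

# A peaked map (one salient spot) versus a uniform map:
peaked = np.zeros((8, 8)); peaked[4, 4] = 1.0
uniform = np.ones((8, 8))
```

Evaluating the entropy per frame and negating (or inverting) it then yields a scalar attention value for each frame of the video.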
Since fMRI scanning is expensive and time-consuming, it is impractical to acquire fMRI data for every video clip. In the third component, we therefore use the benchmark attention curves derived from the brain responses to guide the optimization of a bottom-up visual attention model based on "cheap" low-level features, using Gaussian process regression (GPR) [4]. The optimized attention model is expected to maximize the correlation with the brain's responses and is thus superior to previous attention models. Given test videos, the final component extracts key frames based on the optimized attention model to generate the abstraction.
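The general shape of this third step can be sketched with scikit-learn's GaussianProcessRegressor: per-frame low-level feature vectors are regressed against fMRI-derived benchmark attention values, and the fitted model then predicts an attention curve (with an uncertainty band) for unseen frames. The feature set, kernel choice, and synthetic training data below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical training data: per-frame low-level attention features
# (e.g., columns for surprise, motion energy, face score) paired with the
# fMRI-derived benchmark attention value for the same frames.
rng = np.random.default_rng(0)
X_train = rng.random((50, 3))
y_train = 0.6 * X_train[:, 0] + 0.3 * X_train[:, 1] + 0.1 * rng.standard_normal(50)

# RBF kernel for smooth feature-to-attention mapping plus a white-noise
# term to absorb fMRI measurement noise.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X_train, y_train)

# Predict the attention curve for unseen frames: the posterior mean is the
# fused attention value, the standard deviation a confidence band.
X_test = rng.random((10, 3))
attn_mean, attn_std = gpr.predict(X_test, return_std=True)
```

The appeal of GPR here is that it learns a nonlinear fusion of the low-level cues from only the small number of videos for which fMRI benchmarks exist, while its predictive variance flags frames where the model is unsure.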
This paper significantly extends our earlier work [29] and makes two major contributions. (1) We propose a novel experimental paradigm that quantitatively measures the brain's comprehension of video stimuli and then infers attention engagement using fMRI techniques. The computation of PFS over the functional brain networks, in which each node denotes the temporal fMRI signals from an ROI, indicates attention engagement and naturally yields benchmark attention curves from a small number of video stimuli. (2) In comparison with our earlier work [29], this paper re-designs and implements a computational framework for optimizing bottom-up visual attention models under the guidance of fMRI-derived benchmark attention curves, as illustrated in Fig. 1. This framework not only bridges brain imaging and low-level visual content analysis but also lowers the cost of fMRI scanning, enabling the optimization and integration of low-level visual attention cues into fMRI-driven visual attention models that correlate well with the human brain's attention engagement.
The rest of the paper is organized as follows. Section 2 reviews related works. Section 3 describes fMRI-measured attention prediction. Section 4 introduces the fMRI-driven visual attention model for video abstraction, which optimizes low-level visual attention cues under the guidance of fMRI-derived benchmark attention curves. Experimental results are provided in Section 5. Finally, conclusions are drawn in Section 6.
Video abstraction
A comprehensive survey of video abstraction can be found in [55]; in this section, we briefly review only the most relevant works. Early efforts abstract videos by employing key frames to represent their dominant content, extracting those key frames by detecting abrupt changes or by clustering low-level visual features, e.g., color histograms [63]. In [30], Li et al. formulated video summarization as a rate-distortion optimization problem. A frame
T-fMRI for ROI mapping
fMRI leverages the coupling between neuronal activity and hemodynamics in the brain to obtain a non-invasive measurement of brain activity. The block-based fMRI paradigm, as illustrated in Fig. 2, is widely used to map the functional brain regions engaged in particular brain functions. In the block-based paradigm, the temporal axis is divided into baseline and stimulus intervals. The baseline interval is typically blank, with no input signals presented, while in the stimulus interval, stimuli are
fMRI-driven visual attention model for video abstraction
Essentially, the benchmark attention curve reflects the brain's attentional engagement in the comprehension of video. However, fMRI scanning is generally very expensive and time-consuming. Fortunately, a large number of low-level attentive visual features can be obtained easily via computational algorithms. This paper therefore proposes to learn an fMRI-driven visual attention model whose underlying idea is to optimize the low-level feature combination under the guidance of a small number of
Dataset and experimental paradigm
As recommended by [55], the publicly available TRECVID video dataset is an appropriate test bed for evaluating video abstraction, since it is large, diverse, and contains full-length video streams. We therefore conducted our experiments on the TRECVID 2005 video data. As reported in [42], the TRECVID 2005 videos can be categorized into 7 concepts: politics, finance/business, science/technology, sports, entertainment, weather report, and
Conclusions
In this paper, we have proposed an fMRI-driven visual attention model and its application to video abstraction. The novelty is that it leverages human brain responses measured by fMRI to construct an objective benchmark criterion for learning and optimizing the visual attention model, eventually achieving human-centric video abstraction. A number of major brain ROIs involved in video perception and cognition were identified to form a brain network. PFS derived
Acknowledgements
We thank Alistair Sutherland, Tuo Zhang, Dajiang Zhu, Hanbo Chen, Xi Jiang, Fan Deng, C. Faraco, Degang Zhang, and Xian-Sheng Hua for collecting fMRI data and giving valuable suggestions.
References (63)
- et al., Applying evolution strategies to preprocessing EEG signals for brain–computer interfaces, Inform. Sci. (2012)
- et al., Of bits and wows: a Bayesian theory of surprise with applications to attention, Neural Networks (2010)
- et al., Complex networks: structure and dynamics, Phys. Rep. (2006)
- et al., Fast saliency-aware multi-modality image fusion, Neurocomputing (2013)
- et al., Reliability of cortical activity during natural stimulation, Trends Cogn. Sci. (2010)
- et al., State dependent properties of epileptic brain networks: comparative graph-theoretical analyses of simultaneously recorded EEG and MEG, Clin. Neurophysiol. (2010)
- et al., Content-based retrieval of human actions from realistic video databases, Inform. Sci. (2013)
- et al., Using non-negative matrix factorization for single-trial analysis of fMRI data, NeuroImage (2007)
- et al., Exploiting pairwise recommendation and clustering strategies for image re-ranking, Inform. Sci. (2012)
- et al., Invariant salient regions based image retrieval under viewpoint and illumination variations, J. Vis. Commun. Image Represent. (2006)
- Specific object retrieval based on salient regions, Pattern Recogn.
- Geometric and photometric invariant distinctive regions detection, Inform. Sci.
- The impact of weak ground truth and facial expressiveness on affect detection accuracy from time-continuous videos of facial expressions, Inform. Sci.
- An integrated system for content-based video retrieval and browsing, Pattern Recogn.
- Working Memory, Thought, and Action
- Twin Gaussian processes for structured prediction, Int. J. Comput. Vision
- A novel video summarization based on mining the story-structure and semantic relations among concept entities, IEEE Trans. Multimedia
- Spectral Graph Theory
- Enslaving central executives: toward a brain theory of cinema, Projections
- Modalities, modes, and models in functional neuroimaging, Science
- Statistical parametric maps in functional imaging: a general linear approach, Hum. Brain Mapp.
- Online non-negative matrix factorization with robust stochastic approximation, IEEE Trans. Neural Networks Learn. Syst.
- NeNMF: an optimal gradient method for non-negative matrix factorization, IEEE Trans. Signal Process.
- Object segmentation from consumer videos: a unified framework based on visual attention, IEEE Trans. Consum. Electron.
- Broadcast court-net sports video analysis using fast 3-D camera modeling, IEEE Trans. Circuits Syst. Video Technol.
- Representing and retrieving video shots in human-centric brain imaging space, IEEE Trans. Image Process.
- Unsupervised extraction of visual attention objects in color images, IEEE Trans. Circuits Syst. Video Technol.
- Intersubject synchronization of cortical activity during natural vision, Science
- Bridging the semantic gap via functional brain imaging, IEEE Trans. Multimedia
☆ The research was supported by the NIH Career Award EB 006878, NIH R01 DA033393, NSFC 61005018, 91120005, 61103061, 61333017, NPU-FFR-JC20120237, and NCET-10-0079.