Information Sciences

Volume 281, 10 October 2014, Pages 781-796
Video abstraction based on fMRI-driven visual attention model

https://doi.org/10.1016/j.ins.2013.12.039

Abstract

The explosive growth of digital video data poses a profound challenge to succinct, informative, and human-centric representations of video contents. This quickly evolving research topic is typically called ‘video abstraction’. We are motivated by the facts that the human brain is the end-evaluator of multimedia content and that the brain’s responses can quantitatively reveal its attentional engagement in the comprehension of video. We propose a novel video abstraction paradigm which leverages functional magnetic resonance imaging (fMRI) to monitor and quantify the brain’s responses to video stimuli. These responses are used to guide the extraction of visually informative segments from videos. Specifically, the most relevant brain regions involved in video perception and cognition are identified to form brain networks. Then, the propensity for synchronization (PFS) derived from spectral graph theory is utilized over the brain networks to yield the benchmark attention curves based on the fMRI-measured brain responses to a number of training video streams. These benchmark attention curves are applied to guide and optimize the combinations of a variety of low-level visual features created by the Bayesian surprise model. In particular, in the training stage, the optimization objective is to ensure that the learned attentional model correlates well with the brain’s responses and reflects the attention that viewers pay to video contents. In the application stage, the attention curves predicted by the learned and optimized attentional model serve as an effective benchmark to abstract testing videos. Evaluations on a set of video sequences from the TRECVID database demonstrate the effectiveness of the proposed framework.

Introduction

Video abstraction or video summarization aims at providing users with a succinct and informative overview of the contents of a full-length video by identifying the most important information while removing redundant segments. Its core issue is how to assign importance levels to different video segments. Most pioneering approaches have focused on describing video shots using low-level visual features such as color, texture, and motion, and then selecting key frames or segments according to specific criteria defined for each video shot. In spite of their simplicity and wide usage, this class of methods tends to neglect human perception and thus has difficulty in bridging the semantic gap between low-level features and the human’s perceived responses. Accordingly, over roughly the last 10 years, research interests have evolved from “computer-centric” to “human-centric”. For instance, a milestone work [35] fused a number of human attention-related features including contrast, motion, face, sound energy, and text into an overall attention curve along the temporal axis. The intention is to mimic the human attention mechanism, and it has achieved exciting results. However, because the human attention mechanism has not been explicitly integrated into the loop, and, more importantly, because a quantitative, reliable, and faithful representation of human attention that can be used to construct, optimize, integrate, and evaluate different attention cues is still lacking, the work in [35] and its follow-up variants [9], [37], [61] had to compromise by utilizing heuristic and suboptimal fusion schemes, which limits their capability.

Essentially, the human brain should be the unique end-evaluator of multimedia contents. It has been observed that the brain’s response to video streams varies depending on the video content. Therefore, it is natural to connect human brain responses with the attractiveness of a video segment. In other words, a quantitative modeling and analysis of human brain signals during video watching can provide meaningful and informative guidelines for estimating the attractiveness of a video segment. Recent advances in fMRI brain imaging technology enable us to acquire reliable and quantitative signals reflecting the full-length dynamics of the brain in perception and cognition of video streams. It is worth noting that the traditional way of studying user experience might be useful for understanding such brain cognition in some sense, but it is qualitative, suboptimal, and subjective. In particular, user experience modeling has fundamental limitations in capturing the full-length dynamics of the brain’s response. As a powerful tool, fMRI can probe and monitor the human brain’s cognition [20], [32], [41]. For example, the milestone study in [20] discovered that the contents of a movie clip are highly correlated with temporal fMRI signals in relevant brain regions of interest (ROIs). Experiments on different human subjects revealed that when they watched the same movie stimulus, their brains’ responses, measured by fMRI signals in relevant brain regions, were similar. These observations offer strong evidence that fMRI time series data can be used to model the dynamics and interactions between the human brain and multimedia streams, which is the underlying premise of the proposed work.
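
To make this premise concrete, the inter-subject similarity reported in [20] can be illustrated with a small sketch. This is not the study’s actual analysis pipeline; the function name and the synthetic data below are illustrative assumptions. Given one ROI time series per subject, all recorded under the same movie, the mean pairwise Pearson correlation across subjects quantifies how similarly their brains responded:

```python
import numpy as np

def intersubject_correlation(signals):
    """Mean pairwise Pearson correlation between subjects' ROI time series.

    signals: array of shape (n_subjects, n_timepoints), one ROI time series
    per subject, all recorded under the same movie stimulus.
    """
    n = signals.shape[0]
    r = np.corrcoef(signals)                  # subject-by-subject correlations
    return r[np.triu_indices(n, k=1)].mean()  # average the off-diagonal part

# Synthetic subjects driven by the same underlying stimulus plus private noise.
rng = np.random.default_rng(0)
stimulus = np.sin(np.linspace(0, 8 * np.pi, 200))
subjects = np.stack([stimulus + 0.3 * rng.standard_normal(200) for _ in range(4)])
print(intersubject_correlation(subjects))
```

With a strong shared stimulus component, the inter-subject correlation is high; for unrelated signals it would hover near zero, which is exactly the contrast that makes fMRI responses usable as an attention benchmark.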

In this paper, we intend to combine the two fields of brain imaging and bottom-up visual attention modeling to build an effective video abstraction framework via fMRI brain imaging under the natural stimulus of video watching. As illustrated in Fig. 1, the proposed video abstraction framework mainly consists of four components: fMRI-derived attention prediction, bottom-up visual attention computation, supervised model optimization, and key-frame extraction based on the optimized model. In the first component, we present an experimental paradigm that applies state-of-the-art task-based fMRI (T-fMRI) brain imaging techniques to identify major brain regions that are involved in visual information perception and cognition. In this experiment, human subjects’ brains are scanned by an fMRI scanner while they watch video sequences [53]. Strict synchronization between media playing and fMRI scanning is achieved using the E-prime software (http://www.pstnet.com/), so that the time series fMRI signals and video features are in temporal alignment. Then, we identify the relevant brain regions via T-fMRI, which is considered the benchmark approach for functional localization in the human brain [10], [32]. In our work, for each of the 4 subjects, 30 ROIs involved in the vision, working memory, and motor networks are identified. The fMRI time series signals extracted from these ROIs represent the human brain’s response to the video stimuli. Afterwards, PFS derived from spectral graph theory is applied to quantify the network synchronization on the functional network constructed by those 30 ROIs, since network synchronization is a good indicator of the functional interaction among brain networks and thus of the human brain’s attentional engagement. The PFS is utilized to generate benchmark attention curves, given a number of training video sequences and their fMRI temporal signals.
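
The paper’s exact PFS formulation is developed later; as an illustrative sketch only, one standard propensity-for-synchronization index from spectral graph theory is the eigenratio λ2/λN of the graph Laplacian of the functional network. The construction below, where edge weights are absolute Pearson correlations between ROI time series, is an assumption for illustration, not necessarily the paper’s network definition:

```python
import numpy as np

def synchronizability(roi_signals):
    """Eigenratio lambda_2 / lambda_N of the weighted graph Laplacian.

    roi_signals: (n_rois, n_timepoints) fMRI time series, one row per ROI.
    Edge weights are absolute Pearson correlations between ROI signals; a
    ratio closer to 1 indicates a higher propensity to synchronize.
    """
    w = np.abs(np.corrcoef(roi_signals))
    np.fill_diagonal(w, 0.0)                  # no self-loops
    laplacian = np.diag(w.sum(axis=1)) - w    # weighted graph Laplacian
    eig = np.sort(np.linalg.eigvalsh(laplacian))
    return eig[1] / eig[-1]                   # algebraic connectivity / largest

# A tightly coupled network (shared driving signal) versus an incoherent one.
rng = np.random.default_rng(1)
shared = rng.standard_normal(120)
coherent = np.stack([shared + 0.1 * rng.standard_normal(120) for _ in range(6)])
independent = rng.standard_normal((6, 120))
print(synchronizability(coherent), synchronizability(independent))
```

Strongly co-activated ROIs yield a near-complete graph whose eigenratio approaches 1, while weakly correlated ROIs give a lower value, which is the intuition behind reading synchronization as attentional engagement.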

In the second component of our framework, a state-of-the-art approach called the Bayesian surprise model [3] is exploited to compute the bottom-up visual attention features. The Rényi entropy, which characterizes the spatial distribution of visual saliency, is employed to generate the attention curve.
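
As an illustrative sketch (the order α = 2 and the toy maps are assumptions, not the paper’s settings), the Rényi entropy of a normalized saliency map is Hα = (1 − α)⁻¹ log Σᵢ pᵢᵅ; a frame with one compact salient region scores low (focused attention), while a flat map scores high (no clear focus):

```python
import numpy as np

def renyi_entropy(saliency, alpha=2.0, eps=1e-12):
    """Renyi entropy of the spatial distribution of a 2-D saliency map."""
    p = saliency.ravel().astype(float)
    p = p / (p.sum() + eps)                # normalise to a probability distribution
    return np.log((p ** alpha).sum() + eps) / (1.0 - alpha)

flat = np.ones((8, 8))                     # no attentional focus -> high entropy
peaked = np.zeros((8, 8))
peaked[4, 4] = 1.0                         # one compact salient spot -> low entropy
print(renyi_entropy(flat), renyi_entropy(peaked))
```

For the uniform 8×8 map the entropy is log 64 ≈ 4.16, while the single-spot map scores essentially 0, so lower values of this curve mark frames with a well-defined attentional focus.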

Since fMRI scanning is quite expensive and time-consuming, it is impractical to acquire fMRI data for every video clip. In the third component, we use the benchmark attention curves derived from the brain responses to guide the optimization of a bottom-up visual attention model based on “cheap” low-level features by using Gaussian process regression (GPR) [4]. The optimized attention model is expected to maximize its correlation with the brain’s responses, and thus is superior to previous attention models. Given the testing videos, the final component extracts key frames to generate the abstraction based on the optimized attention model.
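
The regression step can be sketched with a minimal closed-form GPR implementation. The RBF kernel, length scale, noise level, and toy data below are assumptions chosen for illustration, not the settings of [4] or of this paper:

```python
import numpy as np

def rbf(a, b, length=0.2):
    """Squared-exponential kernel between the rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length ** 2)

def gpr_predict(x_train, y_train, x_test, noise=1e-3):
    """Posterior mean of standard Gaussian process regression.

    x_train: per-frame low-level feature vectors (frames x features);
    y_train: fMRI-derived benchmark attention values for those frames.
    """
    k = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    alpha = np.linalg.solve(k, y_train)
    return rbf(x_test, x_train) @ alpha

# Toy 1-D "feature" regressed onto a smooth synthetic attention curve.
x = np.linspace(0, 1, 20)[:, None]
y = np.sin(2 * np.pi * x[:, 0])
pred = gpr_predict(x, y, x)
print(np.abs(pred - y).max())
```

Once trained, the same posterior-mean machinery maps the low-level features of unseen test videos to predicted attention values, which is why fMRI scanning is only needed for the small training set.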

This paper significantly extends our earlier work [29] and makes two major contributions. (1) We propose a novel experimental paradigm that uses fMRI techniques to quantitatively measure the brain’s comprehension of video stimuli and then infer attentional engagement. The computation of PFS over the functional brain networks, where each node denotes the temporal fMRI signals from an ROI, indicates attentional engagement and naturally yields the benchmark attention curve from a small number of video stimuli. (2) In comparison with our earlier work [29], this paper re-designs and implements a computational framework for optimizing bottom-up visual attention models under the guidance of fMRI-derived benchmark attention curves, as illustrated in Fig. 1. This new framework not only bridges the brain imaging field and low-level visual content analysis, but also lowers the cost of fMRI scanning. It enables the optimization and integration of low-level visual attention cues into better fMRI-driven visual attention models that correlate well with the human brain’s attentional engagement.

The rest of the paper is organized as follows. Section 2 reviews related works. Section 3 describes fMRI-measured attention prediction. Section 4 introduces the fMRI-driven visual attention model for video abstraction, which optimizes low-level visual attention cues under the guidance of fMRI-derived benchmark attention curves. Experimental results are provided in Section 5. Finally, conclusions are drawn in Section 6.

Section snippets

Video abstraction

A comprehensive survey of the extensive studies of video abstraction can be found in [55]. In this section, we briefly review only the most relevant works. Early efforts at abstracting videos employed key frames to represent the dominant contents of videos, extracting them by detecting abrupt changes or by clustering low-level visual features, e.g., color histograms [63]. In [30], Li et al. formulated the problem of video summarization as a rate-distortion optimization problem. A frame …

T-fMRI for ROI mapping

fMRI leverages the coupling between neuronal activity and hemodynamics in the brain to obtain non-invasive measurements of brain activity. The block-based fMRI paradigm, as illustrated in Fig. 2, is widely used to map the functional brain regions engaged in certain brain functions. In the block-based paradigm, the temporal axis is divided into baseline and stimulus intervals. The baseline interval is typically blank, without any input signals presented, while in the stimulus interval, stimuli are …
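
A minimal sketch of how such a block design is analyzed (the double-gamma HRF parameters, TR, block length, and simulated voxel below are conventional assumptions, not this study’s protocol): the alternating baseline/stimulus boxcar is convolved with a haemodynamic response function, and a voxel is mapped as task-engaged if its time series loads on that regressor in a general linear model:

```python
import numpy as np
from scipy.stats import gamma

def boxcar(n_scans, block_len):
    """0/1 stimulus regressor: alternating baseline/stimulus blocks."""
    return ((np.arange(n_scans) // block_len) % 2).astype(float)

def hrf(tr=2.0, duration=32.0):
    """Double-gamma haemodynamic response function sampled every tr seconds."""
    t = np.arange(0, duration, tr)
    return gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0

n_scans, block_len = 60, 10
design = np.convolve(boxcar(n_scans, block_len), hrf())[:n_scans]

# GLM: a voxel counts as task-engaged if its series loads on the regressor.
rng = np.random.default_rng(2)
voxel = 2.0 * design + 0.5 * rng.standard_normal(n_scans)
x = np.column_stack([design, np.ones(n_scans)])   # regressor + intercept
beta, *_ = np.linalg.lstsq(x, voxel, rcond=None)
print(beta[0])
```

Thresholding the fitted loadings (in practice, their statistical maps) across voxels is what yields the functionally localized ROIs used in this framework.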

fMRI-driven visual attention model for video abstraction

Essentially, the benchmark attention curve reflects the brain’s attentional engagement in the comprehension of video. However, fMRI scanning is generally very expensive and time-consuming. Fortunately, a large number of low-level attentive visual features can easily be obtained via computational algorithms. As a result, this paper proposes to learn an fMRI-driven visual attention model with the underlying idea of optimizing the low-level feature combination under the guidance of a small number of …
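
Downstream of the learned model, turning a predicted attention curve into key frames can be sketched as greedy non-maximum suppression over the curve. The function, its parameters, and the toy curve below are hypothetical illustrations, not the paper’s extraction rule:

```python
import numpy as np

def key_frames(attention, k=2, min_gap=3):
    """Greedy non-maximum suppression over an attention curve.

    Picks the k highest-attention frames while keeping selected frames at
    least min_gap frames apart, so nearby peaks are not chosen twice.
    """
    order = np.argsort(attention)[::-1]       # frame indices, best first
    chosen = []
    for i in order:
        if all(abs(int(i) - j) >= min_gap for j in chosen):
            chosen.append(int(i))
        if len(chosen) == k:
            break
    return sorted(chosen)

curve = np.array([0.1, 0.2, 0.9, 0.3, 0.1, 0.1, 0.8, 0.2, 0.1, 0.05])
print(key_frames(curve, k=2, min_gap=3))  # -> [2, 6]
```

The minimum-gap constraint plays the role of temporal diversity: without it, several frames from a single attention peak would dominate the abstraction.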

Dataset and experimental paradigm

As recommended by [55], the publicly available TRECVID dataset is an appropriate test bed for evaluating video abstraction since it is large, diverse, and contains full-length video streams. Therefore, we constructed our experiments based on TRECVID 2005 video data. As reported in [42], TRECVID 2005 video data can be categorized into 7 concepts including politics, finance/business, science/technology, sports, entertainment, weather report, and …

Conclusions

In this paper, we have proposed an fMRI-driven visual attention model and its application to video abstraction. The novelty lies in leveraging human brain responses measured by fMRI to construct an objective benchmark criterion for learning and optimizing the visual attention model and, ultimately, achieving human-centric video abstraction. A number of major brain ROIs involved in video perception and cognition were identified to form a brain network. PFS derived …

Acknowledgements

We thank Alistair Sutherland, Tuo Zhang, Dajiang Zhu, Hanbo Chen, Xi Jiang, Fan Deng, C. Faraco, Degang Zhang, and Xian-Sheng Hua for collecting fMRI data and giving valuable suggestions.

References (63)

  • L. Shao et al., Specific object retrieval based on salient regions, Pattern Recogn. (2006)
  • L. Shao et al., Geometric and photometric invariant distinctive regions detection, Inform. Sci. (2007)
  • M. Tkalcic et al., The impact of weak ground truth and facial expressiveness on affect detection accuracy from time-continuous videos of facial expressions, Inform. Sci. (2013)
  • H.J. Zhang et al., An integrated system for content-based video retrieval and browsing, Pattern Recogn. (1997)
  • A. Baddeley, Working Memory, Thought, and Action (2007)
  • L. Bo et al., Twin Gaussian processes for structured prediction, Int. J. Comput. Vision (2010)
  • B.W. Chen et al., A novel video summarization based on mining the story-structure and semantic relations among concept entities, IEEE Trans. Multimedia (2009)
  • F.R.K. Chung, Spectral Graph Theory (1997)
  • Y. Dudai, Enslaving central executives: toward a brain theory of cinema, Projections (2008)
  • G. Evangelopoulos, A. Zlatintsi, G. Skoumas, K. Rapantzikos, A. Potamianos, P. Maragos, Y. Avrithis, Video event...
  • K.J. Friston, Modalities, modes, and models in functional neuroimaging, Science (2009)
  • K.J. Friston et al., Statistical parametric maps in functional imaging: a general linear approach, Hum. Brain Mapp. (1994)
  • N. Guan et al., Online non-negative matrix factorization with robust stochastic approximation, IEEE Trans. Neural Networks Learn. Syst. (2012)
  • N. Guan et al., NeNMF: an optimal gradient method for non-negative matrix factorization, IEEE Trans. Signal Process. (2012)
  • J. Han, Object segmentation from consumer videos: a unified framework based on visual attention, IEEE Trans. Consum. Electron. (2009)
  • J. Han et al., Broadcast court-net sports video analysis using fast 3-D camera modeling, IEEE Trans. Circuits Syst. Video Technol. (2008)
  • J. Han et al., Representing and retrieving video shots in human-centric brain imaging space, IEEE Trans. Image Process. (2013)
  • J. Han et al., Unsupervised extraction of visual attention objects in color images, IEEE Trans. Circuits Syst. Video Technol. (2006)
  • U. Hasson et al., Intersubject synchronization of cortical activity during natural vision, Science (2004)
  • X. Hu, L. Guo, D. Zhang, K. Li, T. Zhang, J. Lv, J. Han, T. Liu, Assessing the dynamics on functional brain networks...
  • X. Hu et al., Bridging the semantic gap via functional brain imaging, IEEE Trans. Multimedia (2012)
The research was supported by the NIH Career Award EB 006878, NIH R01DA033393, NSFC 61005018, 91120005, 61103061, 61333017, NPU-FFR-JC20120237, and NCET-10-0079.