Neurocomputing, Volume 266, 29 November 2017, Pages 165-175

Spatiotemporal salient object detection based on distance transform and energy optimization

https://doi.org/10.1016/j.neucom.2017.05.032

Abstract

In this paper, we present a novel spatiotemporal salient object detection method that produces high-quality saliency maps. The gradient of optical flow is adopted to coarsely locate the boundaries of the salient object, and the gray-weighted distance transform is adopted to highlight the whole salient object in a temporal saliency map. Furthermore, a confidence-guided energy function is proposed to adaptively fuse the spatial and temporal saliency maps. Based on these efforts, our method achieves good performance in complex scenes involving cluttered backgrounds and non-rigid deformation. Experimental results on two benchmark datasets demonstrate the effectiveness of the proposed saliency method.

Introduction

Visual saliency modeling has been widely regarded as an important way to automatically extract important regions from images and videos. During the past few decades, salient object detection has attracted a lot of attention due to its wide applications in image segmentation [1], quality assessment [2], object recognition [3], image classification [4], image re-targeting [5], scene classification [6], image retrieval [7], etc.

The goal of salient object detection for images is to detect and segment the most salient objects from a still scene. Many saliency models have been proposed to accomplish this goal, and they are generally categorized into two classes: bottom-up models and top-down models [8]. Most are bottom-up models, which are data-driven and closely related to the intrinsic mechanisms of the human visual system (HVS); their saliency is usually measured with different low-level features (e.g., edge, gradient, intensity, and color). Top-down models, in contrast, are task-driven and require image understanding; their saliency is usually measured by cognitive factors such as expectations and the current task.

Although image saliency detection has achieved great success in recent years, research on spatiotemporal saliency detection still has a long way to go. An intuitive approach is to apply image saliency models to each video frame independently. However, performance degrades greatly because this neglects the motion that attracts human attention. For video saliency estimation, temporal information plays an important role alongside spatial information, and motion is its most basic form. Most existing video saliency models [9], [10], [11] originate from image saliency models but treat motion as an additional feature and adopt the center-surround mechanism to compute the temporal saliency map.

However, it is still challenging for existing models to deal with videos containing rich textures or complex motions. When the textures in a video are rich, it is difficult to distinguish the foreground from the background through color contrast. When the motion is complex, it is difficult to obtain the true motion vectors through motion estimation. In these cases, an intuitive solution is to adaptively fuse the spatial and temporal saliency maps, making full use of the advantages of each. Currently, the fusion problem is mainly addressed by additive or multiplicative operations over the spatial and temporal saliency maps, whose results are far from satisfactory [12], [13].

To address the abovementioned open problems, we propose a robust spatiotemporal salient object detection algorithm that detects salient objects under complex backgrounds and motions. The gradient of optical flow is used to coarsely locate the contours of foreground objects, followed by a gray-weighted distance transform that converts the contours into a temporal saliency map; a code sketch of this step is given after the contribution list below. A confidence-guided energy function is then adopted to jointly fuse spatial and temporal saliency. Compared to existing approaches, the main contributions of this work are three-fold:

  • (i) By using the gradient of optical flow and the gray-weighted distance transform, we derive a novel temporal saliency detection algorithm.

  • (ii) A confidence-guided energy function is proposed to adaptively fuse spatial and temporal saliency maps.

  • (iii) A novel spatiotemporal saliency detection algorithm is designed. The proposed scheme is superior to state-of-the-art saliency algorithms on two well-known benchmark datasets.
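
The snippets available here do not spell out how the two temporal ingredients compose, but they combine naturally: the flow-gradient magnitude serves as a cost image, and a gray-weighted distance transform seeded at the frame border assigns high accumulated cost, i.e. high saliency, to pixels enclosed by strong motion contours. The following minimal Python sketch illustrates this reading; the Farneback flow, the border seeding, and all function names are our assumptions rather than details confirmed by the paper.

    import cv2
    import numpy as np
    from skimage.graph import MCP_Geometric

    def temporal_saliency(prev_gray, curr_gray):
        """Sketch: flow-gradient contour + gray-weighted distance transform.
        Inputs are consecutive 8-bit grayscale frames."""
        # Dense optical flow (Farneback is a stand-in; the paper does not
        # specify its flow algorithm in this snippet).
        flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Gradient magnitude of the flow field, large on moving-object contours.
        gy_u, gx_u = np.gradient(flow[..., 0])
        gy_v, gx_v = np.gradient(flow[..., 1])
        grad_mag = np.sqrt(gx_u**2 + gy_u**2 + gx_v**2 + gy_v**2)

        # Gray-weighted distance transform: minimal accumulated flow-gradient
        # cost from the frame border to each pixel. Pixels enclosed by a
        # high-gradient contour accumulate a large cost.
        h, w = grad_mag.shape
        border = [(r, c) for r in range(h) for c in (0, w - 1)]
        border += [(r, c) for r in (0, h - 1) for c in range(w)]
        mcp = MCP_Geometric(grad_mag + 1e-6)   # strictly positive costs
        dist, _ = mcp.find_costs(border)

        # Normalize to [0, 1] as the temporal saliency map.
        return (dist - dist.min()) / (dist.max() - dist.min() + 1e-12)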

This paper is organized as follows. Section 2 surveys recently published spatial and temporal saliency models. Section 3 presents the details of the proposed algorithm: the feature extraction process is described in Section 3.1, the temporal saliency measurement in Section 3.2, and the joint energy optimization based on spatial saliency, temporal saliency, and confidence in Section 3.3. The experimental results are reported in Section 4, and Section 5 concludes the paper.


Related work

A number of saliency detection algorithms have been proposed in recent years. In this section, we give a brief overview of some mainstream saliency models. According to the domain of the information exploited for saliency detection, the models are classified into three basic categories: spatial saliency, temporal saliency, and spatiotemporal saliency.

Spatial saliency, which is also called image saliency, is where most principles and ideas for saliency detection originated. The most important

The proposed framework

In this paper, a novel spatiotemporal saliency model based on Gray-weighted distance Transform and Energy Optimization (GTEO) is proposed. The framework of the proposed algorithm is shown in Fig. 1. It is composed of three key components, i.e., feature extraction, temporal saliency map derivation, and spatiotemporal saliency map generation, marked with purple, red, and blue dash-dotted lines, respectively. Input images are divided into superpixels and motion vectors are
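
The snippet truncates before defining the confidence-guided energy. One standard quadratic form matching the description, confidence-weighted data terms that pull each superpixel toward its spatial and temporal saliency values plus a smoothness term over the superpixel adjacency graph, has a closed-form minimizer obtained from a sparse linear system. The sketch below illustrates that generic form; the specific energy, weights, and names are our assumptions, not the paper's exact formulation.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import spsolve

    def fuse_saliency(s_sp, s_tmp, c_sp, c_tmp, edges, w, lam=0.5):
        """Minimize sum_i [c_sp_i (s_i - s_sp_i)^2 + c_tmp_i (s_i - s_tmp_i)^2]
        + lam * sum_{(i,j)} w_ij (s_i - s_j)^2 over n superpixels.
        edges: (m, 2) int array of neighbor pairs; w: (m,) edge weights."""
        n = len(s_sp)
        # Graph Laplacian of the superpixel adjacency graph.
        W = sp.coo_matrix((w, (edges[:, 0], edges[:, 1])), shape=(n, n))
        W = W + W.T
        L = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W
        # Quadratic energy -> linear system (D + lam * L) s = b, D diagonal.
        D = sp.diags(c_sp + c_tmp)
        b = c_sp * s_sp + c_tmp * s_tmp
        s = spsolve((D + lam * L).tocsr(), b)
        return np.clip(s, 0.0, 1.0)

Because the energy is quadratic, the minimizer is unique whenever some confidence is positive, which makes such formulations attractive for adaptive map fusion.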

Experimental results

In this section, we demonstrate the effectiveness of the proposed saliency detection algorithm both qualitatively and quantitatively on two benchmark datasets, VS [52] and SegTrack [53]. VS contains 10 videos at a resolution of 352 × 288 whose motions are relatively simple. SegTrack contains 6 videos with resolutions from 320 × 240 to 414 × 352 whose motions are relatively complex. Both datasets provide ground-truth pixel-level labels for the salient objects in each video sequence. SegTrack
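
The snippet ends before naming the quantitative measures. For benchmarks with pixel-level ground truth such as VS and SegTrack, mean absolute error (MAE) and the F-measure at an adaptive threshold are the customary choices; the sketch below implements these common metrics as an assumed protocol, not one taken from the paper.

    import numpy as np

    def mae(sal, gt):
        """Mean absolute error between a saliency map and a binary mask,
        both scaled to [0, 1]."""
        return np.abs(sal.astype(np.float64) - gt.astype(np.float64)).mean()

    def f_measure(sal, gt, beta2=0.3):
        """F-measure at the adaptive threshold of twice the mean saliency,
        a common convention in salient object detection papers."""
        pred = sal >= min(2.0 * sal.mean(), 1.0)
        tp = np.logical_and(pred, gt > 0.5).sum()
        precision = tp / max(pred.sum(), 1)
        recall = tp / max((gt > 0.5).sum(), 1)
        if precision + recall == 0:
            return 0.0
        return (1 + beta2) * precision * recall / (beta2 * precision + recall)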

Conclusion

In this paper, we present a novel video salient object detection algorithm based on the gray-weighted distance transform and confidence-guided energy optimization. The gradient of optical flow efficiently provides a coarse localization of the salient object, and the gray-weighted distance transform maps the gradient of optical flow into a temporal saliency map. Furthermore, the proposed confidence-guided energy function can adaptively fuse spatial and temporal saliency maps. Based on these efforts, the

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (61521062, 61301116, 61133009, 61420106008), the Chinese National Key S&T Special Program (2013ZX01033001-002-002), the 111 Project (B07022), and the Shanghai Key Laboratory of Digital Media Processing and Transmissions (STCSM 12DZ2272600).


References (55)

  • Y. Zhang et al., Sketch-based image retrieval by salient contour reinforcement, IEEE Trans. Multim. (2016)
  • A. Borji et al., State-of-the-art in visual attention modeling, IEEE Trans. Pattern Anal. Mach. Intell. (2013)
  • L. Itti, Bayesian surprise attracts human attention, Advances in Neural Information Processing Systems (NIPS) (2005)
  • H.J. Seo et al., Static and space-time visual saliency detection by self-resemblance, J. Vis. (2009)
  • Z. Liu et al., Superpixel-based spatiotemporal saliency detection, IEEE Trans. Circuits Syst. Video Technol. (2014)
  • C. Chamaret et al., Spatio-temporal combination of saliency maps and eye-tracking assessment of different strategies, IEEE International Conference on Image Processing (ICIP) (2010)
  • S.M. Muddamsetty et al., A performance evaluation of fusion techniques for spatio-temporal saliency detection in dynamic scenes, IEEE International Conference on Image Processing (ICIP) (2013)
  • L. Itti et al., A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. (1998)
  • T. Liu et al., Learning to detect a salient object, IEEE Trans. Pattern Anal. Mach. Intell. (2011)
  • Y. Wei et al., Geodesic saliency using background priors, European Conference on Computer Vision (ECCV) (2012)
  • J. Han et al., Background prior-based salient object detection via deep reconstruction residual, IEEE Trans. Circuits Syst. Video Technol. (2015)
  • H. Jiang et al., Salient object detection: a discriminative regional feature integration approach, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2013)
  • B. Yang et al., Edge guided salient object detection, Neurocomputing (2017)
  • L. Wixson, Detecting salient motion by accumulating directionally-consistent flow, IEEE Trans. Pattern Anal. Mach. Intell. (2000)
  • Y.-F. Ma et al., A model of motion attention for video skimming, IEEE International Conference on Image Processing (ICIP) (2002)
  • X. Cui et al., Temporal spectral residual for fast salient motion detection, Neurocomputing (2012)
  • V. Gopalakrishnan et al., A linear dynamical system framework for salient motion detection, IEEE Trans. Circuits Syst. Video Technol. (2012)

    Bing Yang received the B.E. degree from Wuhan University, Wuhan, China, in 2011. He is now pursuing the Ph.D. degree at the Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China. His research interests include image and video processing, video coding, and parallel implementation on embedded systems.

    Xiaoyun Zhang received her B.S. and M.S. degrees in Applied Mathematics from Xi’an Jiao Tong University in 1998 and 2001, respectively, and the Ph.D. degree in pattern recognition from Shanghai Jiao Tong University, China, in 2004. Her Ph.D. thesis was nominated among the “National 100 Best Ph.D. Theses of China”. Her research interests include computer vision and pattern recognition, image and video processing, and digital TV systems, with current focus on image super-resolution, image post-processing, video compression, and implementation on many-core/GPU platforms.

    Li Chen received his B.S. and M.S. degrees from Northwestern Polytechnical University, Xi’an, China, and the Ph.D. degree in 2006 from Shanghai Jiao Tong University, China, all in electrical engineering. His research interests include image and video processing, and DSP and VLSI for image and video processing. Under grants from NSFC, he has worked on image completion and inpainting, video frame rate conversion, and image deshaking and deblurring. He now mainly focuses on VLSI for image and video processing.

    Zhiyong Gao received his B.S. and M.S. degrees in electrical engineering from Changsha Institute of Technology (CIT), China, in 1981 and 1984, respectively, and the Ph.D. degree from Tsinghua University, China, in 1989. From 1994 to 2010 he held several senior technical positions at companies in England, including principal engineer at Snell & Wilcox (1995–2000), video architect at 3DLabs (2000–2001), consultant engineer at the Sony European Semiconductor Design Center (2001–2004), and digital video architect at Imagination Technologies (2004–2010). Since 2010 he has been a professor at Shanghai Jiao Tong University, China. His research interests include video processing and its implementation, video coding, digital TV, and broadcasting.
