Elsevier

Neurocomputing

Volume 266, 29 November 2017, Pages 284-292

Learning visual saliency from human fixations for stereoscopic images

https://doi.org/10.1016/j.neucom.2017.05.050

Abstract

In previous years, many saliency detection algorithms have been designed for the saliency computation of visual content. Recently, stereoscopic display techniques have developed rapidly, creating a strong demand for stereoscopic saliency detection in emerging stereoscopic applications. Different from 2D saliency prediction, stereoscopic saliency detection methods have to take the depth factor into account. We design a novel stereoscopic saliency detection algorithm based on machine learning techniques. First, luminance, color, and texture features are extracted to calculate the feature contrast used to predict feature maps of stereoscopic images. Furthermore, depth features are extracted for depth feature map computation. Semantic features, including the center-bias factor and other top-down cues, are also used as features in the proposed stereoscopic saliency detection method. Support Vector Regression (SVR) is applied to learn the saliency detection model of stereoscopic images. Experimental results obtained on a public large-scale eye-tracking database demonstrate that the proposed method predicts saliency for stereoscopic images better than other existing methods.

Introduction

Human eyes can process visual information effectively and efficiently due to selective attention in the Human Visual System (HVS) [39]. When observers view visual scenes, they pay attention to only part of the visual information because of these limited resources. In previous years, many researchers have investigated visual attention modeling in the areas of vision science, neuroscience, computer vision, etc. [1]. In the HVS, there are two different approaches to visual attention: bottom-up and top-down. Bottom-up visual attention, a task-independent and data-driven approach, is a perceptual process of automatic salient region detection in the HVS [16], [18], [58], [65], while top-down visual attention is a task-dependent perceptual process influenced by tasks, target features, etc. [5], [21], [33], [53].

Previously, many saliency detection models have been designed for different multimedia processing applications, including retargeting [6], coding [12], [40], quality assessment [30], [33], recognition [13], retrieval [46], tracking [34], segmentation [25], etc. These existing saliency detection models are mainly built for 2D multimedia content. In recent years, stereoscopic display techniques have developed rapidly, and many multimedia processing applications have been developed for stereoscopic visual content, including 3D video coding [47], 3D visual quality evaluation [17], [48], etc. With these 3D multimedia processing applications, visual attention models of 3D multimedia content are highly desired for salient region extraction. Different from 2D saliency detection models, 3D visual attention models must take depth into account. At present, several studies have investigated visual attention modeling for 3D multimedia content [4], [42], [60], [64]. These studies mainly compute the saliency map of 3D images by simply fusing a 2D saliency map with a depth feature map.

Recently, some studies have used machine learning to design visual attention models and obtained promising performance in saliency prediction [22], [45], [67]. Inspired by these studies, we present a new stereoscopic saliency detection model based on machine learning techniques. It is well accepted that feature contrast from different low-level features, including color, intensity, motion and so on, can be used to detect salient regions in images [6], [18], [24], [33]. Here, we compute the feature contrast of luminance, color, and texture as part of the features in the proposed method. Other features, including pyramid subbands, color histograms, and the horizon line, are also included. In addition, 3D depth features are extracted to account for stereoscopic perception in visual attention. Some studies claim that depth contrast is important for attracting human attention [62], while other studies demonstrate that depth degree is another depth factor that attracts human attention [19], [59]. In this study, both depth contrast and depth degree are used as input features in the proposed method.
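The excerpt does not spell out how feature contrast is computed. As a minimal sketch of a generic center-surround contrast operator on a single feature channel (the box-filter formulation and the window sizes are our assumptions, not the paper's exact definition):

```python
import numpy as np

def center_surround_contrast(feature_map, center=3, surround=9):
    """Center-surround contrast: difference between a small (center) and a
    large (surround) local mean of one feature channel."""
    def box_mean(img, k):
        # Local mean via an integral image, with edge padding at the borders.
        pad = k // 2
        padded = np.pad(img, pad, mode="edge")
        c = np.cumsum(np.cumsum(padded, axis=0), axis=1)
        c = np.pad(c, ((1, 0), (1, 0)))  # zero row/col so window sums index cleanly
        h, w = img.shape
        return (c[k:k + h, k:k + w] - c[:h, k:k + w]
                - c[k:k + h, :w] + c[:h, :w]) / (k * k)
    return np.abs(box_mean(feature_map, center) - box_mean(feature_map, surround))
```

A uniform feature map yields zero contrast everywhere, while an isolated bright pixel produces a positive response at its location, which matches the intuition that contrast, not absolute feature value, drives bottom-up saliency.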

It is well known that human attention is also attracted by high-level features from the top-down mechanism, such as the performed task, feature distributions of specific objects, etc. [5], [33], [53]. Here, we use a face detector and a human detector to extract face and human features, respectively, to simulate the top-down mechanism in the proposed method. Of course, other specific features of the top-down mechanism can also be included in the input feature vector. Furthermore, studies have demonstrated that a center bias exists when humans view visual scenes [52], [55]. Here, the center bias is also adopted as a feature to design the stereoscopic saliency detection model.
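The center-bias feature is commonly realized as a Gaussian map peaking at the image center. A minimal sketch (the anisotropic Gaussian form and the `sigma_ratio` parameter are our assumptions; the paper's exact parameterization is not given in this excerpt):

```python
import numpy as np

def center_bias_map(h, w, sigma_ratio=0.25):
    """Separable Gaussian centered on the image, peak value 1 at the center;
    sigma is proportional to each image dimension."""
    ys = np.arange(h) - (h - 1) / 2.0
    xs = np.arange(w) - (w - 1) / 2.0
    gy = np.exp(-(ys ** 2) / (2 * (sigma_ratio * h) ** 2))
    gx = np.exp(-(xs ** 2) / (2 * (sigma_ratio * w) ** 2))
    return np.outer(gy, gx)
```

Such a map can be appended to the per-pixel feature vector alongside the contrast, depth, and semantic features.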

With the computed feature vector, we train a saliency prediction model by Support Vector Regression (SVR) to distinguish salient image pixels from non-salient ones, based on eye-tracking data from 3D images. Most existing related studies calculate the stereoscopic saliency map either by using depth to weight 2D saliency [64] or by simply combining a 2D saliency map with a depth feature map [23], [60]. Compared with these existing studies, we adopt machine learning to build a stereoscopic visual attention model from the extracted feature vector, which includes low-level features, 3D features, and semantic features. A comparison experiment on a public eye-tracking database demonstrates that the proposed learning-based stereoscopic visual attention model predicts saliency better than other related methods.
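The training step can be sketched as fitting an SVR on per-pixel feature vectors against fixation-derived targets. The data below are a synthetic stand-in (the feature dimensionality, kernel, and hyperparameters are our assumptions; scikit-learn is assumed available and is not named by the paper):

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical stand-in data: each row of X is the feature vector of one
# sampled pixel (contrast, depth, and semantic features); y is a surrogate
# for the fixation density at that pixel from eye-tracking data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))           # 500 sampled pixels, 10 features
w_true = rng.normal(size=10)
y = 1.0 / (1.0 + np.exp(-X @ w_true))    # fixation-density surrogate in (0, 1)

model = SVR(kernel="rbf", C=1.0, epsilon=0.01)
model.fit(X, y)
pred = model.predict(X)                  # learned saliency score per pixel
```

At test time the fitted model is applied to every pixel's feature vector, and the resulting scores are reshaped into a saliency map.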

Section snippets

Related work

Itti et al. built a classical visual attention model based on the neuronal architecture of the primates' early visual system [18]. Multi-scale feature contrast is computed for saliency prediction based on color, intensity, and orientation features [18]. Later, Harel et al. introduced a visual attention model based on a dissimilarity computation method [16]. Bruce and Tsotsos used information maximization to design a saliency detection model in [2]. That model calculates the saliency map…

Feature extraction

We start introducing the proposed method with feature extraction (Fig. 1). According to Feature Integration Theory (FIT) [54], salient regions in visual scenes attract human attention due to feature contrast. Feature contrast represents the center-surround differences of features, which have been widely used to design computational models of visual attention. Some studies have shown that 3D features are important in visual attention [19], [60]. The 3D feature denotes the additional…
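The introduction names two depth cues, depth contrast and depth degree. A minimal sketch of how the two might be combined into a depth feature map (the equal weights, window size, and the convention that smaller depth means nearer are our assumptions, not the paper's definitions):

```python
import numpy as np

def depth_feature_map(depth, w_contrast=0.5, w_degree=0.5, k=7):
    """Combine depth contrast (deviation from the local neighborhood mean)
    with depth degree (closeness to the viewer; smaller depth = nearer)."""
    pad = k // 2
    padded = np.pad(depth, pad, mode="edge")
    h, w = depth.shape
    local_mean = np.empty((h, w), dtype=float)
    for i in range(h):                    # simple sliding window; small maps only
        for j in range(w):
            local_mean[i, j] = padded[i:i + k, j:j + k].mean()
    contrast = np.abs(depth - local_mean)
    contrast = contrast / (contrast.max() + 1e-8)       # normalize to [0, 1]
    degree = 1.0 - (depth - depth.min()) / (np.ptp(depth) + 1e-8)
    return w_contrast * contrast + w_degree * degree
```

A pixel that is both closer to the viewer and different in depth from its surroundings scores highest, which reflects the two cues cited in the introduction [19], [59], [62].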

Experimental results

In this section, we conduct comparison experiments to evaluate the performance of the proposed method. We first introduce the performance evaluation methodology and the quantitative evaluation metrics. Then the performance of the proposed method is evaluated by comparison with other existing methods.
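The excerpt does not name the quantitative metrics. One metric commonly used in saliency evaluation is the linear correlation coefficient (CC) between the predicted saliency map and the ground-truth fixation density map; whether this paper uses it is not stated here. A minimal sketch:

```python
import numpy as np

def cc(saliency, fixation_density):
    """Linear correlation coefficient between a predicted saliency map and
    a ground-truth fixation density map; 1 = perfect linear agreement."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    f = (fixation_density - fixation_density.mean()) / (fixation_density.std() + 1e-8)
    return float((s * f).mean())
```

Identical maps score close to 1, and a sign-flipped prediction scores close to -1, so CC rewards predictions whose spatial distribution matches the fixation data.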

Conclusion

We design a novel learning-based stereoscopic saliency detection method built on a set of features including low-level features, 3D features, and semantic features. SVR is adopted to learn the saliency prediction model. Compared with traditional 2D images, stereoscopic images carry one additional dimension: depth. We integrate 3D depth features into the proposed model and analyze their influence on the proposed saliency detection model. Our experimental…

Acknowledgments

This work was partially funded by the Natural Science Foundation of China under grant 61571212, and by the Natural Science Foundation of Jiangxi Province, China, under grants GJJ160420, 20161ACB21014, and 20151BDH80003.

References (67)

  • L. Wang et al.

    Deep networks for saliency detection via local estimation and global search

    IEEE International Conference on Computer Vision and Pattern Recognition

    (2015)
  • A. Borji et al.

    State-of-the-art in visual attention modeling

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)
  • N.D. Bruce et al.

    Saliency based on information maximization

    Adv. Neural Inf. Process. Syst.

    (2006)
  • M. Cerf et al.

    Face and text attract gaze independent of the task: experimental data and computer model

    J. Vis.

    (2009)
  • A. Ciptadi et al.

    An in depth view of saliency

    British Machine Vision Conference (BMVC)

    (2013)
  • Y. Fang et al.

    A visual attention model combining top-down and bottom-up mechanisms for salient object detection

    IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

    (2011)
  • Y. Fang et al.

    Saliency detection in the compressed domain for adaptive image retargeting

    IEEE Trans. Image Process.

    (2012)
  • Y. Fang et al.

    Saliency detection for stereoscopic images

    IEEE Trans. Image Process.

    (2014)
  • Y. Fang et al.

    Video saliency incorporating spatiotemporal cues and uncertainty weighting

    IEEE Trans. Image Process.

    (2014)
  • P. Felzenszwalb et al.

    A discriminatively trained, multiscale, deformable part model

    IEEE International Conference on Computer Vision and Pattern Recognition

    (2008)
  • K. Fukunaga

    Introduction to Statistical Pattern Recognition

    (1990)
  • S. Goferman et al.

    Context-aware saliency detection

    IEEE International Conference on Computer Vision and Pattern Recognition

    (2010)
  • C. Guo et al.

    A novel multi-resolution spatiotemporal saliency detection model and its applications in image and video compression

    IEEE Trans. Image Process.

    (2010)
  • S. Han et al.

    Biologically plausible saliency mechanisms improve feedforward object recognition

    Vis. Res.

    (2010)
  • J. Harel et al.

    Graph-based visual saliency

    Adv. Neural Inf. Process. Syst.

    (2006)
  • Q. Huynh-Thu et al.

    The importance of visual attention in improving the 3DTV viewing experience: overview and new perspectives

    IEEE Trans. Broadcast.

    (2011)
  • L. Itti et al.

    A model of saliency-based visual attention for rapid scene analysis

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1998)
  • L. Jansen et al.

    Influence of disparity on fixation and saccades in free viewing of natural scenes

    J. Vis.

    (2009)
  • C. Jia et al.

    Saliency detection via a unified generative and discriminative model

    Neurocomputing

    (2016)
  • C. Lang et al.

    Depth matters: influence of depth cues on visual saliency

    European Conference on Computer Vision

    (2012)
  • J. Lei et al.

    A universal framework for salient object detection

    IEEE Trans. Multimedia

    (2016)
  • J. Lei et al.

    Evaluation and modeling of depth feature incorporated visual attention for salient object segmentation

    Neurocomputing

    (2013)
  • G. Li et al.

    Visual saliency based on multiscale deep features

    IEEE International Conference on Computer Vision and Pattern Recognition

    (2015)

    Yuming Fang received his Ph.D. degree from Nanyang Technological University, Singapore, in 2013, his M.S. degree from Beijing University of Technology, China, in 2009, and his B.E. degree from Sichuan University, China, in 2006. Currently, he is a professor in the School of Information Technology, Jiangxi University of Finance and Economics, Nanchang, China. His research interests include visual attention modeling, visual quality assessment, image retargeting, computer vision, 3D image/video processing, etc. He has served as an associate editor for the journal IEEE Access. He has been a co-chair, session chair, or track chair for many international academic conferences in previous years.

    Jianjun Lei received the Ph.D. degree in signal and information processing from Beijing University of Posts and Telecommunications, Beijing, China, in 2007. He was a visiting researcher at the Department of Electrical Engineering, University of Washington, Seattle, WA, from August 2012 to August 2013. He is currently a professor with the School of Electronic Information Engineering, Tianjin University, Tianjin, China. His research interests include 3D video processing, 3D display, and computer vision.

    Jia Li is currently an associate professor with the State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China. He received his B.E. degree from Tsinghua University in 2005 and Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences in 2011. His research interests include computer vision and image/video processing.

    Long Xu received his M.S. degree in applied mathematics from Xidian University, Xi'an, China, in 2002, and his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He was a Postdoc with the Department of Computer Science, City University of Hong Kong, and the Department of Electronic Engineering, Chinese University of Hong Kong, from 2009 to December 2012. From January 2013 to March 2014, he was a Postdoc with the School of Computer Engineering, Nanyang Technological University, Singapore. Currently, he is with the Key Laboratory of Solar Activity, National Astronomical Observatories, Chinese Academy of Sciences. His research interests include image/video processing, solar radio astronomy, wavelets, machine learning, and computer vision. He was selected into the 100-Talents Plan of the Chinese Academy of Sciences in 2014.

    Weisi Lin (M'92-SM'98-F'16) received his Ph.D. from King's College London, U.K. He served as the Lab Head of Visual Processing at the Institute for Infocomm Research, Singapore. Currently, he is an associate professor in the School of Computer Engineering. His technical expertise includes perceptual modeling and evaluation of multimedia signals, image processing, and video compression, in which he has published 160 journal papers and 230 conference papers, filed 7 patents, authored 2 books, edited 3 books, and written 9 book chapters. He is an AE for IEEE Transactions on Image Processing, IEEE Transactions on Circuits and Systems for Video Technology, and the Journal of Visual Communication and Image Representation, and a past AE for IEEE Transactions on Multimedia and IEEE Signal Processing Letters. He served as a guest editor for 7 special issues of different scholarly journals. He has been a Technical Program Chair for IEEE ICME 2013, PCM 2012, QoMEX 2014, and VCIP 2017. He chaired the IEEE MMTC Special Interest Group on QoE (2012–2014). He has been a keynote/invited/panelist/tutorial speaker at 20+ international conferences, as well as a Distinguished Lecturer of the IEEE Circuits and Systems Society (2016–2017) and of the Asia-Pacific Signal and Information Processing Association (APSIPA) (2012–2013). He is a Chartered Engineer, a Fellow of the IEEE and IET, and an Honorary Fellow of the Singapore Institute of Engineering Technologists.

    Patrick Le Callet received both an M.Sc. and a Ph.D. degree in image processing from Ecole polytechnique de l'Université de Nantes. He was also a student at the Ecole Normale Supérieure de Cachan, where he sat the "Agrégation" (credentialing exam) in electronics of the French National Education system. He worked as an assistant professor from 1997 to 1999 and as a full-time lecturer from 1999 to 2003 in the Department of Electrical Engineering of the Technical Institute of the University of Nantes (IUT). Since 2003, he has taught at Ecole polytechnique de l'Université de Nantes (Engineering School) in the Electrical Engineering and Computer Science departments, where he is now a Full Professor. Since 2006, he has been the head of the Image and Video Communication lab at CNRS IRCCyN, a group of more than 35 researchers. He is mostly engaged in research on the application of human vision modeling to image and video processing. His current interests are 3D image and video quality assessment, watermarking techniques, and visual attention modeling and applications. He is co-author of more than 200 publications and communications and co-inventor of 13 international patents on these topics. He also co-chairs the "Joint-Effort Group" and "3DTV" activities within the VQEG (Video Quality Expert Group). He is currently serving as an associate editor for IEEE Transactions on Circuits and Systems for Video Technology, the Springer EURASIP Journal on Image and Video Processing, and SPIE Electronic Imaging.
