Information Sciences, Volume 281, 10 October 2014, Pages 573-585

Low-level and high-level prior learning for visual saliency estimation

https://doi.org/10.1016/j.ins.2013.09.036

Abstract

Visual saliency estimation is an important problem in multimedia modeling and computer vision, and constitutes a research field that has been studied for decades. Many approaches have been proposed to solve this problem. In this study, we consider the visual attention problem with respect to two aspects: low-level prior learning and high-level prior learning. On the one hand, inspired by the concept of chance of happening, the low-level priors, i.e., Color Statistics-based Priors (CSP) and Spatial Correlation-based Priors (SCP), are learned to describe the color distribution and contrast distribution in natural images. On the other hand, the high-level priors, i.e., the relative relationships between objects, are learned to describe the conditional priority between different objects in images. In particular, we first learn the low-level priors statistically from a large set of natural images. Then, the high-level priors are learned to construct a conditional probability matrix that reflects the relative relationships between different objects. Subsequently, a saliency model is presented by integrating the low-level priors, the high-level priors and the Center Bias Prior (CBP), in which the weights corresponding to the low-level and high-level priors are learned from an eye-tracking data set. The experimental results demonstrate that our approach outperforms existing techniques.

Introduction

The surrounding environment contains a tremendous amount of visual information, which the human visual system (HVS) cannot fully process [24]. Therefore, the HVS tends to attend to only a few parts of a scene while neglecting the others. This phenomenon is usually referred to by psychologists as visual attention. To automatically predict where people look in an image, visual attention analysis has been investigated for decades in the computer vision field; however, it remains an open problem. Recently, understanding computer vision problems from the viewpoint of psychology has become an important research track. Because visual attention has been studied for more than a century in psychology, it is reasonable to adopt useful concepts from that field to address visual attention analysis in multimedia modeling [10], [17], [29], image retrieval [21], [23], [30] and computer vision [9], [22].

Existing visual attention methods can be broadly divided into three groups according to their driving conditions: information-driven methods, low-level feature-driven methods and hybrid feature-driven methods.

The information-driven methods [2] approach visual attention from a signal processing perspective. Hou and Zhang [11] analyze the log spectrum of each image to obtain its spectral residual, which is transformed back to the spatial domain to yield a saliency map. Bruce and Tsotsos [1], [2] argue that salient regions carry more information than other regions, and propose a method called "Attention based on Information Maximization (AIM)" that maximizes the self-information in the image; it performs marginally better than earlier models. Zhang et al. [36] further use spatiotemporal visual features to generalize the static image saliency model to dynamic scenes, again employing self-information to measure informativeness.
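The spectral residual pipeline described above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the authors' implementation: the 3×3 wrap-around local average and the final smoothing choices are assumptions.

```python
import numpy as np

def _box3(a):
    # 3x3 local mean (wrap-around edges; adequate for this sketch)
    return sum(np.roll(np.roll(a, i, axis=0), j, axis=1)
               for i in (-1, 0, 1) for j in (-1, 0, 1)) / 9.0

def spectral_residual_saliency(gray):
    """Saliency in the spirit of Hou and Zhang [11]: the spectral
    residual is the log-amplitude spectrum minus its local average;
    combined with the original phase, it is transformed back to the
    spatial domain to give the saliency map."""
    f = np.fft.fft2(gray)
    log_amp = np.log(np.abs(f) + 1e-8)
    phase = np.angle(f)
    residual = log_amp - _box3(log_amp)          # spectral residual
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    sal = _box3(sal)                             # mild spatial smoothing
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)
```

The subtraction of the averaged log spectrum suppresses the statistically regular part of the spectrum, so what survives the inverse transform corresponds to the "unexpected" (salient) image content.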

The low-level feature-driven methods compute the saliency map from contrasts over a set of low-level features, such as color, intensity and orientation, extracted from the original image at different scales and orientations. These methods perform well on some natural scenes and synthetic data. Itti et al. [14] compute the saliency value using a center-surround filter to capture spatial discontinuities. Le Meur et al. present a method that computes the saliency map by fusing several low-level features (intensity, color, orientation). Oliva and Torralba [20] find that the shape of the scene is also an important factor in human perception, and introduce the spatial envelope to describe scene shape in visual attention analysis. However, for natural scenes with complex scenarios, low-level feature-driven methods cannot correctly predict where humans look. Fig. 1(b) is the saliency map generated by Itti et al. [14] from color, intensity and orientation features. Fig. 1(c) is the saliency map obtained by Oliva and Torralba [20] based on the spatial envelope. The real eye-tracking data is given in Fig. 1(e). Noticeably, there is a large gap between these saliency maps and the real eye-tracking data.
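The center-surround idea can be illustrated with a toy multi-scale sketch. This is a deliberate simplification of the model of Itti et al. [14]: it uses an average-pooling pyramid on a single intensity channel, whereas the real model also uses color and orientation channels with Gaussian and Gabor pyramids.

```python
import numpy as np

def _downsample(img):
    # 2x2 average pooling (crop to even dimensions first)
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    a = img[:h, :w]
    return (a[0::2, 0::2] + a[1::2, 0::2] + a[0::2, 1::2] + a[1::2, 1::2]) / 4.0

def center_surround_saliency(gray, levels=4):
    """Toy center-surround map: the 'center' is the fine-scale image,
    the 'surround' a coarser pyramid level; their absolute difference
    highlights spatial discontinuities, summed over scales."""
    pyramid = [gray]
    for _ in range(levels):
        pyramid.append(_downsample(pyramid[-1]))
    h, w = gray.shape
    sal = np.zeros((h, w))
    for coarse in pyramid[1:]:
        fy = -(-h // coarse.shape[0])   # ceil division
        fx = -(-w // coarse.shape[1])
        up = np.kron(coarse, np.ones((fy, fx)))[:h, :w]  # nearest upsample
        sal += np.abs(gray - up)
    m = sal.max()
    return sal / m if m > 0 else sal
```

Regions that differ strongly from their coarse-scale surround (e.g., a bright patch on a dark background) accumulate large differences across levels and therefore score high.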

The hybrid feature-driven methods account for not only low-level features but also high-level features, such as faces, humans and other objects [4], [7], [15], to obtain better results; they are also referred to as concept-driven methods. Cerf et al. [4] add face detection to the low-level feature-driven model [14] and significantly improve the saliency map's accuracy. Judd et al. [15] further expand the hybrid model to include not only high-level features but also mid-level features (the horizon line). They then train an SVM classifier on an eye-tracking data set to learn the weights of the different features for saliency map construction. Fig. 1(d) shows that this achieves better results than the earlier methods [14], [20]. However, because this method ignores the inter-relationships among different high-level features (objects), the salient areas of the map do not match the eye-tracking data very well.

Apart from the above three groups, other models, such as Bayesian models [12], [32], efficient coding [25] and multiview learning [28], [31], [33], [34], offer alternative perspectives on the topic.

Our proposed technique is a hybrid feature-driven method. In contrast to previous hybrid feature-driven models, our approach performs both low-level prior learning and high-level prior learning for visual saliency estimation. In the low-level prior learning part, the concept of "Chance of Happening (CoH)" is introduced when deducing the low-level saliency value. Additionally, two low-level priors, i.e., Color Statistics-based Priors (CSP) and Spatial Correlation-based Priors (SCP), are learned to describe the color distribution and contrast distribution in natural images; they are used to compute the CoH value and, in turn, the low-level saliency value. In the high-level prior learning part, the relative relationships are learned to describe the conditional priority between different objects in images, which is used to compute the high-level saliency value. Afterward, a new saliency model is presented by integrating the low-level saliency, the high-level saliency and the Center Bias Prior (CBP), in which the weights corresponding to the low-level and high-level parts are learned from an eye-tracking data set.
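The final integration step described above amounts to a weighted linear combination of the three component maps. A minimal sketch, assuming a Gaussian center bias and a least-squares fit of the weights against a fixation density map (one plausible learning scheme; the paper's exact training procedure is not reproduced here, and `sigma` is an assumption):

```python
import numpy as np

def center_bias_prior(h, w, sigma=0.3):
    """Gaussian center bias over normalized image coordinates
    (one common way to model the CBP)."""
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = ((ys - (h - 1) / 2) / h) ** 2 + ((xs - (w - 1) / 2) / w) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def learn_weights(components, fixation_map):
    """Least-squares fit of per-component weights against a
    ground-truth fixation density map."""
    X = np.stack([c.ravel() for c in components], axis=1)
    w, *_ = np.linalg.lstsq(X, fixation_map.ravel(), rcond=None)
    return w

def fuse_saliency(s_low, s_high, s_cbp, weights):
    # Weighted linear combination of the three maps, rescaled to [0, 1].
    w_l, w_h, w_c = weights
    s = w_l * s_low + w_h * s_high + w_c * s_cbp
    return (s - s.min()) / (s.max() - s.min() + 1e-8)
```

A linear fusion keeps each component's contribution interpretable: the learned weights directly report how much the eye-tracking data is explained by the low-level priors, the high-level priors and the center bias, respectively.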

The major contributions of this paper include: (1) a novel hybrid feature-driven model that performs both low-level prior learning and high-level prior learning for visual saliency estimation; (2) the concept of "Chance of Happening" introduced for low-level prior learning; and (3) relative relationships defined to describe the conditional priority between different objects in images.

The rest of this paper is organized as follows. We discuss the motivation of the proposed approach in Section 2. Section 3 describes our proposed visual saliency estimation, which accounts for the low-level saliency, the high-level saliency and the center bias prior. Experimental results and analysis are given in Section 4. We finally conclude in Section 5.

Section snippets

Motivation of the proposed method

It is known that visual stimuli are the main reason that the HVS stays active and ready to drive eye movements, which leads to the visual attention mechanism. According to psychological research [13], visual stimuli can be divided into two types based on the reaction time of the visual neurons. One type is independent of any specific task and is processed very rapidly, at 25–50 ms per item. An image's color, intensity and contrast belong to this type of stimulus; it

The proposed visual saliency estimation approach

To enhance the readability of this paper, Table 1 lists the important notations used in this paper.

Experimental results and analysis

For visual object recognition, a face detector [26], a person and car detector [6] and a word detector [27] are used to extract the high-level information in our approach. In the experiments, both qualitative and quantitative analyses are conducted to validate the effectiveness of our method. Because all of the parameters used in the proposed model are learned from a large number of natural color images, both the qualitative and the quantitative analysis demonstrate that the proposed model

Conclusion

In this paper, we consider the visual attention problem with respect to two aspects: the low-level and the high-level prior. On the one hand, the low-level priors, i.e., Color Statistics-based Priors (CSP) and Spatial Correlation-based Priors (SCP), are learned to describe the color distribution and the contrast distribution in natural images. On the other hand, the high-level prior, i.e., the relative relationship between the objects, is learned to describe the conditional priority

References (36)

  • N. Guan et al.

    NeNMF: an optimal gradient method for nonnegative matrix factorization

    IEEE Transactions on Signal Processing

    (2012)
  • R. Hong et al.

    Video accessibility enhancement for hearing impaired users

    ACM Transactions on Multimedia Computing, Communications, and Applications

    (2011)
  • R. Hong et al.

    Multimedia question answering

    IEEE MultiMedia

    (2012)
  • X. Hou et al.

    Saliency detection: a spectral residual approach

    IEEE Conference on Computer Vision and Pattern Recognition

    (2007)
  • L. Itti et al.

    Bayesian surprise attracts human attention

    Advances in Neural Information Processing Systems

    (2006)
  • L. Itti et al.

    Computational modelling of visual attention

    Nature Reviews Neuroscience

    (2001)
  • L. Itti et al.

    A model of saliency-based visual attention for rapid scene analysis

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (1998)
  • T. Judd et al.

    Learning to predict where humans look

    International Conference on Computer Vision

    (2009)