Low-level and high-level prior learning for visual saliency estimation
Introduction
The surrounding environment contains a tremendous amount of visual information, far more than the human visual system (HVS) can fully process [24]. The HVS therefore attends to only a few parts of a scene while neglecting the rest, a phenomenon that psychologists refer to as visual attention. To automatically predict where people look in an image, visual attention analysis has been studied in computer vision for decades, yet it remains an open problem. Recently, approaching computer vision problems from a psychological viewpoint has become an important research track. Because visual attention has been studied for more than a century in psychology, it is reasonable to adopt useful concepts from that field to address visual attention analysis in multimedia modeling [10], [17], [29], image retrieval [21], [23], [30] and computer vision [9], [22].
Existing visual attention methods can be broadly divided into three groups according to their driving conditions: information-driven methods, low-level feature-driven methods and hybrid feature-driven methods.
Information-driven methods [2] approach the visual attention problem from a signal-processing perspective. Hou and Zhang [11] analyze the log spectrum of each image to obtain the spectral residual, which is transformed back to the spatial domain to produce a saliency map. Bruce and Tsotsos [1], [2] argue that salient regions carry more information than other regions and propose “Attention based on Information Maximization (AIM)”, which maximizes self-information in the image; this approach performs marginally better than the previous models. Zhang et al. [36] further use spatiotemporal visual features to generalize the static image saliency model to dynamic scenes, with self-information again representing the informative level.
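The spectral residual idea described above can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the authors' code: the box-filter size and the smoothing step are assumptions standing in for the local averaging and Gaussian post-filter used in the original method.

```python
import numpy as np

def box_blur(a, k=3):
    """Local mean via a k x k box filter (edge-padded)."""
    pad = k // 2
    p = np.pad(a, pad, mode="edge")
    out = np.zeros_like(a)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out / (k * k)

def spectral_residual_saliency(image):
    """Spectral residual (after Hou & Zhang): subtract the locally
    averaged log-amplitude spectrum from the log-amplitude spectrum,
    then transform back to the spatial domain with the original phase."""
    f = np.fft.fft2(image)
    log_amp = np.log(np.abs(f) + 1e-8)   # log spectrum
    phase = np.angle(f)
    residual = log_amp - box_blur(log_amp, 3)  # spectral residual
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    return box_blur(sal, 3)  # light smoothing of the saliency map
```

Because the method operates on a single grayscale channel and needs no training, it is often used as a fast baseline for comparison.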
The low-level feature-driven method computes the saliency map from contrasts over a set of low-level features, such as color, intensity, and orientation, which are extracted from the original image at different scales and orientations. This method performs well on some natural scenes and synthetic data. Itti et al. [14] compute the saliency value using a center-surround filter to capture spatial discontinuities. Meur et al. present a method that computes the saliency map by fusing several low-level features (intensity, color, orientation). Oliva and Torralba [20] find that the shape of the scene is also an important factor in human perception and introduce the spatial envelope to describe scene shape in visual attention analysis. However, for natural scenes with complex scenarios, the low-level feature-driven method cannot correctly predict where humans look. Fig. 1(b) is the saliency map generated by Itti et al. [14] from color, intensity and orientation features; Fig. 1(c) is the saliency map obtained by Oliva and Torralba [20] based on the spatial envelope. The real eye-tracking data is given in Fig. 1(e). There is a noticeable discrepancy between these saliency maps and the real eye-tracking data.
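The center-surround computation used by such low-level models can be sketched as a difference between a fine and a coarse pyramid level. This is a minimal illustration, not Itti et al.'s implementation: 2x2 averaging stands in for a Gaussian pyramid, nearest-neighbour repetition stands in for interpolation, and the level indices are assumptions.

```python
import numpy as np

def downsample(a):
    """Halve resolution by 2x2 averaging (stand-in for a Gaussian pyramid level)."""
    h, w = a.shape[0] // 2 * 2, a.shape[1] // 2 * 2
    return a[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def center_surround(feature, c=1, s=3):
    """Center-surround difference |C - S| between a fine pyramid level c
    and a coarser level s, upsampled back to level c's resolution."""
    pyr = [feature]
    for _ in range(s):
        pyr.append(downsample(pyr[-1]))
    center, surround = pyr[c], pyr[s]
    factor = 2 ** (s - c)  # upsample by nearest-neighbour repetition
    up = np.repeat(np.repeat(surround, factor, axis=0), factor, axis=1)
    return np.abs(center - up[:center.shape[0], :center.shape[1]])
```

Applied per feature channel (color, intensity, orientation), the resulting maps are normalized and summed into a single low-level saliency map.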
The hybrid feature-driven method accounts for not only low-level features but also high-level features, such as faces, humans and other objects [4], [7], [15], to obtain better results; it is also called a concept-driven method. Cerf et al. [4] add face detection to the low-level feature-driven model [14] and significantly improve the saliency map’s accuracy. Judd et al. [15] expand the hybrid model further to include not only high-level features but also mid-level features (the horizon line); they then train an SVM classifier on an eye-tracking data set to learn the weights of the different features for saliency map construction. Fig. 1(d) shows that it achieves better results than the models of [14] and [20]. However, because this method ignores the inter-relationships among different high-level features (objects), the salient areas of the map do not match the eye-tracking data very well.
Apart from the above three groups of methods, other models, such as Bayesian model [12], [32], efficient coding [25], and multiview learning [31], [34], [33], [28] provide some different views for the topic as well.
Our proposed technique is a hybrid feature-driven method. In contrast to previous hybrid feature-driven models, our approach performs both low-level prior learning and high-level prior learning for visual saliency estimation. In the low-level prior learning part, the concept of “Chance of Happening (CoH)” is introduced when deducing the low-level saliency value. Additionally, two low-level priors, i.e., Color Statistics-based Priors (CSP) and Spatial Correlation-based Priors (SCP), are learned to describe the color distribution and contrast distribution in natural images; they are used to compute the CoH value as well as the low-level saliency value. In the high-level prior learning part, a relative relationship is learned to describe the conditional priority between different objects in images, which is used to compute the high-level saliency value. Finally, a new saliency model is presented by integrating the low-level saliency, the high-level saliency and the Center Bias Prior (CBP), in which the weights corresponding to the low-level and high-level components are learned from the eye-tracking data set.
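The final integration step can be sketched as a weighted combination of the two saliency maps modulated by a Gaussian center bias. The fixed weights and the Gaussian width below are placeholder assumptions; in the paper both the weights and the priors are learned from eye-tracking data.

```python
import numpy as np

def gaussian_center_prior(h, w, sigma=0.3):
    """Center Bias Prior (CBP): an isotropic 2D Gaussian centred on the
    image, with coordinates normalised to roughly [-0.5, 0.5]."""
    ys = (np.arange(h) - (h - 1) / 2) / h
    xs = (np.arange(w) - (w - 1) / 2) / w
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.exp(-(xx**2 + yy**2) / (2 * sigma**2))

def normalize(m):
    """Rescale a map to [0, 1]."""
    rng = m.max() - m.min()
    return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)

def fuse_saliency(s_low, s_high, w_low=0.6, w_high=0.4):
    """Weighted fusion of low- and high-level saliency maps, modulated
    by the CBP. The weights here are illustrative placeholders."""
    combined = w_low * normalize(s_low) + w_high * normalize(s_high)
    return normalize(combined * gaussian_center_prior(*combined.shape))
```

Normalizing each map before fusion keeps the learned weights comparable across feature channels, a common practice in saliency fusion.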
The major contributions of this paper are: (1) a novel hybrid feature-driven model that performs both low-level and high-level prior learning for visual saliency estimation; (2) the concept of “Chance of Happening” introduced for low-level prior learning; and (3) relative relationships defined to describe the conditional priority between different objects in images.
The rest of this paper is organized as follows. We discuss the motivation of the proposed approach in Section 2. Section 3 describes our proposed visual saliency estimation, which accounts for the low-level saliency, the high-level saliency and the center bias prior. Experimental results and analysis are given in Section 4. We finally conclude in Section 5.
Motivation of the proposed method
It is known that visual stimuli keep the HVS active and ready to drive the movements of the eyes, which leads to the visual attention mechanism. According to psychological research [13], visual stimuli can be divided into two types based on the reaction time of the visual neurons. One type is independent of any specific task and is processed very rapidly, in 25–50 ms per item; an image’s color, intensity, and contrast belong to this type. It
The proposed visual saliency estimation approach
To enhance the readability of this paper, Table 1 lists the important notations used in this paper.
Experimental results and analysis
For visual object recognition, a face detector [26], a person and car detector [6] and a word detector [27] are used to extract high-level information in our approach. In the experiments, both qualitative and quantitative analyses are conducted to validate the effectiveness of our method. Because all of the parameters used in the proposed model are learned from a large number of natural color images, both the qualitative and the quantitative analysis demonstrate that the proposed model
Conclusion
In this paper, we consider the visual attention problem with respect to two aspects: the low-level and the high-level prior. On the one hand, the low-level feature priors, i.e., Color Statistics-based Priors (CSP) and Spatial Correlation-based Priors (SCP), are learned to describe the color distribution and the contrast distribution in natural images. On the other hand, the high-level prior, i.e., the relative relationship between objects, is learned to describe the conditional priority
References (36)
- et al., Multi-label ensemble based on variable pairwise constraint projection, Information Sciences, 2013.
- et al., Data embedding for vector quantization image processing on the basis of adjoining state-codebook mapping, Information Sciences, 2013.
- et al., Pairwise constraints based multiview features fusion for scene classification, Pattern Recognition, 2013.
- et al., Saliency based on information maximization, Advances in Neural Information Processing Systems, 2006.
- et al., Saliency, attention, and visual search: an information theoretic approach, Journal of Vision, 2009.
- et al., Faces and text attract gaze independent of the task: experimental data and computer model, Journal of Vision, 2009.
- et al., Predicting human gaze using low-level saliency combined with face detection, Neural Information Processing Systems, 2007.
- et al., From Gestalt Theory to Image Analysis, A Probabilistic Approach, 2008.
- et al., A discriminatively trained, multiscale, deformable part model, IEEE Conference on Computer Vision and Pattern Recognition, 2008.
- et al., Context-aware saliency detection, IEEE Conference on Computer Vision and Pattern Recognition, 2010.
- NeNMF: an optimal gradient method for nonnegative matrix factorization, IEEE Transactions on Signal Processing.
- Video accessibility enhancement for hearing impaired users, ACM Transactions on Multimedia Computing, Communications, and Applications.
- Multimedia question answering, IEEE MultiMedia.
- Saliency detection: a spectral residual approach, IEEE Conference on Computer Vision and Pattern Recognition.
- Bayesian surprise attracts human attention, Advances in Neural Information Processing Systems.
- Computational modelling of visual attention, Nature Reviews Neuroscience.
- A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Learning to predict where humans look, International Conference on Computer Vision.