
1 Introduction

Salient object detection aims to model the human visual attention mechanism in order to segment the most distinct regions or objects from cluttered backgrounds. It has received a great deal of attention in the computer vision community because of its wide range of applications, including video summarization [1], content-aware image cropping and resizing [3, 4], and person re-identification [2].

Since the seminal approaches of Itti et al. [5] and Liu et al. [6] were reported, numerous visual saliency algorithms have been proposed to simulate the human visual attention mechanism in images and videos. Traditional salient object detection methods [7,8,9,10] adopt heuristic priors and manually designed features, which are usually considered low-level information. These generic techniques are useful for preserving fine image structures. However, such models cannot generate satisfactory predictions and are less applicable to a wide range of problems in practice. For example, it is difficult to pop out salient objects when the background and the salient objects share similar attributes (see the first row of Fig. 1). Moreover, they may fail when there are multiple salient objects (see the second row of Fig. 1).

Fig. 1. Comparisons of results of different kinds of methods. For the input images in (a), we show the salient object detection results of methods based on handcrafted features in (b) [10] and (c) [8], and the results of methods based on deep features in (d) [25] and (e) Ours.

In recent years, fully convolutional networks (FCNs) have shown a powerful ability for feature representation and have obtained impressive results in many dense labeling tasks, including semantic segmentation [11, 12], edge detection [14, 15], and pose estimation [13]. Inspired by these achievements, researchers in the saliency detection community have attempted to exploit their ability to adaptively extract semantic features from raw images. FCN-based models [16,17,18] have succeeded in overcoming the disadvantages of handcrafted-feature-based approaches and in capturing high-level information about objects and their cluttered backgrounds, thus achieving better performance. However, although high-level information is essential, low-level and mid-level features are also important for detecting salient objects. Therefore, it is a key and challenging issue to effectively and simultaneously aggregate multi-level saliency cues in a unified learning framework so as to capture both semantic objectness and detailed structure.

Motivated by these observations, we propose a simple but effective salient object detection model for pixel-wise saliency prediction, which simultaneously aggregates multi-level features to capture distinctive objectness and detailed information in complex images.

The main contributions are summarized as follows:

  1. A novel FCN-based saliency detection network model is proposed, which aggregates multi-level features as saliency cues. It performs image-to-image prediction and learns powerful, rich feature representations on complex images.

  2. We utilize a skip-layer scheme to guide low-level feature learning. With the help of deeper side information, shallower side outputs refine their predictions with more accurate localization.

  3. The proposed model achieves state-of-the-art performance, both quantitatively and qualitatively, on the DUT-OMRON [9], ECSSD [20], HKU [21], PASCAL-S [19], and SOD [34] benchmark datasets in terms of PR curves, F-measure, weighted F-measure, and MAE scores.

2 Related Work

Generally, visual saliency detection approaches can be roughly classified into two categories: human fixation prediction and salient object detection. The former [5] was originally proposed to predict the fixations of eye movements, whereas the latter aims to detect and segment each entire salient object with explicit object boundaries from its surroundings. Since this paper focuses on salient object detection based on deep learning, we briefly review existing representative approaches for salient object detection.

2.1 Handcrafted-Feature-Based Models

The majority of salient object detection approaches utilize handcrafted pixel- or superpixel-level features, such as color, texture, and orientation, in either a local or a global manner. Local methods use the rarity, contrast, or distinctiveness of each pixel or region to capture pixels or regions that locally stand out from their surroundings, while global methods estimate the saliency of each pixel or region using holistic priors over the entire image. Some researchers build graphical models over superpixels to implicitly compute contrast [9, 20], computing saliency by means of background, center, and compactness priors. However, traditional approaches, which rely mainly on handcrafted features, cannot describe semantic feature representations and may therefore fail to pop out salient objects in complex images.

2.2 Deep Neural Network-Based Models

Recently, deep learning based approaches, in particular convolutional neural networks (CNNs), have been applied to detect salient objects and have improved performance by a large margin. Wang et al. [23] propose one deep neural network to compute a saliency score for each pixel in a local context and then refine the score for each object proposal over the global view with another network. Li et al. [21] predict the saliency score of each superpixel by incorporating multi-scale features in a generic convolutional neural network. Zhao et al. [31] compute saliency by integrating global and local context into a deep learning based framework. Although these models achieve better results than traditional schemes, they are very time-consuming because they take segmented regions as the basic unit for training a deep neural network to predict saliency, and the networks have to be run many times to predict the saliency of all the superpixels in an image.

To remedy the above problems, researchers prefer to adopt FCN-like models to detect saliency in a pixel-wise manner. Some of them propose to use specific-level features for saliency prediction. For example, Lee et al. [25] propose to encode a low-level distance map and the high-level semantic features of deep CNNs. In [26], a network sharing features for segmentation and saliency tasks is proposed, and a graph Laplacian regularized nonlinear regressor model is presented for refinement.

In contrast to these methods that use only specific-level features, several works explore integrating features from different side outputs, indicating that features from all levels are potential saliency cues and are helpful for saliency prediction. Features from deep layers contain semantic information that is helpful for objectness, while features from shallow layers contain rich detailed information that is helpful for delineating explicit boundaries in high-resolution predictions.

However, how to effectively and efficiently aggregate multi-level convolutional features remains challenging. To this end, several researchers have made valuable attempts to solve this problem. Li et al. [27] combine a pixel-level fully convolutional stream and a segment-wise spatial pooling stream. The fully convolutional stream is a multi-scale fully convolutional network that generates a saliency map at one eighth of the resolution of the raw input image by exploiting visual contrast across multi-scale convolutional layers. Long et al. [11] introduce skip connections and add high-level prediction layers to intermediate layers to generate pixel-wise prediction results at multiple resolutions. Liu et al. [16] design a two-stage deep network, in which a coarse global prediction is obtained by automatically learning various global structured saliency cues, and another network is adopted to further refine the details of the saliency maps by integrating local context information.

Although these deep learning based models have made clear progress in recent years, there is still considerable room for improvement over generic FCN-based models in uniformly highlighting entire salient objects and preserving detailed boundaries against cluttered backgrounds.

3 Proposed Model

Our proposed salient object detection model consists of two main stages: (1) an FCN-based deep network for multi-level feature extraction and aggregation; and (2) a spatial coherence scheme for saliency refinement.

Fig. 2. The architecture of the proposed model. In the VGG-16 net, the names of the layers whose features are utilized are shown. The resolution of each step is also shown.

3.1 Network Architecture

To design an FCN-like network capable of accounting for both the local and global context of an image and of incorporating details from various resolutions, we develop a multi-scale deep convolutional neural network for learning discriminative saliency features (our model is shown in Fig. 2). It consists of two components: feature extraction and aggregation.

Multi-level Feature Extraction. Our proposed model adopts the VGG-16 net [28] (pre-trained on the ImageNet dataset for image classification) as the base network and modifies it to meet our requirements. We retain its 13 convolutional layers and remove the original 5th pooling layer and the fully connected layers. Thus, the modified VGG-16 is composed of 5 groups of convolutional layers. For simplicity, we denote the third sub-layer in the fifth group of convolutional layers as \(Conv5\_3\); the other convolutional layers in VGG-16 are denoted analogously. For an input image I of size \(W\times {H}\), the modified VGG net produces five feature maps \(f_i\), each with spatial resolution decreased by a stride of 2 relative to the previous one.
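As a concrete illustration, the following sketch shows one way to obtain such a truncated VGG-16 backbone. It is a minimal example written with PyTorch/torchvision rather than the Caffe setup used in the paper; the stage boundaries simply follow the standard VGG-16 layout, and it is not the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision import models

class VGG16Trunk(nn.Module):
    """Truncated VGG-16: keep the 13 conv layers, drop pool5 and the FC layers."""
    def __init__(self):
        super().__init__()
        feats = models.vgg16(weights="IMAGENET1K_V1").features
        # Slices end at the ReLU after Conv1_2, Conv2_2, Conv3_3, Conv4_3, Conv5_3;
        # pool5 (index 30) is simply never applied.
        self.stages = nn.ModuleList([
            feats[0:4],    # Conv1_1 - Conv1_2          (full resolution)
            feats[4:9],    # pool1, Conv2_1 - Conv2_2   (1/2)
            feats[9:16],   # pool2, Conv3_1 - Conv3_3   (1/4)
            feats[16:23],  # pool3, Conv4_1 - Conv4_3   (1/8)
            feats[23:30],  # pool4, Conv5_1 - Conv5_3   (1/16)
        ])

    def forward(self, x):
        side_features = []
        for stage in self.stages:
            x = stage(x)
            side_features.append(x)
        return side_features  # five maps with resolution decreasing by stride 2

f1, f2, f3, f4, f5 = VGG16Trunk()(torch.randn(1, 3, 224, 224))
```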

For each feature \(f_i\), \(i\in \{5,6, \ldots , 10\}\), extracted from VGG-16, we design a densely connected feature extraction block Convi. It utilizes a simple connectivity pattern: to preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes its own feature maps on to all subsequent layers, similar to DenseNet [24]. Figure 3 illustrates this layout schematically.
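The connectivity pattern can be sketched as follows. This is a minimal, illustrative dense block in PyTorch; the number of layers and the channel widths are assumptions made for the example, not values specified in the paper.

```python
import torch
import torch.nn as nn

class DenseFeatureBlock(nn.Module):
    """DenseNet-style connectivity: every layer takes the concatenation of the
    block input and all previous layers' outputs (channel widths are illustrative)."""
    def __init__(self, in_channels, growth=32, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels, growth, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ))
            channels += growth  # the next layer also sees this layer's output

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # inputs from all preceding layers
            features.append(out)
        return torch.cat(features, dim=1)            # pass everything onward
```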

Fig. 3. Details of the feature extraction module.

Features Aggregation. We obtain five feature maps with different resolutions from the feature extraction blocks. The feature maps of deeper convolutional layers can locate salient objects accurately, while the feature maps generated by shallower convolutional layers contain more details. To help the shallow side outputs capture more global properties, we refine these feature maps with a skip-layer structure, that is, by introducing each deeper side output into its preceding, shallower one. At each Unpool processing block, we combine features through summation. Moreover, we use a score module to integrate the different maps and obtain a fused saliency map. To make the output maps of the features at different resolutions the same size for fusion, we use deconvolutional layers for up-sampling; the strides of the last deconvolutional layers in the last four sides are set to 2, 4, 8, and 16, respectively. We then combine the features by concatenating them.
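A rough sketch of this aggregation step is given below, again in PyTorch. The channel widths, the 1x1 reduction layers, and the exact placement of the summations are assumptions made for illustration; only the skip-style summation, the deconvolutions with strides 2/4/8/16, and the concatenation before a score layer follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Aggregation(nn.Module):
    """Illustrative skip-layer aggregation: each deeper side output is upsampled,
    summed into its shallower neighbour, and the resulting side maps are upsampled
    to the input resolution and concatenated for a fused saliency score."""
    def __init__(self, side_channels=(64, 128, 256, 512, 512), mid=64):
        super().__init__()
        self.reduce = nn.ModuleList(
            [nn.Conv2d(c, mid, kernel_size=1) for c in side_channels])
        # transposed convolutions; strides 2, 4, 8, 16 for the four lower-resolution sides
        self.upsample = nn.ModuleList([
            nn.Identity(),
            nn.ConvTranspose2d(mid, mid, kernel_size=4,  stride=2,  padding=1),
            nn.ConvTranspose2d(mid, mid, kernel_size=8,  stride=4,  padding=2),
            nn.ConvTranspose2d(mid, mid, kernel_size=16, stride=8,  padding=4),
            nn.ConvTranspose2d(mid, mid, kernel_size=32, stride=16, padding=8),
        ])
        self.score = nn.Conv2d(5 * mid, 1, kernel_size=1)  # fusion / score layer

    def forward(self, sides):
        # sides: shallow-to-deep feature maps at resolutions 1, 1/2, ..., 1/16
        sides = [r(s) for r, s in zip(self.reduce, sides)]
        # skip-layer refinement: add each (upsampled) deeper map to its shallower one
        for i in range(len(sides) - 2, -1, -1):
            deeper = F.interpolate(sides[i + 1], size=sides[i].shape[-2:],
                                   mode="bilinear", align_corners=False)
            sides[i] = sides[i] + deeper
        # upsample every side to full resolution and concatenate for the fused map
        full = [up(s) for up, s in zip(self.upsample, sides)]
        return torch.sigmoid(self.score(torch.cat(full, dim=1)))
```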

3.2 Spatial Coherence

To improve spatial coherence and achieve more accurate results, we adopt a pixel-wise saliency refinement model based on a fully connected conditional random field (CRF) [29] in the inference phase. This CRF model solves a binary pixel labeling problem, which is similar to our saliency prediction task, and employs the following energy function,

$$\begin{aligned} E(L) = -\sum _{i}\log P(l_i)+\sum _{i,j}\theta _{ij}(l_i,l_j) \end{aligned}$$
(1)

where L represents a binary label assignment for all pixels. \(P(l_i)\) is the probability of pixel \(x_i\) taking label \(l_i\), which indicates the likelihood of pixel \(x_i\) being salient. Initially, \(P(1)=S_i\) and \(P(0)=1-S_i\), where \(S_i\) is the saliency score at pixel \(x_i\) in the fused saliency map S. \(\theta _{ij}(l_i,l_j)\) is a pairwise potential defined as follows,

$$\begin{aligned} \theta _{ij}(l_i,l_j)=\mu (l_i,l_j)\left[ \omega _{1}\exp \left( -\frac{||p_i -p_j||^2}{2\sigma ^{2}_{\alpha }}-\frac{||I_i-I_j||^2}{2\sigma ^2_\beta }\right) +\omega _{2}\exp \left( -\frac{||p_i-p_j||^2}{2\sigma _{\gamma }^2}\right) \right] \end{aligned}$$
(2)

where \(\mu (l_i,l_j)=1\) if \(l_i\ne {l_j}\), and zero otherwise. \(\theta _{ij}\) involves two kernels. The first kernel depends on pixel positions p and pixel intensities I; it encourages nearby pixels with similar colors to take similar saliency labels, with \(\omega _{1}\), \(\sigma ^{2}_{\alpha }\), and \(\sigma ^{2}_{\beta }\) determining its weight and the degrees of influence of spatial proximity and color similarity, respectively. The second kernel removes small isolated regions. The parameters \(\omega _{1}\), \(\omega _{2}\), \(\sigma ^{2}_{\alpha }\), \(\sigma ^{2}_{\beta }\), and \(\sigma ^{2}_{\gamma }\) are set to 3.0, 3.0, 60.0, 8.0, and 5.0, respectively, in our experiments.
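For reference, this refinement step can be reproduced with the widely used pydensecrf package roughly as follows. The mapping between the parameters above and pydensecrf's arguments (sxy, srgb, compat) is our interpretation rather than something stated in the paper, so treat this as an assumed sketch.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def refine_with_crf(image, saliency, iters=5):
    """Refine a fused saliency map with a fully connected CRF.
    image: HxWx3 uint8 RGB array; saliency: HxW float array in [0, 1]."""
    h, w = saliency.shape
    saliency = np.clip(saliency, 1e-6, 1.0 - 1e-6)
    # unary term from P(1) = S_i and P(0) = 1 - S_i
    probs = np.stack([1.0 - saliency, saliency]).astype(np.float32)
    d = dcrf.DenseCRF2D(w, h, 2)
    d.setUnaryEnergy(unary_from_softmax(probs))
    # smoothness kernel (second kernel: removes small isolated regions);
    # sxy ~ sigma_gamma, compat ~ omega_2  (our assumed mapping)
    d.addPairwiseGaussian(sxy=5, compat=3)
    # appearance kernel (first kernel: nearby pixels with similar colors should
    # take similar labels); sxy ~ sigma_alpha, srgb ~ sigma_beta, compat ~ omega_1
    d.addPairwiseBilateral(sxy=60, srgb=8, rgbim=np.ascontiguousarray(image), compat=3)
    q = np.array(d.inference(iters)).reshape(2, h, w)
    return q[1]  # refined probability of the salient label
```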

4 Experiments

4.1 Implementation Details

Our network is implemented with the publicly available Caffe library, an open-source framework for CNN training and testing. As mentioned above, we choose VGG-16 as the pre-trained model and fine-tune it for pixel-wise saliency prediction. We use the same training and validation sets as in [8]. The learning rate is set to 1e−9, the momentum parameter to 0.9, and the weight decay to 0.0005. The fusion weights in the feature integration module are all initialized to 0.2 in the training phase.
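For clarity, these training settings are summarized below as a plain Python dictionary. This is only an illustrative rendering of the values stated above (SGD with momentum is assumed; the paper does not name the solver type), not a reproduction of the authors' solver configuration.

```python
# Illustrative summary of the solver settings stated above.
solver_settings = {
    "base_lr": 1e-9,          # learning rate
    "momentum": 0.9,
    "weight_decay": 0.0005,
}
fusion_weight_init = 0.2      # initial weight of each map in the feature integration module
```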

4.2 Datasets

We conduct evaluations on five widely used salient object benchmark datasets. DUT-OMRON was manually selected from more than 140,000 natural images, each of which has one or more salient objects and a relatively complex background. As an extension of the Complex Scene Saliency Dataset (CSSD), ECSSD was obtained by aggregating images from two publicly available datasets and the Internet. HKU contains 4447 images, most of which have low contrast and multiple salient objects. PASCAL-S was generated from the PASCAL VOC dataset with 20 object categories and complex scenes. SOD is more challenging, with multiple salient objects and cluttered backgrounds in its images.

4.3 Evaluation Metrics

We adopt the precision-recall (PR) curve to evaluate our proposed model. The precision and recall are computed by binarizing the saliency map with 256 thresholds, ranging from 0 to 255, and comparing the binary map with the ground truth. The PR curves demonstrate the mean precision and recall of saliency maps at different thresholds. We also use F-measure (\(\hbox {F}_\beta \)) and weighted F-measure (\(\omega \hbox {F}_\beta \)) scores to comprehensively consider precision and recall. \(\hbox {F}_\beta \) is given by:

$$\begin{aligned} F_\beta =\frac{(1+\beta ^2)\cdot Precision\cdot Recall}{\beta ^2\cdot Precision + Recall} \end{aligned}$$
(3)

where \(\beta \) is a balance parameter to weight the precision and recall, and \(\beta ^2\) is set to 0.3. Similar to \(\hbox {F}_\beta \), \(\omega \hbox {F}_\beta \) is computed with a weighted harmonic mean of \(Precision^{w}\) and \(Recall^{w}\): \(F_{\beta }^{w}=\frac{(1+\beta ^2)\cdot Precision^{w}\cdot Recall^{w}}{\beta ^2\cdot Precision^{w} + Recall^{w}}\).

In addition, we use the mean absolute error (MAE) to evaluate the average pixel-wise error between the saliency map and the ground truth. It is defined as \(MAE = \frac{1}{{h\cdot w}}\sum \limits _{i = 1}^{h} {\sum \limits _{j = 1}^{w} {|S_{ij} - G_{ij}|}}\), where S denotes the saliency map, G denotes the ground truth, and h and w denote the height and width of the image.
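A simple reference implementation of these metrics is sketched below in NumPy. It reports precision/recall over the 256 thresholds, the resulting F-measure, and MAE; the weighted F-measure takes the same harmonic-mean form with weighted precision and recall and is omitted here. Whether the reported F-measure uses the maximum over thresholds or an adaptive threshold is not stated above, so the maximum is used purely for illustration.

```python
import numpy as np

def evaluate(saliency, gt, beta2=0.3):
    """Thresholded precision/recall, F-beta, and MAE for one saliency map.
    saliency: HxW float array in [0, 1]; gt: HxW binary ground-truth mask."""
    s = (saliency * 255).astype(np.uint8)
    gt = gt.astype(bool)
    precisions, recalls = [], []
    for t in range(256):                        # binarize with 256 thresholds
        binary = s >= t
        tp = np.logical_and(binary, gt).sum()
        precisions.append(tp / max(binary.sum(), 1))
        recalls.append(tp / max(gt.sum(), 1))
    p, r = np.array(precisions), np.array(recalls)
    f_beta = (1 + beta2) * p * r / np.maximum(beta2 * p + r, 1e-8)
    mae = np.abs(saliency - gt.astype(np.float64)).mean()
    return p, r, f_beta.max(), mae
```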

4.4 Performance Comparison with State of the Art

We compare our proposed approach with 10 state-of-the-art methods, including UCF [33], MTDS [26], LEGS [23], MDF [21], KSR [30], DRFI [8], SMD [10], ELD [25], MC [31], and ELE [32]. We use either the implementations or the saliency maps provided by the authors for fair comparison. Note that MC, UCF, ELD, MTDS, LEGS, MDF, KSR are deep learning based models.

Table 1. \(\hbox {F}_\beta \) and \(\omega \hbox {F}_\beta \) scores of saliency maps produced by different approaches on the DUT-OMRON, ECSSD, HKU, PASCAL-S, and SOD datasets (the top models are highlighted in bold; ‘-’ denotes that the saliency maps are not available).
Fig. 4. PR curves of saliency maps produced by different approaches on four datasets: (a) ECSSD, (b) HKU, (c) PASCAL-S, and (d) SOD.

Fig. 5. MAE scores of the saliency maps produced by different models on five datasets. Lower is better.

For quantitative evaluation, we show comparison results with PR curves and MAE scores in Figs. 4 and 5, and the comparisons of \(\hbox {F}_\beta \) and \(\omega \hbox {F}_\beta \) are displayed in Table 1. We do not show the PR curves on DUT-OMRON due to limited space. In terms of \(\hbox {F}_\beta \), \(\omega \hbox {F}_\beta \), and MAE scores, our model outperforms all other methods, especially on complex datasets. For the PR curves, our model also achieves good performance on the four datasets and is only slightly worse than UCF on ECSSD and PASCAL-S.

We show visual comparisons in Fig. 6. Our model not only detects and localizes salient objects accurately, but also preserves object details well. It handles various complex situations, including small salient objects (fourth and fifth rows), cluttered backgrounds and salient objects (first and sixth rows), and backgrounds and salient objects sharing a similar appearance (second, third, and fifth rows).

Fig. 6. Visual comparison results based on different models. (a) Input, (b) ground truth, (c) SMD, (d) DRFI, (e) LEGS, (f) MC, (g) MDF, (h) ELD, (i) MTDS, (j) UCF, and (k) Ours.

4.5 Evaluation on CRF Scheme

A fully connected CRF scheme is incorporated to further uniformly highlight the interior regions of salient objects and preserve explicit contours in the saliency maps produced by our multi-scale FCN-like network. To validate its effectiveness, we also evaluate the performance of our approach with and without (w/o) the CRF scheme on the five benchmark datasets in terms of \(\hbox {F}_\beta \), \(\omega \hbox {F}_\beta \), and MAE scores. The results, displayed in Table 2, show that the CRF scheme improves the accuracy of our proposed model.

Table 2. Comparisons of our approach with and without (w/o) the CRF scheme in terms of \(\hbox {F}_\beta \), \(\omega \hbox {F}_\beta \), and MAE.

5 Conclusion

In this paper, we propose a simple but effective approach for pixel-wise salient object detection based on a fully convolutional network, which extracts multi-level features and exploits preceding information through a densely connected module. Moreover, features from deeper layers are connected to shallower ones through a skip-layer structure to guide the learning of the shallower layers. In addition, a fusion layer combines these rich features to generate a saliency map. To obtain more fine-grained saliency detection results, we introduce a refinement scheme based on a fully connected CRF to further improve performance. Experimental results demonstrate that our proposed approach achieves encouraging performance against 10 state-of-the-art methods on five benchmark datasets.