
1 Introduction

In recent years, co-saliency detection, which detects the common salient regions among multiple images [1,2,3,4], has emerged as an important topic in saliency detection. Different from the traditional single-image saliency detection model, a co-saliency detection model aims at discovering the common salient objects from an image group containing two or more relevant images, while the categories, intrinsic characteristics, and locations of the salient objects are entirely unknown [5]. The co-salient objects simultaneously exhibit two properties: (1) the co-salient regions should be salient with respect to the background in each image, and (2) all these co-salient regions should be similar in appearance across the images. Due to its good extensibility, co-saliency detection has been widely used in many computer vision tasks, such as foreground co-segmentation [6], object co-localization and detection [7], and image matching [8].

Most existing co-saliency detection models focus on RGB images and have achieved satisfactory performance [9,10,11,12,13,14,15,16]. Recently, co-saliency detection for RGBD images has become a popular and challenging problem. RGBD co-saliency detection was first discussed in [17], which proposed an RGBD co-saliency model using bagging-based clustering. Then, Cong et al. [18] proposed an iterative RGBD co-saliency framework, which uses existing single-image saliency maps as the initialization and generates the final RGBD co-saliency map with a refinement-cycle model. In another paper [19], they proposed a co-saliency model based on multi-constraint feature matching and cross label propagation. In this paper, to combine depth and repeatability, we first propose a matching algorithm based on the multi-constraint distance between neighboring superpixel sets to measure the similarity between images and to capture the repeated occurrence of regions. Second, inspired by Ref. [23], we propose a 2-layer co-cellular automata model to propagate saliency within and across images, so that the target regions are highlighted completely. Besides, depth information and high-dimensional features are incorporated in our method to achieve better results. The major contributions of the proposed co-saliency detection method are summarized as follows.

  1. We extract high-dimensional features for each image region with a deep convolutional neural network as a semantic cue and, for the first time, combine it with the color cue, depth cue, and saliency cue to compute the similarity between two superpixels.

  2. A modified 2-layer co-cellular automata model is used to propagate saliency within and across images, so that the target regions are highlighted completely.

  3. Both semantic information and depth information are considered in the cellular automata to optimize the co-saliency model.

The rest of this paper is organized as follows. Section 2 introduces the proposed method in detail. The experimental results with qualitative and quantitative evaluations are presented in Sect. 3. Finally, the conclusion is drawn in Sect. 4.

2 Proposed Method

The proposed RGBD co-saliency framework is introduced in this section. Figure 1 shows the framework of the proposed method. Our method is initialized with existing single-image saliency maps. We then propose a matching algorithm based on the multi-constraint distance between neighboring superpixel sets to measure the similarity between images and to capture the repeated occurrence of regions. Finally, inspired by Ref. [23], we propose a 2-layer co-cellular automata model to propagate saliency within and across images, so that the target regions are highlighted completely.

Fig. 1. The framework of our algorithm. (a) Input RGB image and the corresponding depth map. (b) Initialization. (c) Superpixel matching and parallel evolution via co-cellular automata. (d) The final saliency result.

Notations:

Given N input images \( \left\{ {I^{i} } \right\}_{i = 1}^{N} \), the corresponding depth maps are denoted as \( \left\{ {D^{i} } \right\}_{i = 1}^{N} \). The \( M_{i} \) single-image saliency maps for image Ii produced by existing single-image saliency models are represented as \( S^{i} = \left\{ {S_{j}^{i} } \right\}_{j = 1}^{{M_{i} }} \). In our method, the superpixel-level region is regarded as the basic processing unit. Thus, each RGB image Ii is first abstracted into superpixels \( R^{i} = \left\{ {r_{m}^{i} } \right\}_{m = 1}^{{N_{i} }} \) using the SLIC algorithm [24], where Ni is the number of superpixels for image Ii.
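A minimal sketch of this abstraction step is given below, assuming the scikit-image implementation of SLIC as a stand-in for the SLIC algorithm of [24]; about 200 superpixels per image are used, as specified later in Sect. 3.1.

```python
import numpy as np
from skimage.segmentation import slic

def abstract_into_superpixels(rgb_image, n_segments=200):
    """Assign every pixel of image I^i to a superpixel r_m^i; returns a label map."""
    return slic(rgb_image, n_segments=n_segments, compactness=10)

def superpixel_mean(feature_map, labels):
    """Average a per-pixel feature map of shape (H, W, C), e.g. color or depth,
    over each superpixel, giving one feature vector per region."""
    ids = np.unique(labels)
    return np.stack([feature_map[labels == m].mean(axis=0) for m in ids])
```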

2.1 Initialization

The proposed co-saliency framework aims at discovering the co-salient objects from multiple images in a group with the assistance of existing single-image saliency maps. Therefore, several saliency maps produced by single-image saliency models are used to initialize the framework. It is well known that different saliency methods have different strengths in detecting salient regions; to some extent, these saliency maps are complementary, so the fused result can inherit the merits of the individual maps and provides a more robust detection baseline. In our method, a simple average is used to obtain a generalized initialization. The initialized saliency map for image Ii is defined as:

$$ S_{f}^{i} \left( {r_{m}^{i} } \right) = \frac{1}{{M_{i} }}\sum\limits_{j = 1}^{{M_{i} }} {S_{j}^{i} \left( {r_{m}^{i} } \right)} $$
(1)

where \( S_{j}^{i} \left( {r_{m}^{i} } \right) \) denotes the saliency value of superpixel \( r_{m}^{i} \) produced by the jth saliency method for image Ii. In our experiments, four saliency methods, namely RC [20], DCLC [21], RRWR [22], and BSCA [23], are used to produce the initialized saliency map.
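A minimal sketch of the initialization in Eq. (1) is shown below: the \( M_{i} \) single-image saliency maps (here assumed to be superpixel-level vectors produced by RC, DCLC, RRWR, and BSCA) are fused by simple averaging. The final normalization to [0, 1] is an assumption for comparability across images.

```python
import numpy as np

def initialize_saliency(single_saliency_maps):
    """single_saliency_maps: list of M_i arrays, each of length N_i
    (one saliency value per superpixel). Returns S_f^i of Eq. (1)."""
    stacked = np.stack(single_saliency_maps, axis=0)   # shape (M_i, N_i)
    fused = stacked.mean(axis=0)                       # average fusion, Eq. (1)
    # normalize to [0, 1] so different images are comparable (assumption)
    return (fused - fused.min()) / (fused.max() - fused.min() + 1e-12)
```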

2.2 Superpixel Matching via Multi-constraint Cues

For computational convenience and to preserve intrinsic structural information, each image is first segmented into a set of superpixels by the simple linear iterative clustering (SLIC) algorithm [24]. The core of detecting the common salient object is superpixel matching across images. In this paper, superpixel matching means that, for any superpixel \( r_{m}^{i} \) in image Ii, we find a set of superpixels with high similarity in another image Ij. Note that not all superpixels can be matched, and one superpixel can have several matching superpixels in other images. In this paper, a high-dimensional semantic cue and low-dimensional cues are both utilized to compute the similarity between images.

High-Dimensional Cue.

We extract high-dimensional features for each image region with a deep convolutional neural network originally trained on the ImageNet dataset, using Caffe, an open-source framework for CNN training and testing. The architecture of this CNN has eight layers, including five convolutional layers and three fully connected layers. Features are extracted from the output of the second-to-last fully connected layer, which has 4096 neurons. Although this CNN was originally trained on a dataset for visual recognition, the automatically extracted CNN features turn out to be highly versatile and can be more effective than traditional handcrafted features on other visual computing tasks.

Since an image region may have an irregular shape while CNN features have to be extracted from a rectangular region, we make the CNN features relevant only to the pixels inside the region as follows: the rectangular region for CNN feature extraction is defined as the bounding box of the image region, and the pixels outside the region but inside its bounding box are filled with the mean pixel values at the same locations across all ImageNet training images. These pixel values become zero after mean subtraction and do not affect subsequent results. We warp the region in the bounding box to a square of 227 × 227 pixels to make it compatible with the deep CNN trained for ImageNet. The warped RGB image region is then fed to the deep CNN, and a 4096-dimensional feature vector is obtained by forward propagating the mean-subtracted input region through all the convolutional and fully connected layers. We denote this feature vector by F.
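A hedged sketch of this region-level feature extraction is given below. It simplifies the per-location ImageNet mean image to per-channel means (an assumption), and `cnn_fc7` is a hypothetical callable standing in for the forward pass of the Caffe model described above.

```python
import numpy as np
import cv2

IMAGENET_MEAN = np.array([104.0, 117.0, 123.0], dtype=np.float32)  # BGR channel means (assumption)

def region_feature(image_bgr, region_mask, cnn_fc7):
    """Extract the 4096-d feature F for one superpixel region."""
    ys, xs = np.where(region_mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    patch = image_bgr[y0:y1, x0:x1].astype(np.float32).copy()
    inside = region_mask[y0:y1, x0:x1]
    patch[~inside] = IMAGENET_MEAN           # fill pixels outside the region with mean values
    patch = cv2.resize(patch, (227, 227))    # warp bounding box to 227 x 227
    patch -= IMAGENET_MEAN                   # mean subtraction; filled pixels become zero
    return cnn_fc7(patch)                    # hypothetical forward pass -> 4096-d vector F
```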

Thus, the high-dimensional semantic similarity is defined as:

$$ S_{h} \left( {r_{m}^{i} ,r_{n}^{j} } \right) = \exp \left( { - \frac{{\left\| {F_{m}^{i} - F_{n}^{j} } \right\|_{2} }}{{\sigma^{2} }}} \right) $$
(2)

where \( F_{m}^{i} \) denotes the 4096-dimensional feature vector of superpixel \( r_{m}^{i} \), and σ2 is a constant.

Low-Dimensional Cue.

Three low-dimensional cues, namely the color cue, the depth cue, and the saliency cue, are combined to form a multi-constraint cue.

RGB Similarity.

The color histogram [25] is used to represent the RGB feature at the superpixel level, denoted as \( HC_{m}^{i} \). Then, the Chi-square measure is employed to compute the feature difference. Thus, the RGB similarity is defined as:

$$ S_{c} \left( {r_{m}^{i} ,r_{n}^{j} } \right) = 1 - \frac{1}{2}\chi^{2} \left( {HC_{m}^{i} ,HC_{n}^{j} } \right) $$
(3)

where \( r_{m}^{i} \) and \( r_{n}^{j} \) are the superpixels in image Ii and Ij, respectively, and \( \chi^{2} \left( \cdot \right) \) denotes the Chi-square distance function.
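A short sketch of the RGB similarity of Eq. (3) follows. The exact histogram layout is an assumption (here a per-channel histogram with 8 bins, L1-normalized); the paper uses the color histogram of [25].

```python
import numpy as np

def color_histogram(pixels_rgb, bins=8):
    """Per-superpixel color histogram: pixels_rgb has shape (n_pixels, 3)."""
    hists = [np.histogram(pixels_rgb[:, c], bins=bins, range=(0, 255))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(np.float64)
    return h / (h.sum() + 1e-12)

def rgb_similarity(h1, h2):
    chi2 = np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-12))  # Chi-square distance
    return 1.0 - 0.5 * chi2                            # Eq. (3)
```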

Depth Similarity.

The final depth similarity measurement is composed of two depth consistency measurements, namely depth value consistency and depth contrast consistency, and is defined as:

$$ S_{d} \left( {r_{m}^{i} ,r_{n}^{j} } \right) = \exp \left( { - \frac{{W_{d} \left( {r_{m}^{i} ,r_{n}^{j} } \right) + W_{c} \left( {r_{m}^{i} ,r_{n}^{j} } \right)}}{{\sigma^{2} }}} \right) $$
(4)

where \( W_{d} \left( {r_{m}^{i} ,r_{n}^{j} } \right) \) is the depth value consistency measurement that evaluates the inter-image depth consistency, based on the fact that the common regions should have similar depth values:

$$ W_{d} \left( {r_{m}^{i} ,r_{n}^{j} } \right) = \left| {d_{m}^{i} - d_{n}^{j} } \right| $$
(5)

\( W_{c} \left( {r_{m}^{i} ,r_{n}^{j} } \right) \) describes the depth contrast consistency, because the common regions should exhibit similar characteristics in the depth contrast measurement:

$$ W_{c} \left( {r_{m}^{i} ,r_{n}^{j} } \right) = \left| {D_{c} \left( {r_{m}^{i} } \right) - D_{c} \left( {r_{n}^{j} } \right)} \right| $$
(6)

with

$$ D_{c} (r_{m}^{i} ) = \sum\limits_{k \ne m} {\left| {d_{m}^{i} - d_{k}^{i} } \right|\exp \left( { - \frac{{\left\| {p_{m}^{i} - p_{k}^{i} } \right\|_{2} }}{{\sigma^{2} }}} \right)} $$
(7)

where \( D_{c} \left( {r_{m}^{i} } \right) \) denotes the depth contrast of superpixel \( r_{m}^{i} \), \( p_{m}^{i} \) denotes the position of superpixel \( r_{m}^{i} \), and σ2 is a constant.
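The sketch below implements the depth similarity of Eqs. (4)-(7) under the assumptions that `d` holds the mean depth of every superpixel, `p` holds superpixel centers normalized to [0, 1], and sigma2 is the same constant used in the other Gaussian terms.

```python
import numpy as np

def depth_contrast(d, p, sigma2=0.1):
    """D_c(r_m): spatially weighted depth contrast of every superpixel (Eq. (7))."""
    diff = np.abs(d[:, None] - d[None, :])                        # |d_m - d_k|
    w = np.exp(-np.linalg.norm(p[:, None] - p[None, :], axis=2) / sigma2)
    np.fill_diagonal(w, 0.0)                                      # exclude k = m
    return (diff * w).sum(axis=1)

def depth_similarity(d_i, p_i, d_j, p_j, sigma2=0.1):
    """S_d between all superpixels of image I^i and image I^j (Eq. (4))."""
    Wd = np.abs(d_i[:, None] - d_j[None, :])                      # Eq. (5)
    Dc_i, Dc_j = depth_contrast(d_i, p_i, sigma2), depth_contrast(d_j, p_j, sigma2)
    Wc = np.abs(Dc_i[:, None] - Dc_j[None, :])                    # Eq. (6)
    return np.exp(-(Wd + Wc) / sigma2)
```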

Saliency Similarity.

Based on the prior that the common regions should appear more similar in the single-image saliency maps than other regions, the fused saliency map from the initialization in Eq. (1) is used to define the saliency similarity measurement in our work:

$$ S_{s} \left( {r_{m}^{i} ,r_{n}^{j} } \right) = \exp \left( { - \left| {S_{sp}^{i} \left( {r_{m}^{i} } \right) - S_{sp}^{j} \left( {r_{n}^{j} } \right)} \right|} \right) $$
(8)

where \( S_{sp}^{i} \left( {r_{m}^{i} } \right) \) is the saliency score of superpixel \( r_{m}^{i} \) obtained from the initialization.

Based on these cues, the combined similarity measurement is defined as the average of the four similarity measurements:

$$ S_{M} \left( {r_{m}^{i} ,r_{n}^{j} } \right) = \frac{{S_{h} \left( {r_{m}^{i} ,r_{n}^{j} } \right) + S_{c} \left( {r_{m}^{i} ,r_{n}^{j} } \right) + S_{d} \left( {r_{m}^{i} ,r_{n}^{j} } \right) + S_{s} \left( {r_{m}^{i} ,r_{n}^{j} } \right)}}{4} $$
(9)

where \( S_{h} \left( {r_{m}^{i} ,r_{n}^{j} } \right) \), \( S_{c} \left( {r_{m}^{i} ,r_{n}^{j} } \right) \), \( S_{d} \left( {r_{m}^{i} ,r_{n}^{j} } \right) \), and \( S_{s} \left( {r_{m}^{i} ,r_{n}^{j} } \right) \) are the normalized semantic, RGB, depth, and saliency similarities between superpixel \( r_{m}^{i} \) and \( r_{n}^{j} \), respectively. A larger \( S_{M} \left( {r_{m}^{i} ,r_{n}^{j} } \right) \) value corresponds to greater similarity between two superpixels.
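The sketch below assembles the multi-constraint similarity of Eqs. (2), (8), and (9): the normalized semantic, RGB, depth, and saliency similarities between every superpixel pair of two images are averaged. `F_i`/`F_j` are CNN feature matrices, `H_i`/`H_j` color histograms, `s_i`/`s_j` the initialized saliency values from Eq. (1), and `S_d` the output of the depth similarity above; sigma2 is an assumed constant.

```python
import numpy as np
from scipy.spatial.distance import cdist

def multi_constraint_similarity(F_i, F_j, H_i, H_j, S_d, s_i, s_j, sigma2=0.1):
    # semantic similarity, Eq. (2)
    S_h = np.exp(-cdist(F_i, F_j) / sigma2)
    # RGB similarity, Eq. (3), via the Chi-square distance between histograms
    chi2 = ((H_i[:, None] - H_j[None, :]) ** 2 /
            (H_i[:, None] + H_j[None, :] + 1e-12)).sum(axis=2)
    S_c = 1.0 - 0.5 * chi2
    # saliency similarity, Eq. (8)
    S_s = np.exp(-np.abs(s_i[:, None] - s_j[None, :]))
    # average of the four cues, Eq. (9)
    return (S_h + S_c + S_d + S_s) / 4.0
```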

2.3 Co-saliency Detection via 2-Layer Co-cellular Automata

In Ref. [23], a cellular automata method was proposed to calculate the saliency of a single image. The core idea of this method is that the saliency of one superpixel is affected by itself and its adjacent superpixels, and the saliency values of all superpixels converge after several rounds of propagation. However, for co-saliency detection, as shown in Fig. 2, the saliency of one superpixel is affected by both its intra-image neighbors (blue and yellow spots) and its inter-image neighbors (purple spots).

Fig. 2. Co-saliency detection model. The saliency of one superpixel (red spot) is affected not only by the adjacent superpixels (blue and yellow spots) but also by the matched superpixels in other images (purple spots). (Color figure online)

Based on this idea, we propose a 2-layer co-cellular automata model with intra-image and inter-image propagation:

$$ S_{m + 1}^{i} = (1 - \kappa_{1} - \kappa_{2} )S_{m}^{i} + \kappa_{1} F_{\mathrm{intra}}^{i} S_{m}^{i} + \kappa_{2} \sum\limits_{j = 1,j \ne i}^{N} {F_{\mathrm{inter}}^{i,j} S_{m}^{j} } $$
(10)

where \( S_{m}^{i} \) is the saliency of all superpixels in Ii after m status updates, \( S_{0}^{i} \) is the initial saliency from Eq. (1), \( F_{\mathrm{intra}}^{i} \) is the influence matrix of superpixels within Ii, \( F_{\mathrm{inter}}^{i,j} \) is the influence matrix from Ij to Ii, and \( \kappa_{1} \) and \( \kappa_{2} \) are impact factors. In this model, we utilize the intra-image structural information, and the inter-image correspondence is considered as well.
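A minimal sketch of the update rule of Eq. (10) is shown below; the values of κ1, κ2 and the iteration count follow the settings given later in Sect. 3.1 (0.3, 0.5, and 20 iterations).

```python
import numpy as np

def co_cellular_automata(S0, F_intra, F_inter, kappa1=0.3, kappa2=0.5, iters=20):
    """S0: list of N initial saliency vectors (one per image, from Eq. (1)).
    F_intra[i]: (N_i, N_i) row-normalized intra-image influence matrix.
    F_inter[i][j]: (N_i, N_j) row-normalized inter-image influence matrix."""
    S = [s.copy() for s in S0]
    n = len(S)
    for _ in range(iters):
        S_new = []
        for i in range(n):
            inter = sum(F_inter[i][j] @ S[j] for j in range(n) if j != i)
            S_new.append((1 - kappa1 - kappa2) * S[i]
                         + kappa1 * (F_intra[i] @ S[i])
                         + kappa2 * inter)                      # Eq. (10)
        S = S_new
    return S
```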

Intra-image Influence Matrix.

In Ref. [23], the similarity of intra-image superpixels is calculated by color similarity in the CIELab color space. Here, we also consider the effect of the depth cue and the semantic cue. We define the initial intra-image influence matrix as \( F_{\mathrm{intra}}^{\prime i} = \left[ {f_{s,t}^{i} } \right]_{{N^{i} \times N^{i} }} \):

$$ f_{s,t}^{i} = \left\{ {\begin{array}{*{20}l} {\exp \left( { - \frac{{\left\| {c_{s}^{i} - c_{t}^{i} } \right\|^{2} + \left\| {d_{s}^{i} - d_{t}^{i} } \right\| + \left\| {F_{s}^{i} - F_{t}^{i} } \right\|^{2} }}{{2\sigma_{f}^{2} }}} \right)} \hfill & {t \in {\rm N}_{s}^{i} } \hfill \\ 0 \hfill & {t = s{\text{ or otherwise}}} \hfill \\ \end{array} } \right. $$
(11)

where \( {\rm N}_{s}^{i} \) is the 2-layer adjacent region of superpixel s, which includes not only its neighbors but also its neighbors' neighbors. To normalize the impact factor matrix, a degree matrix \( D_{\mathrm{intra}}^{i} = diag\left\{ {d_{1} ,d_{2} , \ldots ,d_{{N^{i} }} } \right\} \) is constructed, where \( d_{s} = \sum\nolimits_{t} {f_{s,t}^{i} } \). Finally, the row-normalized impact factor matrix is calculated as follows:

$$ F_{\mathrm{intra}}^{i} = \left[ {D_{\mathrm{intra}}^{i} } \right]^{ - 1} \cdot F_{\mathrm{intra}}^{\prime i} $$
(12)
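A sketch of the intra-image influence matrix of Eqs. (11)-(12) follows. Here `color`, `depth`, and `deep_feat` are the per-superpixel Lab color, depth value, and CNN feature (assumed normalized), `neighbors2` is the 2-layer adjacency described above, and sigma_f is an assumed constant.

```python
import numpy as np

def intra_influence(color, depth, deep_feat, neighbors2, sigma_f=0.2):
    n = color.shape[0]
    F = np.zeros((n, n))
    for s in range(n):
        for t in neighbors2[s]:
            if t == s:
                continue
            dist = (np.sum((color[s] - color[t]) ** 2)
                    + np.abs(depth[s] - depth[t])
                    + np.sum((deep_feat[s] - deep_feat[t]) ** 2))
            F[s, t] = np.exp(-dist / (2 * sigma_f ** 2))          # Eq. (11)
    D_inv = 1.0 / np.maximum(F.sum(axis=1, keepdims=True), 1e-12)
    return D_inv * F                                              # row normalization, Eq. (12)
```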

Inter-image Influence Matrix.

To utilize the effect of the other images in the same set, we use the method introduced in Sect. 2.2 to obtain \( S_{M} \left( {r_{m}^{i} ,r_{n}^{j} } \right) \); the initial inter-image influence matrix is then defined as \( F_{\mathrm{inter}}^{\prime i,j} = \left[ {f_{s,t}^{i,j} } \right]_{{N^{i} \times N^{j} }} \) to capture the relationship between any two superpixels in different images:

$$ f_{s,t}^{i,j} = \left\{ {\begin{array}{*{20}l} {S_{M} \left( {r_{s}^{i} ,r_{t}^{j} } \right)} \hfill & {S_{M} \left( {r_{s}^{i} ,r_{t}^{j} } \right) > \delta } \hfill \\ 0 \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right. $$
(13)

where δ is a matching threshold, set to 0.9 empirically. As above, the degree matrix is \( D_{\mathrm{inter}}^{i,j} = diag\left\{ {d_{1} ,d_{2} , \ldots ,d_{{N^{i} }} } \right\} \), where \( d_{s} = \sum\nolimits_{t} {f_{s,t}^{i,j} } \). The row-normalized impact factor matrix is then:

$$ F_{\mathrm{inter}}^{i,j} = \frac{1}{N - 1} \cdot \left[ {D_{\mathrm{inter}}^{i,j} } \right]^{ - 1} \cdot F_{\mathrm{inter}}^{\prime i,j} $$
(14)
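A minimal sketch of Eqs. (13)-(14): the multi-constraint similarity S_M is thresholded at δ = 0.9, row-normalized, and divided by (N − 1), where N is the number of images in the group.

```python
import numpy as np

def inter_influence(S_M, n_images, delta=0.9):
    """S_M: (N_i, N_j) multi-constraint similarity between two images."""
    F = np.where(S_M > delta, S_M, 0.0)                       # Eq. (13)
    row_sum = np.maximum(F.sum(axis=1, keepdims=True), 1e-12)
    return F / row_sum / (n_images - 1)                       # Eq. (14)
```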

The overall framework of the proposed method is summarized in Table 1.

Table 1. The procedure of our method.

3 Experiment

In this section, we evaluate the proposed RGBD co-saliency framework on two RGBD co-saliency datasets. Qualitative and quantitative comparisons with other state-of-the-art methods are presented.

3.1 Experimental Settings

Two RGBD benchmarks, the RGBD Coseg183 dataset [27] and the RGBD Cosal150 dataset [18], are used to evaluate our method. The RGBD Coseg183 dataset is composed of 183 images distributed in 16 groups, and the RGBD Cosal150 dataset contains 150 images distributed in 21 image sets.

We adopt two quantitative criteria to evaluate the co-saliency maps: the Precision-Recall (PR) curve and the F-measure score. The precision and recall scores are computed with respect to the ground truth. The F-measure [28] is defined as the weighted harmonic mean of precision P and recall R:

$$ {\text{F-measure}} = \frac{{\left( {1 + \beta^{2} } \right) \times P \times R}}{{\beta^{2} \times P + R}} $$
(15)

where β2 is set to 0.3 to weight precision more than recall.
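A small sketch of the F-measure of Eq. (15) with β2 = 0.3, computed from a binarized saliency map and the ground-truth mask; the binarization threshold is an assumption (the PR curve sweeps it over the full saliency range).

```python
import numpy as np

def f_measure(sal_map, gt_mask, thresh=0.5, beta2=0.3):
    pred = sal_map >= thresh
    tp = np.logical_and(pred, gt_mask).sum()
    precision = tp / (pred.sum() + 1e-12)
    recall = tp / (gt_mask.sum() + 1e-12)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-12)
```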

In this method, the number of superpixels in each image is set to 200, and the maximum number of iterations M is set to 20. The parameters κ1 and κ2 in Eq. (10) are set to 0.3 and 0.5, respectively.

3.2 Comparison with State-of-the-Art Methods

In this section, we compare our method with 10 state-of-the-art methods: RC [20], DCLC [21], RRWR [22], BSCA [23], SE [33], FP [34], CCS [4], EMR [13], AIF [18], and MCLP [19]. The first four are single-image saliency methods and also serve as the inputs of our method. SE and FP are classic RGBD single-image saliency algorithms. CCS and EMR are co-saliency methods for RGB images. The last two are the latest co-saliency methods for RGBD images.

Some visual examples on the two datasets are shown in Fig. 3. The quantitative comparison results, including the PR curves and F-measure scores, are reported in Fig. 4. As can be seen, on the RGBD Cosal150 dataset, the PR curve of the proposed method intersects with those of SE, FP, AIF, and MCLP, and its F-measure score is only slightly lower than that of MCLP. The RGBD Coseg183 dataset is more difficult and challenging for co-saliency detection; nevertheless, the proposed method achieves the highest precision among all PR curves, and its F-measure is again only slightly lower than that of MCLP.

Fig. 3. Visual comparison of different saliency and co-saliency detection methods on the two datasets.

Fig. 4. Quantitative comparisons between the proposed method and the state-of-the-art methods on two datasets. "OURS" denotes the proposed method. (a) and (b) are the PR curves and F-measure scores on the RGBD Cosal150 dataset, respectively. (c) and (d) are the PR curves and F-measure scores on the RGBD Coseg183 dataset, respectively.

Table 2 shows the average run time per image of our model compared with the other two RGBD co-saliency detection methods (AIF, MCLP). The measurements were taken on an Intel(R) Core(TM) i5-4570 CPU 3.20 GHz workstation with 8 GB RAM under the Matlab R2012a platform. As the table shows, our algorithm is faster than both AIF and MCLP.

Table 2. Comparison of average run time to process one image between different algorithms.

4 Conclusion

In this paper, we present a co-saliency detection model for RGBD images, which utilizes multi-constraint cues to capture the relationship among multiple images for superpixel matching. Furthermore, impact factor matrices are constructed for intra-image and inter-image propagation, and the depth cue and the high-dimensional semantic cue are incorporated when constructing the intra-image impact factor matrix. Finally, a modified 2-layer co-cellular automata model is used to update the initial saliency maps. Comprehensive comparisons and discussion on two RGBD co-saliency datasets demonstrate that the proposed method outperforms other state-of-the-art saliency and co-saliency models.