
1 Introduction

The saliency detection task aims to detect the objects in an image that are visually salient to a human observer [1]. When a person looks at a scene, they tend to focus on relevant information and reject redundant information [2]. This is useful for many computer vision tasks, for example visual classification [3], image retrieval [4], and person re-identification [5]. In general, there are two approaches to searching for salient objects in images: bottom-up and top-down. The baseline in the bottom-up approach is to take the features that are less frequent in the scene and consider them salient regions. The top-down approach looks for salient regions using prior knowledge about the objects.

In recent years, different saliency detection methods have been developed. Erdem and Erdem [6] use a region covariance descriptor with color, orientation and spatial features, extracted at the patch level and compared among patches. Hu et al. [7] proposed a method where local and global features are combined in a final map and all pixels are weighted according to their distance to the center of the image. Liu and Hu [8] combine maps obtained with the Quaternion Fast Fourier Transform (QFFT) in search of the optimum. Yu et al. [9] proposed using the global contrast of color to obtain a saliency map, grouping the background pixels under the assumption that they are similar. Wang et al. [10] based their work on two neural networks that learn local and global features to obtain saliency maps, which are later combined into a single map. Rajankar and Kolekar [11] applied a scale reduction using interpolation of the Fourier coefficients in quaternion space to obtain the saliency map.

Color and contrast are very important features for obtaining saliency maps. However, works using these features do not consider the correlation among them at different levels and with different approaches. To address this, we propose the saliency detection method FqSD where, unlike other works, we link spatial and frequency information using quaternions, preserving the correlation between colors and contrast. The contributions of this paper are summarized as follows. First, local and global features are combined in quaternion space at different scales. Second, a comparative study is carried out to determine the best color space for the proposed method.

This paper is organized as follows. In Sect. 2 we explain our approach. In Sect. 3 experimental results are presented and analyzed. Finally, conclusions are drawn in Sect. 4.

2 Our Approach

The input to the FqSD method is an image represented in any color space. In step 2, a Gaussian pyramid reduction is applied to obtain several images at different resolutions, eliminating much of the less important information. Then, in step 3, each image is processed to build saliency maps using local and global approaches in the spatial and frequency domains, based on full-quaternions that combine contrast values and color channels. The per-level images are then merged into a single image per method. As there are three saliency maps, another merging step is required (step 4), which is done by means of a weighted sum of the maps. In step 5, two functions (center-bias and refinement) are applied to finally obtain the salient object (output).
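The overall flow can be summarized in the following minimal Python sketch. All helper names are ours and purely illustrative; each one is sketched in the corresponding subsection below.

```python
# Illustrative sketch of the FqSD pipeline (steps 2-5); the helper
# functions are hypothetical names, detailed in the subsections below.
def fqsd(image):
    levels = gaussian_pyramid(image, n_levels=5)             # step 2
    quats = [to_full_quaternion(lv) for lv in levels]        # contrast + color channels
    lsm = fuse_levels([local_saliency(q) for q in quats])    # step 3 (local)
    mlbp = fuse_levels([mlbp_saliency(q) for q in quats])    # step 3 (LBP-based)
    qfft = fuse_levels([global_saliency(q) for q in quats])  # step 3 (global)
    single = weighted_fusion(lsm, mlbp, qfft)                # step 4
    return refine(center_bias(single))                       # step 5
```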

Fig. 1. The different steps of the proposed FqSD method.

2.1 Multiple Resolution

Generally, salient objects are invariant when a scale transformation is applied to the image, while the information of non-salient objects, e.g. background information, is lost during the change of resolution. As explained before, we use a Gaussian pyramid by reduction [12] in step 2, obtaining four reduced images in addition to the original one. As the original images have different sizes, they are normalized to a standard size.
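A minimal sketch of this reduction, assuming an H×W×C floating-point image (the exact blur kernel of [12] may differ):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, n_levels=5, sigma=1.0):
    """Gaussian pyramid by reduction: blur, then halve the resolution."""
    levels = [image.astype(np.float64)]
    for _ in range(n_levels - 1):
        blurred = gaussian_filter(levels[-1], sigma=(sigma, sigma, 0))
        levels.append(blurred[::2, ::2])  # keep every other row and column
    return levels
```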

2.2 Local and Global Salient Maps

After obtaining the images in step 2, they are transformed from the original space to full-quaternion space, where the image has four channels. The first channel is the clear-dark contrast effect (for more detail see [13]) and the other channels are the values of a color space, for example Red-Green-Blue (RGB). We develop three approaches to obtain the saliency maps: Local Salient Map (LSM), Module Local Binary Pattern Salient Map (MLBP), and Quaternion Fast Fourier Transform global Salient Map (QFFT).

For a better understanding of the proposed method, we first review several properties of quaternion algebra. A quaternion is a hypercomplex number introduced by Hamilton [14] and denoted by the letter \( \mathcal {H} \). If \( q \in \mathcal {H} \), it is represented as follows:

$$\begin{aligned} \{ q = t + xi + yj + zk |(t,x,y,z)\in \mathcal {R} \} \end{aligned}$$
(1)

Where the complex operators \(\mathbf {i,j,k}\) satisfy the rules \( \{ i^2 = j^2 = k^2 = ijk = -1, ij = k = -ji, ki = j = -ik, jk = i = -kj \} \). It is clear that multiplication between quaternions is not commutative. If \( t = 0 \), q is a pure quaternion \( q = xi + yj + zk \); if \( t \ne 0 \), it is a full-quaternion. The module and the phase are:

$$\begin{aligned} | q | = (t^2 + x^2 + y^2 + z^2)^{1/2} \end{aligned}$$
(2)
$$\begin{aligned} \phi = \tan ^{-1}( \frac{(x^2 + y^2 + z^2)^{1/2}}{t}) \end{aligned}$$
(3)
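These definitions translate directly into code. The following sketch implements Eqs. (2) and (3), plus a Hamilton product that illustrates the non-commutativity noted above (arctan2 is used as a numerically safe form of \(\tan ^{-1}\)):

```python
import numpy as np

def q_module(t, x, y, z):
    # Eq. (2): module of a full-quaternion
    return np.sqrt(t**2 + x**2 + y**2 + z**2)

def q_phase(t, x, y, z):
    # Eq. (3): angle between the scalar part and the vector part
    return np.arctan2(np.sqrt(x**2 + y**2 + z**2), t)

def q_mul(a, b):
    # Hamilton product; note q_mul(a, b) != q_mul(b, a) in general
    t1, x1, y1, z1 = a
    t2, x2, y2, z2 = b
    return (t1*t2 - x1*x2 - y1*y2 - z1*z2,
            t1*x2 + x1*t2 + y1*z2 - z1*y2,
            t1*y2 - x1*z2 + y1*t2 + z1*x2,
            t1*z2 + x1*y2 - y1*x2 + z1*t2)
```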

\(\mathbf {LSM}\): Images are divided into small patches and two feature vectors are obtained in each one. The first vector is associated with each full-quaternion in the patch and its elements are the module and the phase (see Eqs. (2) and (3)). The second feature vector holds the average module and phase of all the full-quaternions in the patch. To obtain a saliency value, the Euclidean distance between the first and second feature vectors is computed in each patch. The full-quaternions with a high value have the highest probability of being different from their neighbors (see Fig. 1, 3.(a)).
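A sketch of this local map, reusing q_module and q_phase from above and assuming an H×W×4 full-quaternion image (the patch size is our choice, not from the paper):

```python
def local_saliency(q_img, patch=16):
    """LSM sketch: per-patch Euclidean distance between each full-quaternion's
    (module, phase) pair and the patch-average pair."""
    t, x, y, z = np.moveaxis(q_img, -1, 0)
    mod, ph = q_module(t, x, y, z), q_phase(t, x, y, z)
    sal = np.zeros_like(mod)
    H, W = mod.shape
    for r in range(0, H, patch):
        for c in range(0, W, patch):
            m = mod[r:r+patch, c:c+patch]
            p = ph[r:r+patch, c:c+patch]
            # distance of every element to the patch-average feature vector
            sal[r:r+patch, c:c+patch] = np.sqrt((m - m.mean())**2 + (p - p.mean())**2)
    return sal
```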

\(\mathbf {MLBP}\): In this approach, each full-quaternion is codified using the Local Binary Pattern (LBP). We extend the LBP to the full-quaternion using the module, because it is sensitive to color changes. See the following equation:

$$\begin{aligned} mLBP_{s_{i}} = \sum _{j = 0}^{p-1} h(s_{j} - s_{i})2^{j}, \quad h(r) = \left\{ \begin{array}{ll} 0 &{} \text {if }r\ge 0\\ 1 &{} \text {if }r< 0 \end{array} \right. \end{aligned}$$
(4)

Where S is a 3 \(\times \) 3 window with \( p \) neighbors, and \(s_{i}\) and \(s_{j}\) are the modules of the analyzed full-quaternion and of its neighbors, respectively. Here, the saliency map is obtained as in the LSM method, but using the values of the modules (see Fig. 1, 3.(c)).
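A sketch of Eq. (4) on the module image (np.roll wraps around at the borders; a real implementation would handle them explicitly); the saliency map is then obtained as in LSM on these codes:

```python
def mlbp_code(mod):
    """Eq. (4) sketch: 8-neighbor binary code over full-quaternion modules;
    h(r) = 0 for r >= 0 and 1 for r < 0, as in the text."""
    H, W = mod.shape
    code = np.zeros((H, W), dtype=np.int32)
    # the eight neighbor offsets of the 3x3 window S
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for j, (dr, dc) in enumerate(offsets):
        neigh = np.roll(np.roll(mod, dr, axis=0), dc, axis=1)  # s_j
        code += ((neigh - mod) < 0).astype(np.int32) << j      # h(s_j - s_i) 2^j
    return code
```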

\(\mathbf {QFFT}\): The Quaternion Fast Fourier Transform is used to build the saliency map in a global way [8]. The spectral module is modified using a low-pass filter, and a stable color region is obtained with the inverse Fourier transform as follows.

$$\begin{aligned} F(p,s)= S \sum _{m = 0}^{M-1} \sum _{n = 0}^{N-1} e^{-\mu 2\pi (\frac{pm}{M}+\frac{sn}{N})}f(m,n) \end{aligned}$$
(5)
$$\begin{aligned} f(m,n)=\Big | S \sum _{p = 0}^{M-1} \sum _{s = 0}^{N-1} e^{\mu 2\pi (\frac{pm}{M}+\frac{sn}{N})}\exp (\varUpsilon +(\varLambda (p,s)\circ \varGamma ))\Big |^2 \end{aligned}$$
(6)
$$\begin{aligned} \varLambda = \frac{V_{q}}{|V_{q}|} , \quad \varLambda \in F(p,s) \end{aligned}$$
(7)

Where \( S = \sqrt{\frac{1}{MN}} \), Eqs. (5) and (6) are the direct and inverse Quaternion Fourier Transforms, \(\mu \) is a unit pure quaternion (its module is equal to 1), p and s are frequency coefficients, and m and n are spatial coordinates of the image. \(\varLambda \) and \(\varGamma \) are the eigenaxis and phase, \( \varUpsilon \) is the spectral module modified by the filter, and \(\circ \) is the Hadamard product (see Fig. 1, 3.(b)).
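A faithful implementation requires a quaternion FFT (typically realized as a symplectic decomposition into two complex FFTs); the following simplified sketch instead transforms each quaternion channel separately and low-pass filters its spectral module, which conveys the idea but is not the exact transform of Eqs. (5)-(7):

```python
from scipy.ndimage import gaussian_filter

def global_saliency(q_img, sigma=8):
    """Simplified QFFT-style map: smooth the amplitude spectrum of each
    channel (low-pass on the spectral module), keep the phase, invert."""
    sal = np.zeros(q_img.shape[:2])
    for ch in np.moveaxis(q_img, -1, 0):
        F = np.fft.fft2(ch)
        amp, phase = np.abs(F), np.angle(F)
        amp = gaussian_filter(amp, sigma)  # low-pass filter the module
        sal += np.abs(np.fft.ifft2(amp * np.exp(1j * phase)))**2
    return sal
```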

2.3 Image Fusion

To produce the saliency map of step 3, it is necessary to fuse the images obtained at each level of the Gaussian pyramid after processing them as explained in Sect. 2.2, so that a single image is obtained per method. We designed Eq. (8) taking into account that level 2 tends to keep high values at salient points, whereas at the other levels the values are high or low depending on the scene.

$$\begin{aligned} SM = max(Map_{0},Map_{1},Map_{3},Map_{4}) + Map_{2} \end{aligned}$$
(8)

Where SM may be LSM, mLBP-SM or QFFT-SM, and \( Map_{\{\cdot \}} \) is the map obtained at each level. The next step is to fuse the three saliency maps obtained previously, assigning a weight to each map as follows:

$$\begin{aligned} MSsingle = \alpha (\psi *LSM) + \beta log((\psi *MLBP)+ 1) + \delta (\psi *QFFT) \end{aligned}$$
(9)

Where \(\psi \) is a radial filter, \(\{\alpha ,\beta ,\delta \} \) are the weighting parameters of the maps, and \(\{*\}\) denotes the convolution product (see Fig. 1, image 4).
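Eqs. (8) and (9) in sketch form; zoom brings all levels to a common size (shapes may differ by a pixel after rounding), and the radial filter \(\psi \) is approximated here by a Gaussian blur:

```python
from scipy.ndimage import gaussian_filter, zoom

def fuse_levels(maps):
    """Eq. (8): pointwise max of levels 0, 1, 3, 4 plus level 2."""
    base = np.array(maps[0].shape, dtype=float)
    up = [zoom(m, base / np.array(m.shape)) for m in maps]  # common size
    return np.maximum.reduce([up[0], up[1], up[3], up[4]]) + up[2]

def weighted_fusion(lsm, mlbp, qfft, alpha=1.4, beta=1.3, delta=0.3):
    """Eq. (9); the weights default to the best configuration of Sect. 3."""
    f = lambda m: gaussian_filter(m, 3)  # stand-in for the radial filter psi
    return alpha * f(lsm) + beta * np.log(f(mlbp) + 1) + delta * f(qfft)
```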

2.4 Refinement

A refinement step is needed to improve the saliency map obtained in the previous step. Center-bias is used here to weight each element of the saliency map [15]. The theory of center-bias is based on the way images are captured: in general, the salient object is located near the center of the image. Hence, Eq. (10) weights the values according to the distance of each pixel to the center of the image.

$$\begin{aligned} MS_{cb}(m,n) = MSsingle(m,n)(1-d) , \quad d = \sqrt{ (m- \upsilon )^{2} + (n- \rho )^{2} }/(\upsilon ^{2} + \rho ^{2}) \end{aligned}$$
(10)

Where \(\upsilon \) and \(\rho \) are the center coordinates of the image.
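A sketch of Eq. (10), keeping the normalization by \(\upsilon ^{2} + \rho ^{2}\) exactly as written:

```python
def center_bias(ms):
    """Eq. (10): down-weight pixels by their distance to the image center."""
    H, W = ms.shape
    v, rho = H / 2.0, W / 2.0  # center coordinates
    m, n = np.mgrid[0:H, 0:W]
    d = np.sqrt((m - v)**2 + (n - rho)**2) / (v**2 + rho**2)
    return ms * (1 - d)
```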

Generally speaking, in practical tasks we only need the values associated with salient objects. To this end, unlike other works, an adaptive threshold is applied to eliminate values that are far from the objects of interest, as follows.

$$\begin{aligned} MS_{threshold} = \left\{ \begin{array}{ll} 0 &{} \text {if }\omega \le r\\ \omega &{} \text {if }\omega > r \end{array} \right. \end{aligned}$$
(11)

Where \( \omega \) is the value of \(MS_{cb}(m,n)\) and r is the threshold (the sum of the average and the standard deviation of \(MS_{cb}\)).

Another interesting property of saliency maps is that a salient object can have different parts with different probabilities of being observed. Therefore, a final refinement is applied as follows:

$$\begin{aligned} MS_{final} = log(|MS_{threshold}| + 1) \end{aligned}$$
(12)
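Eqs. (11) and (12) in sketch form:

```python
def refine(ms):
    """Eqs. (11)-(12): adaptive threshold (mean + std), then log compression."""
    r = ms.mean() + ms.std()                 # adaptive threshold
    thresholded = np.where(ms > r, ms, 0.0)  # keep only values above r
    return np.log(np.abs(thresholded) + 1)
```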

3 Experimental Results

Our aim is to validate the performance of the proposed method using different color spaces and to compare it with other state-of-the-art methods. The datasets used are ECSSD-1000 and DUT-OMRON. ECSSD-1000 [16] has 1000 images (each with its saliency mask) containing from 1 to n salient objects; the salient objects were labeled by five persons to obtain the mask or ground truth (GT). DUT-OMRON [17] contains 5168 images with complex scenes and their respective masks. We performed experiments using 4 color spaces: RGB, HSV (Hue-Saturation-Value), Lab (lightness and the green-red and blue-yellow color opponents) and YCbCr (Y is the luma component, Cb and Cr are the blue-difference and red-difference chroma components). Our best parameter configuration is: \( \alpha = 1.4\), \(\beta = 1.3 \), \(\delta = 0.3 \), \(\mu = (i,j,k)/\sqrt{3}\). As evaluation metric we used the Mean Absolute Error (MAE), see Eq. (13). The Wilcoxon signed-rank test is used to assess the statistical significance of the differences among the results obtained in the different color spaces. In this test, a value of 1 means significance (the results are not casual); a value of 0 means the results are doubtful.

$$\begin{aligned} MAE = \frac{1}{MN}\sum _{n = 1}^{N} \sum _{m = 1}^{M} |MS_{final}(m,n) - GT(m,n)| \end{aligned}$$
(13)
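Eq. (13) in code, assuming both maps are normalized to [0, 1]:

```python
def mae(ms_final, gt):
    """Eq. (13): mean absolute error between the saliency map and the GT."""
    return np.abs(ms_final - gt).mean()
```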

We can observe in Table 1 that our method using the HSV color space achieves the best results on both datasets (see Fig. 2). However, the Wilcoxon signed-rank test shows a difference between the datasets. For the images of ECSSD-1000 represented in color spaces other than HSV there is statistical significance (see Table 2), and the results are better than the state-of-the-art algorithms. On the other hand, the results on the DUT-OMRON dataset give a zero value in the Wilcoxon signed-rank test. This result is associated with a characteristic of the dataset: the patterns in different images are repeated with high frequency throughout the dataset (the variance among image data is small).

The advantage of the HSV color space over Lab, YCbCr and RGB is that there is a correlation among the different features represented by the full-quaternions, where the four features (Hue, Saturation, Value, Clear-Dark Contrast) are combined linearly and processed as a single element. Moreover, the combination of local and global features allows highlighting regions of interest that could be ignored by a purely local or purely global analysis. Center-bias and refinement act as adjustment functions that better delineate the contours of the salient objects.

Table 1. MAE of FqSD compared with other methods.
Table 2. Statistical significance of the different color spaces vs. HSV.
Fig. 2. Salient objects obtained with the proposed method FqSD(HSV) on the ECSSD-1000 and DUT-OMRON datasets. From top to bottom, the rows show the original image, the ground truth, and the salient objects, respectively.

4 Conclusions

Experimental results show good performance using the HSV color space represented by means of full-quaternions. The integration of local and global saliency maps to look for features that are less frequent in images improves the results as measured by the Mean Absolute Error. The Wilcoxon signed-rank test showed that little variety among the images of a dataset can yield untrustworthy results. In future work, we plan to develop a deep learning method with neural networks based on full-quaternions, so that the parameters can be learned in the face of the complexity of the different scenes that appear in the real world.