1 Introduction

Visual saliency estimation simulates the human visual system to detect the conspicuous content in an image, and many excellent saliency models have been proposed [14, 15, 18, 22, 28]. Owing to its useful properties, visual saliency has been widely used in many fields, such as abstract extraction, classification, compression, and monitoring [4]. In recent years, many researchers have devoted themselves to introducing visual saliency into image retrieval to improve search accuracy. Content-Based Image Retrieval (CBIR) aims at effectively and efficiently finding the needed images in a large-scale image database, and it has made great progress in the past two decades. Generally speaking, most existing image retrieval methods attempt to improve retrieval performance in one of three ways: (1) constructing discriminative image features [6, 12, 26]; (2) designing good similarity estimation schemes [19, 30]; (3) handling large-scale issues [2, 5, 13, 21, 31].

Since the user’s query attention is closely related to specific regions in the query image, visual saliency is considered useful information for boosting image retrieval accuracy. For example, Acharya et al. [1] directly employed the Itti saliency model [11] to generate a saliency map and then extracted feature vectors from the saliency map for image retrieval. Similarly, some researchers [3] exploited a saliency model based on image segmentation and color histograms to extract a saliency map and then extracted image features from the map. Papushoy et al. [23] employed the GBVS visual attention model [10] to extract a saliency map and introduced the salient information into a region-level image retrieval system. Similar approaches are also reported in [9, 27]. In these works, the salient information is involved by weighting different regions according to their perceived saliency. In [20, 25], a histogram of the saliency map is extracted as a separate image feature and integrated into the original similarity measure of the image retrieval system. Although these approaches combine image retrieval systems with visual saliency models, no comprehensive and systematic study has been made of the effect of different saliency models on image retrieval in a qualitative and quantitative manner.

To concretely investigate how different visual saliency models affect image retrieval, we conduct extensive experiments based on nine popular saliency models. To cooperatively exploit the complementary information from different models, we also propose a novel approach that effectively involves visual saliency in image retrieval systems through a learning process.

Our main contributions can be summarized as follows:

  1. Extensive Experimental Studies: Through experimental studies based on nine classic saliency models, we explicitly evaluate the effect of visual saliency on image retrieval and identify effective ways of involving the salient information, which will be beneficial to many computer vision problems.

  2. A Novel Learning-based Saliency Involving Approach: We propose a novel learning-based approach to optimally involve visual saliency in image retrieval systems. From the learning perspective, we provide a new framework for optimally involving salient information.

2 Extensive Experimental Studies

In this section, we conduct extensive experiments based on nine popular saliency models to discover the relationship between visual saliency and image retrieval. More specifically, a popular image retrieval framework based on local image features is employed to implement the baseline image retrieval systems. Under this framework, two schemes are designed to involve salient information in the image retrieval process, and the optimal combination of saliency model and involving scheme is identified experimentally.

2.1 The Image Retrieval Framework

We employ the popular BoW-based image retrieval framework. As a local-feature-based framework, it not only balances effectiveness and efficiency well but also provides more flexibility for involving salient information. Fig. 1 sketches the key steps of the framework. In particular, a visual codebook is first constructed with a clustering method [12, 26]. Based on the visual codebook, an orderless collection of visual words is built for each image by replacing the image’s local features with their nearest visual words. After each visual word in the image database is inserted into the inverted table together with its image ID, retrieval can be performed for a given query image. We call this scheme the BoW model. If a Hamming embedding code, which encodes the quantization error between a local feature and its visual word, is inserted into the inverted table along with the image ID, we call it the BoW+HE scheme. Both schemes are used in our experimental studies. Based on this framework, two saliency involving schemes are employed to introduce salient information into the image retrieval process.

Fig. 1. The overall framework. To build the inverted indexing structure, each key point in the database images is first mapped to its nearest visual word, and then its image ID, without (BoW) or with (BoW+Embedding) an embedding code, is inserted into the list corresponding to that visual word. In the online query stage, each key point in the query image is also mapped to its nearest visual word, and the items in the corresponding list are returned as matches. If embedding codes are employed, the returned list is further refined, and only the top n items whose codes are closest to that of the query key point are returned as matches.
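The indexing and query steps of the plain BoW scheme can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 2-D descriptors, and the simple vote-counting score are assumptions made for illustration, and the Hamming embedding refinement is omitted.

```python
from collections import defaultdict

def nearest_word(feature, codebook):
    """Return the index of the visual word closest to a local feature (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda w: sum((f - c) ** 2 for f, c in zip(feature, codebook[w])),
    )

def build_inverted_index(database, codebook):
    """Map each visual word to the image IDs of the key points quantized to it."""
    index = defaultdict(list)
    for image_id, features in database.items():
        for feature in features:
            index[nearest_word(feature, codebook)].append(image_id)
    return index

def query(features, index, codebook):
    """Rank database images by their number of key-point matches (assumed voting score)."""
    votes = defaultdict(int)
    for feature in features:
        for image_id in index.get(nearest_word(feature, codebook), []):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy 2-D "descriptors" and a 3-word codebook (hypothetical values).
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
database = {
    "img_a": [(0.1, 0.0), (0.9, 1.1)],
    "img_b": [(5.2, 4.9), (4.8, 5.1)],
}
index = build_inverted_index(database, codebook)
ranking = query([(0.0, 0.2), (1.1, 0.9)], index, codebook)
```

In this toy case both query key points fall on visual words that only `img_a` occupies, so `img_a` is the only image returned.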

2.2 Dataset and Saliency Involving Schemes

INRIA Holidays Dataset [12]: This is a commonly used benchmark in image retrieval, and it contains 500 image groups, each showing a different scene or object. Each group has several images, and the total number of images across all groups is 1491. For evaluation, the first image in each group is treated as a query, which yields a query set of 500 images. The other 991 images are treated as database images.

As indicated in Fig. 1, there are two ways to introduce visual saliency information into the local-feature-based image retrieval framework. They are listed as follows:

  • Saliency Region Representation Embedding (SRE): A 4-dimensional vector is extracted for each image from the whole saliency map. In particular, the first component is the average H-channel value over all salient points, and the remaining three components correspond to the S channel, the I channel, and the sum of the three channels, respectively. All key points in an image are associated with the same vector. Given any key point in the query image, all of its similar items returned from the inverted table are sorted in ascending order of the distance between the 4D vector of the query key point and the 4D vectors of the database key points. Only the top n items are selected and assigned a large weight.

  • Saliency Region Representation Re-ranking (SRR): Given a query image, the returned database images are re-ranked by the distances between the 4D vector of the query image and the 4D vectors of the database images. Unlike SRE, SRR is a global saliency involving scheme.
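The 4D saliency vector and the SRR re-ranking step can be sketched as follows. Several details are assumptions, since the text does not fix them: salient points are taken to be pixels whose saliency exceeds a hypothetical `threshold`, the distance is Euclidean, and re-ranking sorts by that distance alone.

```python
import math

def saliency_vector(hsi_pixels, saliency, threshold=0.5):
    """Average H, S, I and H+S+I over pixels whose saliency exceeds the threshold."""
    salient = [p for p, s in zip(hsi_pixels, saliency) if s > threshold]
    if not salient:
        return (0.0, 0.0, 0.0, 0.0)
    n = len(salient)
    h = sum(p[0] for p in salient) / n
    s = sum(p[1] for p in salient) / n
    i = sum(p[2] for p in salient) / n
    # The mean of the per-pixel channel sums equals the sum of the channel means.
    return (h, s, i, h + s + i)

def srr_rerank(query_vec, candidates):
    """Re-rank (image_id, vector) pairs by ascending distance to the query vector."""
    return sorted(candidates, key=lambda item: math.dist(query_vec, item[1]))

# Toy image: three HSI pixels, the first two of which are salient.
pixels = [(0.2, 0.4, 0.6), (0.4, 0.6, 0.8), (0.9, 0.9, 0.9)]
saliency = [0.9, 0.8, 0.1]
query_vec = saliency_vector(pixels, saliency)

candidates = [
    ("img_b", (0.9, 0.9, 0.9, 2.7)),
    ("img_a", (0.3, 0.5, 0.7, 1.5)),
]
reranked = srr_rerank(query_vec, candidates)
```

SRE would use the same vector, but it would filter the matches of each query key point inside the inverted table instead of reordering the final image list.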

2.3 A Quantitative Study on Saliency Involving Schemes

As discussed above, different saliency models produce quite different saliency maps. To evaluate their effectiveness in improving image retrieval quality, we conduct extensive experiments by individually employing 9 state-of-the-art saliency models (i.e., AWS [8], BMS [29], GBVS [10], HFT [17], ITTI [11], LDS [7], RARE [24], SP [16], SSD [15]). The parameters in all experiments are set to their optimal values. The experimental results are illustrated in Fig. 2. The first row shows the results of the BoW model with different combinations of saliency models and involving methods, and the second row shows those of the BoW+Embedding method. To clearly show the effect of the different saliency models, the cases without any salient information (BL) are also illustrated as baselines.

For the BoW model, whichever saliency involving method we adopt, almost all the saliency models outperform the baseline system. This suggests that the saliency maps extracted by existing models do reflect true human saliency to some extent. Among these saliency models, LDS achieves the best performance under either saliency involving scheme. The highest MAP, 0.540, is achieved by combining LDS with the SRE scheme. It is higher than the baseline (0.456) by 8.4 percentage points.

For the BoW+HE model, the SRE involving scheme does not provide a positive effect on the baseline system. That is, the BoW+HE model depends heavily on the saliency involving scheme. When Hamming embedding codes are introduced into the image retrieval system, the search accuracy improves significantly, as seen by comparing the BoW baseline (0.456) with the BoW+HE baseline (0.667). In this situation, saliency maps that are not accurate enough will markedly degrade the positive effect of the Hamming embedding codes. According to our experiments, only the SRR scheme plays a positive role in boosting image retrieval. Among the saliency models, SP performs best. The highest MAP value is 0.697, which is higher than the baseline (0.667) by 3 percentage points. The LDS model still works well, achieving nearly the same performance as SP.

As the 4 subfigures in Fig. 2 show, the SRR saliency involving approach provides stable and consistent improvement for all saliency models in both image retrieval methods. That is, the choice of saliency involving approach plays an important role when introducing salient information into the image retrieval framework.
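The MAP values quoted in this section are mean average precision scores over the query set. The sketch below assumes the standard, non-interpolated definition, which the text does not spell out; the official evaluation script for the dataset may differ in detail. The toy rankings are hypothetical, and the last two lines only reproduce the percentage-point gains stated above.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values measured at the rank of each relevant image."""
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(results):
    """Average the per-query AP over (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in results) / len(results)

# Two hypothetical queries with their ranked lists and ground-truth sets.
results = [
    (["a", "x", "b"], {"a", "b"}),  # AP = (1/1 + 2/3) / 2
    (["x", "c"], {"c"}),            # AP = (1/2) / 1
]
toy_map = mean_average_precision(results)

# Gains quoted in the text, expressed in percentage points.
bow_gain = (0.540 - 0.456) * 100     # LDS + SRE over the BoW baseline
bow_he_gain = (0.697 - 0.667) * 100  # SP + SRR over the BoW+HE baseline
```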

Fig. 2. Evaluation of various combinations of saliency models and saliency involving schemes in two image retrieval methods. The green lines in the first row denote the performance of the BoW baseline system, whose MAP is 0.456. The green lines in the second row denote the performance of the BoW+HE baseline system, whose MAP is 0.667. The yellow lines denote the saliency involving schemes combined with manually labeled salient information. (Color figure online)

3 Learning-Based Saliency Involving Approach

In a real-world image retrieval scenario, it is impossible to manually obtain saliency maps for a large-scale image database. Therefore, most existing methods directly employ a single saliency model to extract a saliency map that approximates the human visual system. However, different saliency models produce quite different saliency maps, which can be treated as different approximations of the true saliency map. In addition, the performance of the saliency involving schemes also varies across image retrieval schemes. If we can find a saliency involving approach that optimally combines the complementary strengths of the various saliency models and involving schemes, we can further improve the retrieval performance of the original search engine. To this end, we propose a novel learning-based saliency involving approach. Figure 3 illustrates the key idea of the proposed approach.

Fig. 3. Illustration of the proposed learning-based saliency involving approach. It is divided into two parts, i.e., online involving and offline learning. In the offline learning stage, the similarity scores of image pairs in the training set are calculated with all possible combinations of saliency models and involving schemes. They are then used together with the pair labels to train an optimal involving model. In the online stage, all scores between the query image and each database image are first estimated and then fed to the involving model to obtain a final score.

Formally, we suppose that there are N different saliency models \(M=\{M_{1},M_{2},\ldots ,M_{N}\}\) and K saliency involving methods \(\varPhi =\{\varPhi _{1},\varPhi _{2},\ldots ,\varPhi _{K}\}\). Given an image pair \((x_i,x_j)\), we can get their similarity score matrix \(\mathcal {X}_{i j}\) by employing different combinations of saliency models and saliency involving methods, which can be formulated as follows:

$$\begin{aligned} f_{KN}(x_{i},x_{j})=\mathcal {X}_{i j}= \begin{bmatrix} x^{11}_{ij}&\cdots &x^{1N}_{ij}\\ \vdots &x^{kn}_{ij}&\vdots \\ x^{K1}_{ij}&\cdots &x^{KN}_{ij} \end{bmatrix} \end{aligned}$$
(1)

where \(x^{kn}_{ij}\) is the two images’ similarity score obtained by combining the \(n^{th}\) saliency model and the \(k^{th}\) saliency involving method.

Our aim is to learn a weight matrix W, which can optimally involve the salient information from all combinations and provide a more reliable similarity estimation between two images. Toward this end, we propose a new similarity estimation function, which is formulated as follows:

$$\begin{aligned} \mathcal {F}(\mathcal {X}_{i j},W)&=\mathrm {tr}(W^{T}\mathcal {X}_{i j}), \\ W&=\begin{bmatrix} w^{11}&\cdots &w^{1N}\\ \vdots &w^{kn}&\vdots \\ w^{K1}&\cdots &w^{KN} \end{bmatrix} \end{aligned}$$
(2)

where \(\mathcal {F}(\mathcal {X}_{i j},W)\) can be treated as the new similarity score between two images \(x_i\) and \(x_j\), and \(w^{kn}\) is the weight of similarity score \(x^{kn}_{ij}\).
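Because W and the score matrix are both K x N, the trace in Eq. (2) is simply the weighted sum of all KN scores. The short check below illustrates this with hypothetical values for K = 2 involving methods and N = 3 saliency models; the function names are ours.

```python
def trace_similarity(W, X):
    """tr(W^T X) for two K x N matrices given as lists of rows."""
    K, N = len(W), len(W[0])
    # The n-th diagonal entry of W^T X is sum_k W[k][n] * X[k][n].
    return sum(sum(W[k][n] * X[k][n] for k in range(K)) for n in range(N))

def weighted_sum(W, X):
    """Sum over all (k, n) of w^{kn} * x^{kn}."""
    return sum(w * x for w_row, x_row in zip(W, X) for w, x in zip(w_row, x_row))

# Hypothetical weights and similarity scores.
W = [[0.5, 0.0, 0.25],
     [0.0, 0.25, 0.0]]
X = [[0.8, 0.6, 0.4],
     [0.7, 0.2, 0.9]]

fused = trace_similarity(W, X)  # 0.5*0.8 + 0.25*0.4 + 0.25*0.2
```

A zero weight therefore removes the corresponding combination of saliency model and involving method from the fused score.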

To learn the weight matrix W, we must construct a training set of triplets \(\{(x_i,x_j;y^{ij})\}_{i,j=1}^{L}\), where \(y^{ij}\) is one if images \(x_i\) and \(x_j\) are truly relevant and zero otherwise. Our approach fits the score \(\mathcal {F}(\mathcal {X}_{i j},W)\) to the relevance label \(y^{ij}\), which can be formulated as minimizing the following approximation error:

$$\begin{aligned} \mathcal {L}(x_i,x_j;y^{ij}) = \Vert \mathcal {F}(\mathcal {X}_{i j},W) - y^{ij}\Vert _{2}^{2} \end{aligned}$$
(3)

The overall objective function is defined as follows:

$$\begin{aligned} \min \limits _{W}&\sum \limits _{i=1}^{L}\sum \limits _{j=1}^{L}{\mathcal {L}(x_i,x_j;y^{ij})} +\lambda \Vert W \Vert _{F}^{2} \end{aligned}$$
(4)

where L is the number of training images and \(\lambda \) is the parameter controlling the regularization term.
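Eq. (4) is a regularized least-squares problem in the entries of W. The text does not say which solver is used, so the sketch below is only one possible choice: plain gradient descent on the flattened weights. The toy score matrices (K = 2, N = 2), the learning rate, the step count, and all function names are assumptions.

```python
def flatten(matrix):
    """Turn a K x N score matrix into a vector of length K*N."""
    return [v for row in matrix for v in row]

def score(w, x):
    """Fused similarity: equals tr(W^T X) for flattened W and X."""
    return sum(wi * xi for wi, xi in zip(w, x))

def objective(w, samples, lam):
    """Sum of squared errors plus lam times the squared Frobenius norm of W."""
    fit = sum((score(w, x) - y) ** 2 for x, y in samples)
    return fit + lam * sum(wi * wi for wi in w)

def learn_weights(samples, lam=0.1, lr=0.02, steps=5000):
    """Minimize the objective by gradient descent (assumed solver)."""
    dim = len(samples[0][0])
    w = [0.0] * dim
    for _ in range(steps):
        grad = [2.0 * lam * wi for wi in w]
        for x, y in samples:
            err = score(w, x) - y
            for d in range(dim):
                grad[d] += 2.0 * err * x[d]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# Hypothetical training pairs: (score matrix, relevance label).
pairs = [
    ([[0.9, 0.8], [0.7, 0.9]], 1),
    ([[0.8, 0.9], [0.9, 0.6]], 1),
    ([[0.2, 0.1], [0.3, 0.2]], 0),
    ([[0.1, 0.3], [0.2, 0.1]], 0),
]
samples = [(flatten(X), y) for X, y in pairs]
w = learn_weights(samples)

# Online stage: fuse the scores of each pair with the learned weights.
fused_scores = [score(w, x) for x, _ in samples]
```

On this toy data the learned weights give the relevant pairs higher fused scores than the irrelevant ones, which is the behavior the online stage relies on when ranking database images.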

4 Experiments

4.1 Experimental Setup

The INRIA Holidays dataset is employed for evaluation. For each image, 9 state-of-the-art saliency models are employed individually to extract saliency maps. In addition, the 2 saliency involving methods described above are used to evaluate the performance of different combinations. To further show the scalability of the proposed learning-based saliency involving approach, a large-scale distractor image dataset, i.e., the Flickr1M dataset, is employed in our large-scale image retrieval experiments.

4.2 Evaluation on Learning-Based Saliency Involving Scheme

To obtain an optimal combination of the various saliency models and involving approaches, we propose a learning-based scheme. In this section, we conduct experiments to evaluate its effectiveness. To simplify the experiments, the saliency involving approach is fixed to the re-ranking scheme (i.e., SRR) because of its stable performance. In addition, we only employ the BoW+HE framework, since boosting its performance is more challenging.

Table 1. Performance of 9 saliency models and the learning-based scheme

The experimental results are shown in Table 1. Clearly, the proposed learning-based scheme outperforms the best of the traditional methods. A possible reason is that the optimization process generates a more complete saliency map than the GND saliency map. This suggests that the learning-based scheme can even compensate for labeling errors made by human labelers.

4.3 Large-Scale Image Retrieval Experiments

To evaluate the scalability of the proposed learning-based scheme, we conduct experiments on a large-scale image database. In these experiments, we only employ the BoW+HE framework, and all parameters are set to their optimal values. The experimental results are shown in Table 2. As expected, after the salient information is introduced with the proposed scheme, the final retrieval accuracy is remarkably improved compared with the baseline system. This indicates that the proposed scheme works well on large-scale image retrieval tasks.

Table 2. Performance on large-scale image retrieval scenario

5 Conclusion

In this paper, we make a comprehensive and systematic study of the essential relation between image retrieval and visual saliency. Specifically, we explicitly quantify the effect of visual saliency on image retrieval. The key finding is that salient information indeed has a positive effect on image retrieval and that the manner of introducing it plays an important role in boosting performance. Based on this finding, we propose a novel approach that effectively involves visual saliency in image retrieval systems through a learning process. Extensive experiments on a widely used image benchmark demonstrate that the new image retrieval system remarkably outperforms the original one and that the learning-based saliency involving approach is also better than the traditional ones. In addition, large-scale experiments show the good scalability of the proposed approach.