Abstract
The current models of image representation based on Convolutional Neural Networks (CNN) have shown tremendous performance in image retrieval. Such models are inspired by the information flow along the visual pathway in the human visual cortex. We propose that in the field of particular object retrieval, the process of extracting CNN representations from query images with a given region of interest (ROI) can also be modelled by taking inspiration from human vision. In particular, we show that making the CNN pay attention to the ROI while extracting the query image representation leads to significant improvement over the baseline methods on the challenging Oxford5k and Paris6k datasets. Furthermore, we propose an extension to a recently introduced encoding method for CNN representations, regional maximum activations of convolutions (R-MAC). The proposed extension weights the regional representations using a novel saliency measure prior to aggregation. This leads to further improvement in retrieval accuracy.
1 Introduction
With the introduction of scale invariant local features, such as SIFT [20], the field of image retrieval has benefited tremendously over the last decade, extending its popularity and applicability to other fields of research such as loop closure in robotics [19] and structure from motion [3]. In particular, the extension of the bag-of-words model from text retrieval to particular object retrieval in videos by Sivic and Zisserman [33] has been seminal for the developments in image retrieval [8, 24]. One of the advantages of such local features is that they allow the initial retrieval results to be followed up with a costly but more accurate spatial verification step [31]. The initial retrieval results are obtained by matching local descriptors using selective matching kernels [34]. The issue of scalability, in terms of memory requirements and the computational cost of pairwise descriptor matching in large-scale image retrieval databases, was addressed by encoding the local descriptors into a single compact global image representation. Popular techniques are Fisher vectors [23] and VLAD [5].
The increase in the computational capacity of GPUs and the availability of large datasets, such as ImageNet [29], have made Convolutional Neural Networks (CNN) the popular choice for a broad spectrum of computer vision tasks like image classification [18], object detection [9], and camera pose estimation [17]. As training a CNN from scratch requires a large amount of data, using activations from different layers of a CNN trained on a large dataset like [29] as an off-the-shelf image representation has broadened the applicability of CNNs to different domains [9, 30]. When the amount of data is limited, the parameters of a CNN pre-trained on ImageNet or another large dataset can be used to initialize the network parameters before training the CNN on the target dataset. This process is known as fine-tuning and has extended the use of CNNs to further domains [22, 38]. For the case of image retrieval, several works [7, 10, 27, 35] propose the use of activations from a pre-trained CNN as image descriptors. As the pre-trained CNNs employed for such instance-level retrieval tasks were generally trained to suppress intra-class variations (as required in generic computer vision problems like object detection), the performance of CNN-based descriptors lagged behind that of conventional local descriptors. Babenko et al. [7] first demonstrated that fine-tuning a pre-trained CNN on a Landmarks dataset [7] significantly improves retrieval accuracy on standard landmark benchmarks, such as Oxford5k [24] and Paris6k [25]. Arandjelovic et al. [4] used a similar paradigm of learning image representations from a large dataset of geo-tagged images; however, training was done using a ranking loss instead of the classification loss used in [7]. Radenovic et al. [26] showed the importance of unsupervised hard positive and hard negative mining for improving retrieval accuracy. After training, the final representation of an image is computed using regional maximum activations of convolutions (R-MAC) [26, 35]. R-MAC aggregates maximum activations over multiple regions into a compact image representation. The regions are generated using a fixed grid, which is designed to make the final image representation robust to scale and translation variations. Gordo et al. [11] proposed learning the R-MAC representation using a Region Proposal Network (RPN) [28]. The R-MAC and the RPN were trained in an end-to-end manner, resulting in powerful image representations that achieve state-of-the-art performance on benchmark retrieval datasets.
In this paper, we focus on particular object retrieval, a special case of image retrieval in which a query image is given along with a region of interest (ROI) containing an object of interest. The retrieval engine then returns a ranked list of the database images, such that the images containing the object of interest are ranked higher. Traditionally, image retrieval methods encode the query image using a feature representation extracted only from the ROI. This serves two purposes: reduced interference from background clutter, and suppression of distractive patterns outside the ROI, which may sometimes be more salient than the ROI itself. On the other hand, regions outside the ROI can add contextual information to the ROI representation and thereby improve retrieval performance. Thus, the suppression and the encoding of information from regions outside the ROI form a tightly coupled problem. To address this issue, we propose extending the computational model of hippocampal spatial attention introduced by Mozer and Sitton [21] to the problem of particular object retrieval. In particular, we show that by partially suppressing the intermediate CNN representations (representations from intermediate layers of the CNN) of regions in the query image outside the ROI, we can achieve a proper trade-off between the two problems stated above and thereby obtain state-of-the-art retrieval performance on standard benchmark datasets like Oxford [24] and Paris [25].
Our second contribution is an extension to the conventional R-MAC encoding technique. Standard R-MAC has the drawback of assigning uniform weight to all the regions generated by the pre-defined grid prior to aggregation. As the regions are generated independently of the image content, responses from background clutter can cause negative interference due to this equal weighting. Gordo et al. [11] proposed the use of an RPN to generate image-content-dependent regions to circumvent this problem. However, this has certain drawbacks: (i) the number of regions from the RPN is 3–10 times larger than for R-MAC [12], and (ii) the RPN model parameters need to be trained separately for the given task. Instead, we show that by using a simple saliency measure, obtained from the existing representation, one can weight the regional representations of R-MAC before aggregating them. The saliency measure assigns higher weight to landmark-type regions and lower weight to background clutter.
Using the proposed modifications, we achieve state-of-the-art retrieval results on standard object retrieval datasets. Our work can be seen as an extension of [11, 12], since we use off-the-shelf CNN representations from their trained network.
2 Background
In this section, we provide a brief background on the methods and terminology used in the CNN-based retrieval literature.
2.1 CNN
When using a pre-trained CNN such as VGGNet (VGG) [32] or a Residual Network (ResNet) [14], the network is often cropped at the last convolutional or pooling layer [11, 22], for example the \(\texttt {conv5} / \texttt {pool5}\) layer in VGG, or \(\texttt {res5c}\) in ResNet-101. In the remainder of the paper, the term CNN refers to such cropped networks. Now, consider a CNN with L layers. Given an input image \(I \in \mathbb {R}^{W_I \times H_I \times 3}\), the response obtained at the output of layer \(l \in \{1,\dots ,L\}\) is a 3D tensor \({{\varvec{X}}}^l \in \mathbb {R}^{W \times H \times K} \), where K is the number of channels and \(W \times H\) are the spatial dimensions of the output feature map. The spatial resolution of the feature map depends on the network architecture and the size of the input image, while the number of output channels K equals the number of filters in layer l. Additionally, it is assumed that the feature map \({{\varvec{X}}}^l\) has been passed through a rectified linear unit (ReLU) activation function, which ensures the non-negativity of the activations.
The feature map \({{\varvec{X}}}^l\) can be denoted as a set of K 2D feature maps, \({{\varvec{X}}}^l\) = \(\{\textit{X}_k^l\}, k = 1...K\). Alternatively, \({{\varvec{X}}}^l\) can be viewed as a set of \(W \times H\) activations, each with a K-dimensional feature representation. Instead of the term ‘pixel’, the term ‘activation’ is used in the CNN feature space. The activation at spatial location \(p \in \mathbb {R}^2\) in feature map \(X_k^l\) is denoted by \(\textit{X}_{k,p}^l\), and the set of all such locations p in a feature map is denoted by \({{\varvec{S}}} = [1,W] \times [1,H]\). As the layers are arranged in a hierarchical order, each layer computes a higher-level abstraction from the feature representations of the previous layers.
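For concreteness, the following sketch shows how such a cropped network can be used to obtain a feature map for an input image. It is a minimal illustration assuming a PyTorch/torchvision environment and an ImageNet-pre-trained ResNet-101 (not the fine-tuned weights used later in the paper); the file name query.jpg is a placeholder.

```python
# Minimal sketch (assumes PyTorch + torchvision): crop ResNet-101 after its last
# residual block (i.e. drop global pooling and the classifier) and extract the
# K x H x W feature map of one image.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet101(pretrained=True)
cnn = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()  # cropped network

preprocess = T.Compose([
    T.Resize(800),   # torchvision scales the shorter side; the paper scales the longer side
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("query.jpg").convert("RGB")).unsqueeze(0)  # 1 x 3 x H_I x W_I
with torch.no_grad():
    X = cnn(img)      # 1 x K x H x W, with K = 2048 for ResNet-101
print(X.shape)
```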
2.2 R-MAC
Although the feature maps represent a high-quality abstraction of the image, they are very high dimensional. The typical dimensionality of a feature map extracted at the output of the \(\texttt {res5c}\) layer of ResNet-101 is \(23 \times 13 \times 2048\) for an image of spatial resolution \(800 \times 600\). Hence, these high-dimensional representations are encoded into fixed-length global representations using encoding techniques [4, 6, 16, 35]. Among the various encoding methods, R-MAC has shown the highest performance [35] and is a widely popular choice for encoding CNN representations.
As R-MAC can be used to encode the feature map of any layer l, we continue with the notation introduced in Sect. 2.1 and do not introduce any layer-specific notation. For a given feature map \({{\varvec{X}}}^l\), R-MAC first generates a set of rectangular regions \({{\varvec{R}}}\) = \(\{\textit{R}_i\}, i = 1...N\), where \(R_i \subseteq {{\varvec{S}}}\) and N is the number of regions, which depends on the size of the feature map. For each region, the maximum activations of convolutions (MAC) [27] are computed by spatial max-pooling over the region in each of the K channels, resulting in a \(1 \times K\) dimensional feature vector \(\mathbf f _{\textit{R}_i}\) per region \(\textit{R}_i\), where
\(\mathbf f _{\textit{R}_i} = \big [\, f_{\textit{R}_i,1} \ldots f_{\textit{R}_i,k} \ldots f_{\textit{R}_i,K} \,\big ]^\top , \quad f_{\textit{R}_i,k} = \max _{p \in \textit{R}_i} \textit{X}_{k,p}^l. \qquad (1)\)
Each region vector \(\mathbf f _{\textit{R}_i}\) is \(l_2\) normalized, whitened with PCA, and \(l_2\) normalized again. The regional feature vectors are then sum-aggregated to obtain the final image representation \(\mathbf f \):
\(\mathbf f = \sum _{i=1}^{N} \mathbf f _{\textit{R}_i}. \qquad (2)\)
As a result of the sum aggregation, the final feature vector retains the \(1 \times K\) dimensionality. The final image representation \(\mathbf f \) is \(l_2\) normalized once more, such that a simple dot product can be used to compute image similarity.
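The sketch below illustrates Eqs. (1) and (2) in NumPy. The region grid is a simplified stand-in for the rigid multi-scale grid of [35], and the PCA whitening parameters (pca_mean, pca_P) are assumed to have been learnt offline as in [12]; both names are placeholders of this sketch.

```python
import numpy as np

def region_grid(W, H, levels=3):
    """Simplified stand-in for the rigid multi-scale R-MAC grid of [35]: at level l,
    square regions of side ~2*min(W, H)/(l+1) are placed uniformly with overlap."""
    regions = []
    for l in range(1, levels + 1):
        side = max(1, int(round(2.0 * min(W, H) / (l + 1))))
        nx = max(1, int(np.ceil(2.0 * W / side)) - 1)
        ny = max(1, int(np.ceil(2.0 * H / side)) - 1)
        for x in np.linspace(0, W - side, nx).astype(int):
            for y in np.linspace(0, H - side, ny).astype(int):
                regions.append((x, y, side, side))
    return regions

def l2n(v, eps=1e-6):
    return v / (np.linalg.norm(v) + eps)

def rmac(X, pca_mean, pca_P):
    """R-MAC, Eqs. (1)-(2): per-region max-pooling, l2-normalisation, PCA whitening,
    l2-normalisation, sum aggregation and a final l2-normalisation.
    X: feature map of shape (K, H, W); pca_mean: (K,); pca_P: (K, K)."""
    K, H, W = X.shape
    f = np.zeros(K)
    for (x, y, w, h) in region_grid(W, H):
        r = X[:, y:y + h, x:x + w].max(axis=(1, 2))   # MAC of region R_i, Eq. (1)
        r = l2n(pca_P @ (l2n(r) - pca_mean))          # whiten and re-normalise
        f += r                                        # sum aggregation, Eq. (2)
    return l2n(f)
```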
3 Contextual Information
In the general setting of particular object retrieval, contextual information can be viewed as the information outside the ROI, \(\mathcal {R}\), in the query image. Such information can be facilitatory or inhibitory in the retrieval process: the contextual information, together with the ROI, can increase or decrease the distinctiveness of the query image. Traditional models for extracting query image CNN representations include the following (see Fig. 1):
Fig. 1. An overview of feature extraction using a CNN. The network has L layers, \(\textit{l=1..L}\), where each layer computes a feature map \({{\varvec{X}}}^l\) using the representations from the previous layers. The regions (\(\mathcal {R}_1..\mathcal {R}_l..\mathcal {R}_L\)) denote projections of the ROI, \(\mathcal {R}\), onto the feature maps of different layers, l (bold black line) [13]. \(\tilde{\mathcal {R}}_L\) is the cumulative receptive field of the activations inside \(\mathcal {R}_L\) (Color figure online).
Full Query (FQ): The full query image is fed to the CNN [6], and the resulting feature representation is encoded into a fixed-length feature vector.
Cropped ROI (RQ): The ROI in the query image is cropped [11, 26] and fed into a CNN, followed by encoding the resulting feature map.
Cropped Activations (AQ): The feature map is first obtained by feed-forwarding the whole query image through the CNN. The projection of the ROI onto the feature map is then computed; it is represented by the set of activations whose receptive field centers lie inside the ROI, as shown in Fig. 1. The representations within the projected ROI are then encoded using standard encoding methods [4, 6, 26] (a minimal sketch of this projection is given below).
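The following sketch illustrates the ROI projection used by the AQ model. It approximates the receptive-field-center rule of [13] by simply rescaling the ROI with the ratio of feature-map to image size; the function names are placeholders of this sketch.

```python
import numpy as np

def project_roi(roi_xyxy, img_size, fmap_size):
    """Approximate projection of an image-space ROI (x0, y0, x1, y1) onto the
    feature map: activations are assumed to lie on a uniform grid, so the ROI is
    rescaled by the feature-map/image size ratio (a stand-in for selecting the
    activations whose receptive-field centres fall inside the ROI [13])."""
    x0, y0, x1, y1 = roi_xyxy
    (W_I, H_I), (W, H) = img_size, fmap_size
    sx, sy = W / float(W_I), H / float(H_I)
    return (int(np.floor(x0 * sx)), int(np.floor(y0 * sy)),
            int(np.ceil(x1 * sx)), int(np.ceil(y1 * sy)))

def crop_activations(X, roi_xyxy, img_size):
    """AQ: keep only the activations inside the projected ROI; X has shape (K, H, W)."""
    K, H, W = X.shape
    fx0, fy0, fx1, fy1 = project_roi(roi_xyxy, img_size, (W, H))
    return X[:, fy0:max(fy1, fy0 + 1), fx0:max(fx1, fx0 + 1)]
```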
Methods that use the FQ model are expected to perform better than RQ-based methods when the contextual information is facilitatory. However, the facilitatory nature of the context can generally not be known a priori. On the other hand, methods that extract the query image representation using AQ [4, 6, 26] are able to encode a certain amount of contextual information. As can be seen from Fig. 1, the total receptive field \(\tilde{\mathcal {R}}_L\) [2] (dotted red box) of the activations in the projected ROI \(\mathcal {R}_L\) extends beyond the ROI \(\mathcal {R}\). Thus, the extended ROI (\(\tilde{\mathcal {R}}_L\) in Fig. 1) encodes a certain amount of context. Note that the receptive fields of layers \(l < L\), \(\tilde{\mathcal {R}}_l\), cover a smaller area than \(\tilde{\mathcal {R}}_L\).
However, due to the limited reach of the receptive fields of the boundary activations, a large amount of contextual information is discarded. We propose to combine the advantages of the three models mentioned above by leveraging the computational model of spatial attention observed in the hippocampus to extract the full query image CNN representation. The attention model is presented next.
3.1 Computational Model of Spatial Attention
The main advantage of the RQ and AQ models is that the representation of the ROI has the highest response in the final query representation. In particular, in RQ the ROI provides the sole contribution, while in AQ its contribution is larger than that of the context. For AQ, this is based on the assumption that only the boundary activations are affected by regions outside the ROI. On the other hand, the disadvantage of FQ-based methods is that the ROI ceases to have a higher prominence in the final representation.
Thus, an ideal model should not only encode information from regions beyond the ROI, but should also maintain a higher response from the ROI in the final representation. Such a constraint can be modelled using the computational model of spatial attention observed in the hippocampus [21].
The attention model starts with a saliency or attention map \(A \in \mathbb {R}^{W_a \times H_a}\) defined over a feature map \({{\varvec{X}}}^l\), such that \(W_a = W\) and \(H_a = H\). We do not introduce any layer-specific notation, as the mask can be applied to the feature map at the output of any layer, l, of the CNN architecture. The same attention mask is applied across all the channels, K, of the feature map \({{\varvec{X}}}^l\) (see Sect. 2.1). Therefore, each element p of the mask A, \(A_p \in [0,1]\), affects the activations occurring at spatial location p across the K channels, \(X_{1:K,p}^l\). The activity levels are defined as follows:
\(A_p = \begin{cases} 1, & p \in \mathcal {R}_l \\ M_p, & \text {otherwise,} \end{cases} \qquad (3)\)
where \(\mathcal {R}_l\) is the projection of the ROI, \(\mathcal {R}\), onto the feature map \({{\varvec{X}}}^l\) [13] (see Fig. 1), and M is the saliency map introduced in Sect. 4.1. Note that the attention mask is specific to the spatial location and independent of the channel dimension, so identifying the activations across different channels by their spatial location suffices, i.e. the notation \(p \in \mathcal {R}_l\) represents all activations occurring at spatial location p in the feature map \({{\varvec{X}}}^l\) and lying within \(\mathcal {R}_l\), \(X_{1:K,p}^l \in \mathcal {R}_l\). The activations occurring at position p in each feature map \(X_k^l \in {{\varvec{X}}}^l, k = 1...K\), are then modulated by the attention mask as follows:
\(\tilde{X}_{k,p}^l = g(A_p)\, X_{k,p}^l, \qquad (4)\)
where g(.) is a monotonic function [21]:
\(g(A_p) = \lambda _1 + \lambda _2\, A_p^{\phi }. \qquad (5)\)
The constants \(\lambda _1\), \(\lambda _2 \in (0,1)\) are chosen such that the function g(.) always takes a value less than one, i.e. \(g(.) < 1\). Additionally, the function g(.) has a lower bound at \(\lambda _1\), which defines the maximum attenuation that can be applied to any activation outside the projected ROI, \(\mathcal {R}_l\). In all our experiments we set \(\lambda _1 = 0.5\) and \( \lambda _2 = 0.4\). The constant \(\phi \) suppresses activations with weak attention levels (less salient regions). As in [21], we set \(\phi = 4\) for all experiments.
The modulated feature map \(\tilde{{{\varvec{X}}}}^l\) is then fed forward through the remaining layers of the CNN, or directly encoded into a fixed-length global image representation (detailed in Sect. 5).
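A sketch of the resulting attention mechanism is given below. Note that the functional form of g(.) in the code follows Eq. (5) as reconstructed above, i.e. it is an assumption consistent with the stated constraints (monotonic, bounded below by \(\lambda _1\) and above by \(\lambda _1 + \lambda _2 < 1\), with \(\phi \) damping weak attention levels), not necessarily the exact function of [21].

```python
import numpy as np

def g(a, lam1=0.5, lam2=0.4, phi=4):
    # Monotonic gain, Eq. (5): lower bound lam1, upper bound lam1 + lam2 < 1;
    # the exponent phi suppresses weak attention levels (assumed form).
    return lam1 + lam2 * np.power(a, phi)

def modulate(X, M, roi_fmap):
    """Apply the spatial attention model (Eqs. (3)-(4)) to a feature map X of
    shape (K, H, W). M is the max-normalised saliency map of shape (H, W) and
    roi_fmap = (x0, y0, x1, y1) is the ROI projected to feature-map coordinates."""
    A = M.copy()              # A_p = M_p outside the projected ROI
    x0, y0, x1, y1 = roi_fmap
    A[y0:y1, x0:x1] = 1.0     # A_p = 1 inside the projected ROI
    return X * g(A)[None, :, :]   # the same mask is applied to all K channels
```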
4 Weighted R-MAC
The feature map is then encoded into a lower-dimensional feature representation. Recent state-of-the-art methods use R-MAC for this step [11, 26]. However, one critical aspect of standard R-MAC is the equal weighting of each region vector \(\mathbf f _{R_i}\) during the aggregation in Eq. 2. This implies that responses originating from image clutter and background negatively affect the retrieval process because of the uniform weighting. As one does not have prior information about the location of the object of interest in the database images, increasing the number of regions ensures higher coverage, but it also increases the interference from irrelevant regions.
Instead, we propose a weighted version of standard R-MAC (WR-MAC), such that the final image representation is a weighted combination of the representations of the regions \(R_i \in {{\varvec{R}}}\). Specifically, Eq. 2 is modified to
\(\mathbf f = \sum _{i=1}^{N} w_i\, \mathbf f _{\textit{R}_i}. \qquad (6)\)
The weights \(w_i \in \mathbb {R}\) are generated using a saliency measure (Sect. 4.1) such that landmark type regions have higher weights than background clutter.
4.1 Saliency Map
We use a simple yet effective saliency measure [36, 37]. Given the feature map \({{\varvec{X}}}^l\), the saliency function maps the 3D tensor \({{\varvec{X}}}^l\) to a 2D map M by sum-aggregating \({{\varvec{X}}}^l\) over the channel dimension. Mathematically, the function can be defined as \(\psi : \mathbb {R}^{W \times H \times K} \rightarrow \mathbb {R}^{W \times H}\) with \(\psi ({{\varvec{X}}}^l) = M\), where \(M = \sum _{k=1}^K{\textit{X}_k^l}\). The map is additionally max-normalized such that each element satisfies \(M_p \in [0,1]\).
As noted in Sect. 2.2, the regions \({{\varvec{R}}}\) generated by the rigid grid depend on the spatial dimensions \({{\varvec{S}}}\) of the feature map \({{\varvec{X}}}^l\). As the spatial dimensions of \({{\varvec{X}}}^l\) are retained in the saliency map M, we can define the same set of regions \({{\varvec{R}}}\) over the map M. Now, for each region \(\textit{R}_i\) we compute the MAC over M to obtain the weight \(w_i\), that is,
\(w_i = \max _{p \in \textit{R}_i} M_p. \qquad (7)\)
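The sketch below combines the saliency map \(\psi \) with the weighted aggregation of Eqs. (6)–(7); the region list is assumed to come from the R-MAC grid (e.g. the region_grid sketch in Sect. 2.2), and pca_mean/pca_P are again placeholder names for the whitening parameters of [12].

```python
import numpy as np

def l2n(v, eps=1e-6):
    return v / (np.linalg.norm(v) + eps)

def saliency_map(X):
    """psi: sum the (K, H, W) feature map over the channel dimension and
    max-normalise so that every element M_p lies in [0, 1]."""
    M = X.sum(axis=0)
    return M / (M.max() + 1e-6)

def wrmac(X, regions, pca_mean, pca_P):
    """Weighted R-MAC: same regional MAC descriptors as standard R-MAC, but each
    region is weighted by the MAC of the saliency map over that region (Eq. (7))
    before the sum aggregation of Eq. (6)."""
    M = saliency_map(X)
    f = np.zeros(X.shape[0])
    for (x, y, w, h) in regions:
        r = X[:, y:y + h, x:x + w].max(axis=(1, 2))   # MAC of region R_i
        r = l2n(pca_P @ (l2n(r) - pca_mean))          # l2 -> PCA whitening -> l2
        w_i = M[y:y + h, x:x + w].max()               # saliency weight, Eq. (7)
        f += w_i * r                                  # weighted aggregation, Eq. (6)
    return l2n(f)
```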
5 Experiments
5.1 Network and Datasets
For all our experiments we use the ResNet-101 [14] CNN architecture. The network is very deep and attains state-of-the-art results in a variety of computer vision problems [14]. The original network, pre-trained on ImageNet, is cropped at the layer \(\texttt {res5c\_ReLU}\) and fine-tuned using the method proposed in [12]. The parameters of the fine-tuned model (ResNet-IR) are publicly available [1]. The ResNet-IR model has additional layers on top of \(\texttt {res5c\_ReLU}\), which perform PCA whitening of the region vectors obtained from the \(\texttt {res5c\_ReLU}\) representations using the R-MAC grid and perform the final R-MAC aggregation. These additional layers were trained end-to-end during fine-tuning [12]. We use the learnt PCA parameters from [12] in our experiments.
For evaluation, we use the standard, publicly available particular object retrieval datasets Oxford Buildings [24] and Paris Buildings [25]. Each dataset contains several thousand images. Within these images, 55 queries are defined, each with an ROI giving the precise location of the landmark to be queried. Retrieval performance is measured using mean average precision (mAP), where the mean is taken over all queries.
5.2 WR-MAC
As mentioned in Sect. 2.2, R-MAC can be used to encode the representations of any CNN layer, but, as in [12], we use the \(\texttt {res5c\_ReLU}\) representations for WR-MAC. A saliency map is computed from the \(\texttt {res5c\_ReLU}\) representations and then normalized. The normalized map is used to generate the weights, which are applied only when aggregating the regional representations generated from the R-MAC grid. Note that prior to aggregation, these representations are \(l_2\) normalized, whitened with PCA, and \(l_2\) normalized again. As in [12], we extract R-MAC and WR-MAC representations at 3 different scales of the given image: 550, 800, and 1050 pixels on the largest side, maintaining the original aspect ratio. The representations from the 3 scales are aggregated and \(l_2\) normalized to obtain the final image representation.
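A sketch of this multi-scale extraction is given below, under the assumption that encode wraps the CNN forward pass plus (W)R-MAC encoding and already returns an \(l_2\)-normalized descriptor; both encode and resize_longest_side are placeholder names of this sketch.

```python
import numpy as np
from PIL import Image

def resize_longest_side(img, size):
    # Preserve the aspect ratio and set the longest side to `size` pixels.
    w, h = img.size
    s = size / float(max(w, h))
    return img.resize((int(round(w * s)), int(round(h * s))), Image.BILINEAR)

def multiscale_descriptor(encode, img, scales=(550, 800, 1050)):
    """Aggregate the descriptors obtained at the three scales used in [12] and
    l2-normalise the result. `encode` is any image-to-descriptor function, e.g.
    a wrapper around the CNN + WR-MAC sketches above."""
    d = sum(encode(resize_longest_side(img, s)) for s in scales)
    return d / (np.linalg.norm(d) + 1e-6)
```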
5.3 Attention Model
The only parameter of the attention model that needs to be defined is the layer number l, i.e. the CNN layer to which the attention model is applied. As each layer computes its feature representations from the representations of the previous layers, applying the attention model at a certain layer affects the representations of all subsequent layers. As the number of layers in the ResNet-IR network is high, we choose the following layers: (i) \(\texttt {res2c\_ReLU}\), (ii) \(\texttt {res4b15\_ReLU}\), and (iii) \(\texttt {res5c\_ReLU}\), and evaluate their performance as described below.
Given a query image and a ResNet-IR layer \(l'\) (sampled as mentioned above), the feature representations at the output of layer \(l'\) are used to compute the normalized saliency map. Note that the saliency maps used in the experiments therefore differ: for WR-MAC the saliency map is defined over the \(\texttt {res5c\_ReLU}\) representations, while for the attention model it is defined over layer \(l'\). The modulated representations of layer \(l'\) are then used as input to layer \(l'+1\) and fed forward through the remaining layers of ResNet-IR to obtain the final \(\texttt {res5c\_ReLU}\) representations. The \(\texttt {res5c\_ReLU}\) representations are then encoded as discussed in Sect. 5.2.
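The following sketch illustrates this split forward pass on a torchvision ResNet-101. The Caffe-style layer names used above (\(\texttt {res2c\_ReLU}\), \(\texttt {res4b15\_ReLU}\), \(\texttt {res5c\_ReLU}\)) map only approximately onto torchvision's layer1/layer3/layer4 blocks, and the sketch uses ImageNet weights rather than the released ResNet-IR model, so it should be read as an illustration of the mechanism only.

```python
import torch
import torchvision.models as models

resnet = models.resnet101(pretrained=True).eval()
stem = torch.nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)

def split_forward(img, apply_attention):
    """Forward to an intermediate block (~res4b15), modulate the activations with
    the caller-supplied `apply_attention` (e.g. the `modulate` sketch of Sect. 3.1,
    with the ROI already projected to this layer's resolution), then forward
    through the remaining blocks to obtain the final res5c-like feature map."""
    with torch.no_grad():
        x = resnet.layer2(resnet.layer1(stem(img)))
        x = resnet.layer3[:17](x)          # approximate split point (~res4b15)
        x = torch.from_numpy(apply_attention(x[0].numpy())).unsqueeze(0).float()
        x = resnet.layer3[17:](x)
        x = resnet.layer4(x)               # ~res5c_ReLU output, 1 x 2048 x H x W
    return x
```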
5.4 Results
In order to better interpret the results, the following points need to be noted: (i) the spatial attention model is only applied to the query side, and (ii) the proposed encoding method, WR-MAC, is applied to both the query and database images.
The motivation for evaluating the performance of the attention model across different CNN layers is to observe where in the CNN the contextual representations should be suppressed: as early as the lower-level layers (\(\texttt {res2c\_ReLU}\)), as late as the high-level layers (\(\texttt {res5c\_ReLU}\)), or somewhere in between, at the mid level (\(\texttt {res4b15\_ReLU}\)). From the results in Table 1, it can be observed that the performance of all the layers considered improves over the baseline (Table 2). However, the performance of the \(\texttt {res4b15\_ReLU}\) representations is more consistent across the datasets. We therefore consider the spatial attention model (SA) applied to the \(\texttt {res4b15\_ReLU}\) representations as our proposed model and compare it with existing approaches in Table 2. As in Sect. 5.3, for all the models in Table 2 the \(\texttt {res5c\_ReLU}\) representations are obtained and then encoded as discussed in Sect. 5.2. All the models FQ, AQ, and SA perform better than the baseline RQ due to the addition of contextual information to the final image representation. However, the SA model outperforms the existing models on both datasets.
The comparison between the FQ and AQ models gives an idea of the effect of contextual information on image retrieval. The AQ model, which encodes a limited amount of contextual information, outperforms FQ on the Oxford dataset, but is outperformed by FQ on the Paris dataset. Our assumption is that in the Paris dataset the contextual information has a facilitatory effect, while in the Oxford dataset it has a certain degree of inhibitory effect on retrieval performance. An in-depth analysis is left for future work. Although the performance of the SA model is constrained by the requirement of an ROI on the query side, this does not diminish the scope of the proposed encoding method, which shows consistent improvement even without the spatial attention model.
6 Conclusion
In this paper we addressed different models for extracting and encoding image representations using CNNs for the task of image retrieval. The current models for extracting CNN representations from query images cannot fully exploit the advantages of the ROI provided with the query image. We propose an extension of an attention model to this problem and demonstrate an increase in retrieval accuracy on standard particular object retrieval datasets. Additionally, we propose an extension to a recently introduced and popular encoding method for CNN representations, R-MAC, and show further improvements in retrieval performance. To the best of our knowledge, our results are the best reported so far among global image representations for these datasets. Using efficient re-ranking strategies [8, 15] can lead to further improvements in retrieval accuracy.
References
Deep Image Retrieval with Residual Network. http://www.xrce.xerox.com/Our-Research/Computer-Vision/Learning-Visual-Representations/Deep-Image-Retrieval/
Receptive Field. http://cs231n.github.io/convolutional-networks/
Agarwal, S., Furukawa, Y., Snavely, N., Simon, I., Curless, B., Seitz, S.M., Szeliski, R.: Building rome in a day. Commun. ACM 54(10), 105–112 (2011)
Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: NetVLAD: CNN architecture for weakly supervised place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297–5307 (2016)
Arandjelovic, R., Zisserman, A.: All about VLAD. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1578–1585 (2013)
Babenko, A., Lempitsky, V.: Aggregating local deep features for image retrieval. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1269–1277 (2015)
Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image retrieval. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 584–599. Springer, Cham (2014). doi:10.1007/978-3-319-10590-1_38
Chum, O., Mikulik, A., Perdoch, M., Matas, J.: Total recall ii: query expansion revisited. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 889–896. IEEE (2011)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
Gong, Y., Wang, L., Guo, R., Lazebnik, S.: Multi-scale orderless pooling of deep convolutional activation features. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 392–407. Springer, Cham (2014). doi:10.1007/978-3-319-10584-0_26
Gordo, A., Almazán, J., Revaud, J., Larlus, D.: Deep image retrieval: learning global representations for image search. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 241–257. Springer, Cham (2016). doi:10.1007/978-3-319-46466-4_15
Gordo, A., Almazan, J., Revaud, J., Larlus, D.: End-to-end learning of deep visual representations for image retrieval (2016). arXiv:1610.07940
He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8691, pp. 346–361. Springer, Cham (2014). doi:10.1007/978-3-319-10578-9_23
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Iscen, A., Tolias, G., Avrithis, Y., Furon, T., Chum, O.: Efficient diffusion on region manifolds: recovering small objects with compact CNN representations (2016). arXiv:1611.05113
Kalantidis, Y., Mellina, C., Osindero, S.: Cross-dimensional weighting for aggregated deep convolutional features. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9913, pp. 685–701. Springer, Cham (2016). doi:10.1007/978-3-319-46604-0_48
Kendall, A., Grimes, M., Cipolla, R.: Posenet: a convolutional network for real-time 6-dof camera relocalization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2938–2946 (2015)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
Laskar, Z., Huttunen, S., Herrera, D., Rahtu, E., Kannala, J.: Robust loop closures for scene reconstruction by combining odometry and visual correspondences. In: 2016 IEEE International Conference on Image Processing (ICIP), pp. 2603–2607. IEEE (2016)
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
Mozer, M.C., Sitton, M.: Computational modeling of spatial attention. In: Pashler, H. (ed.) Attention, vol. 9, pp. 341–393. UCL Press, London (1998)
Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Learning and transferring mid-level image representations using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1717–1724 (2014)
Perronnin, F., Liu, Y., Sánchez, J., Poirier, H.: Large-scale image retrieval with compressed fisher vectors. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3384–3391. IEEE (2010)
Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007, pp. 1–8. IEEE (2007)
Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in quantization: improving particular object retrieval in large scale image databases. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008, pp. 1–8. IEEE (2008)
Radenović, F., Tolias, G., Chum, O.: CNN image retrieval learns from BoW: unsupervised fine-tuning with hard examples. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 3–20. Springer, Cham (2016). doi:10.1007/978-3-319-46448-0_1
Razavian, A.S., Sullivan, J., Carlsson, S., Maki, A.: Visual instance retrieval with deep convolutional networks (2014). arXiv:1412.6574
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
Sharif Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806–813 (2014)
Shen, X., Lin, Z., Brandt, J., Wu, Y.: Spatially-constrained similarity measure for large-scale object retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 36(6), 1229–1241 (2014)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). arXiv:1409.1556
Sivic, J., Zisserman, A.: Video Google: a text retrieval approach to object matching in videos. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1470–1477 (2003)
Tolias, G., Avrithis, Y., Jégou, H.: To aggregate or not to aggregate: selective match kernels for image search. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1401–1408 (2013)
Tolias, G., Sicre, R., Jégou, H.: Particular object retrieval with integral max-pooling of CNN activations (2015). arXiv:1511.05879
Wei, X.S., Luo, J.H., Wu, J.: Selective convolutional descriptor aggregation for fine-grained image retrieval (2016). arXiv:1604.04994
Zagoruyko, S., Komodakis, N.: Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer (2016). arXiv:1612.03928
Zhang, N., Donahue, J., Girshick, R., Darrell, T.: Part-based R-CNNs for fine-grained category detection. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 834–849. Springer, Cham (2014). doi:10.1007/978-3-319-10590-1_54