1 Introduction

Image matching and retrieval is a widely needed task in the computer vision field. Traditional solutions to this problem are based on image segmentation, region description, and region matching [6, 38, 39]. In these techniques, regions can be obtained at the super-pixel level as done in [6] or by grouping salient features with mean-shift clustering [38]. The next step of generating region descriptors is usually performed by exploiting the visual and spatial information of the image content, mostly based on robust descriptors such as SIFT [25] and GIST [33]. The last step, region matching, can be done by comparing the extracted visual features as in [6, 38] or by exploiting both geometry and color features [39].

In the present work, we are interested in efficient feature matching, which is applicable to many computer vision systems such as near-duplicate image search, object matching and recognition, and content-based image retrieval. Specifically, the studied problem can be formulated as approximate nearest-neighbor (ANN) search, whose crucial goal is to quickly produce (approximate) answers for a given query with very high accuracy (> 80%). State-of-the-art approaches to this problem have focused on the inverted index [44], the visual vocabulary tree [31], space partitioning (KD-tree [8]), product quantization combined with an inverted index [16] or an inverted multi-index [3], and additive quantization [2].

The combination of tree structures and quantization has also been exploited in recent works such as the product split tree (PS-tree) [5], tree quantization (TQ) [4], the product quantization tree (PQT) [49], and embedding product quantization (EPQ) [37]. The PS-tree divides the data into m sub-sets (\(m \le 2\) in that work), each of which is used to learn a codebook. However, codebook learning is done in a principal feature analysis manner that resembles the principal axis tree [27]. The PQT can be considered a two-level quantization tree. At the first level, it creates m sub-codebooks independently as PQ does [16]. At the second level, the sub-clusters of each sub-codebook are further refined by a vector quantizer. The resulting tree can play the role of a quantizer as well as an indexing structure.

Alternatively, the hierarchical vocabulary tree can be used as an effective encoding method in combination with the PQ technique, the so-called embedding product quantization (EPQ) [37]. Unlike PQ, the sub-codebooks in EPQ are created by building several hierarchical clustering trees. Each tree is generated in a low-dimensional sub-space, hence capturing a different aspect of the manifold structure of the original data. Encoding an input vector in EPQ amounts to searching for the identification (ID) of the leaf node whose centroid is closest to the input vector in terms of the Euclidean distance. The search starts from the root and traverses down the tree: at each internal node, the child closest to the query is selected for further exploration, and so on. Naturally, the tree also provides non-exhaustive search without an additional coarse quantizer as PQ requires. In addition, EPQ benefits greatly from a re-ranking technique that uses hash tables to pre-compute the distances among the codewords of each sub-codebook.

In the present work, we aim at optimizing the EPQ technique so that the search performance is further improved while reducing the memory overhead incurred by the use of large hash tables. Although EPQ gives quite good search performance, it suffers from several weaknesses. First, the memory overhead of the hash tables used in EPQ is \(O(mV^{2})\), where m is the number of sub-codebooks and V is the number of leaf nodes of each tree. To favor search accuracy, V must be set to a high value, hence raising a problem of memory overhead. Second, the idea of using a separate clustering tree in each sub-space poses an additional computational overhead as well as extra memory complexity. In EPQ, the compression codes and the IDs of the input vectors are stored at the leaf nodes of each tree, and this information is duplicated over all the trees. When searching for the answers to a given query, an additional check has to be performed so as not to process IDs that have already been visited. The search speedups are thus degraded to some extent. To address these limitations, the current work differs from the EPQ technique in the following aspects: (i) only a single, finer tree is created to globally capture all the data points of the original dataset, and (ii) asymmetric distance computation (ADC) is employed in the re-ranking step instead of symmetric distance computation (SDC).

For the remainder of this paper, Sect. 2 reviews the main approaches in the literature for approximate nearest-neighbor search. Next, the proposed indexing scheme is described in Sect. 3. Experimental results and discussions are presented in Sect. 4. Finally, we conclude the paper and discuss possible directions for improvement in Sect. 5.

2 Related work

According to the taxonomy of indexing approaches in the literature [36, 37], we shall discuss the most representative methods that belong to the following classes: space partitioning, hierarchical clustering, product quantization, deep learning, and hashing.

Space partitioning In the space partitioning approach, the data space is hierarchically represented by a balanced tree structure. To this aim, the dataset is iteratively partitioned into equal-size sub-groups until the size of each sub-group is smaller than a threshold. The KD-tree [8] is one of the most popular techniques; it uses a hyperplane to split the data into two roughly equal-size groups at the median pivot. The principal axis tree (PA-tree) [27] is an improved version of the KD-tree in that the data is pre-aligned by using principal component analysis (PCA). Multiple KD-trees can be combined to improve the search performance, as proposed in [42]. In that work, each tree is constructed by randomly choosing, at each internal node, a split axis among a few dimensions of highest variance. The LM-tree [36] suggests using the two highest-variance axes in the PCA space to split the data at each level of decomposition. Multiple randomized LM-trees can be used simultaneously to boost the search performance at the cost of increased memory overhead.

Hierarchical clustering The hierarchical clustering approach is quite similar to the space partitioning approach but differs in that a clustering algorithm is employed for data decomposition. For this purpose, different clustering algorithms can be used, such as K-means [26], K-medoids [18], or DBSCAN [7]. The K-means method considers the center of a cluster as the gravity center of the points assigned to that cluster and defines the objective function as the sum of squared errors. In contrast, the K-medoids algorithm minimizes the sum of dissimilarities between every point of a cluster and a medoid (i.e., the cluster member with minimal average dissimilarity to all the points in the cluster); the two objectives are contrasted below. Alternatively, one can also employ the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm [7]. While K-means and K-medoids produce clusters of convex shape, DBSCAN is capable of discovering clusters of arbitrary shape. Its basic idea is that points that are closely packed in a small neighborhood (i.e., high density) form part of a cluster. Here, a neighborhood is characterized by a radius parameter as well as the choice of a distance function between two points. In addition, to be considered high density, the number of points in each neighborhood must exceed a given threshold. DBSCAN is very attractive for spatial datasets that contain clusters of irregular shapes, and it is quite time efficient when working on large datasets. A recent application of DBSCAN concerns the real-time super-pixel segmentation task, with quite interesting results [41].
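In our notation (which is ours, not taken from [18, 26]), with clusters \(C_{1}, \ldots , C_{K}\), cluster centroids \(c_{j}\), medoids \(m_{j} \in C_{j}\), and a dissimilarity \(\mathbf {d}\), the two objectives read:

$$\begin{aligned} J_{\text {K-means}} = \sum _{j=1}^{K}\sum _{x \in C_{j}} \Vert x - c_{j}\Vert ^{2}, \qquad J_{\text {K-medoids}} = \sum _{j=1}^{K}\sum _{x \in C_{j}} \mathbf {d}(x, m_{j}). \end{aligned}$$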

Back to the hierarchical clustering-based indexing approach, the vocabulary K-means tree was first introduced in [31] to create a hierarchical representation of the data. In this work, the K-means algorithm is iteratively applied to partition the feature vectors into smaller groups. Each sub-group is represented by either a leaf node, if it is small enough, or an internal node otherwise. In this way, the resulting tree defines quantized cells that are treated as visual vocabularies. Several attempts in [28, 30] improve the vocabulary K-means tree by using a priority search strategy. Multiple randomized clustering trees (RC-trees) have also been studied in [29]. Another noticeable method, the POC-tree [35], proposes maximizing, at each level of decomposition, the separation between every pair of sub-clusters, resulting in a compact representation of the underlying data.

Product quantization Contrary to the aforementioned approaches, a product quantization (PQ) method divides the data space into disjoint sub-spaces and quantizes them separately. The original vectors are then quantized to the cells of each sub-space, forming a compact code for each feature vector. Therefore, PQ itself is just a quantization technique that maps an input vector into a small code, resulting in significant benefits in memory storage and computation. However, searching with PQ codes still requires an exhaustive scan of the whole database and is thus very time consuming. To alleviate this issue, the authors in [16] proposed to combine the PQ codes with an inverted index (using a coarse codebook). The resulting combination was shown to significantly improve the search performance. In [3], the PQ approach was further advanced by exploiting the idea of an inverted multi-index. In that work, the data in each sub-space is separately quantized and coupled with an inverted index, hence the name inverted multi-index. Experimental results show that the second-order multi-index (i.e., the original data space is divided into just two sub-spaces) is a good choice in most situations. This observation has not been thoroughly justified and seems less convincing because search performance is highly affected by the curse of dimensionality.
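As a concrete illustration of the memory benefit (using \(m = 8\) sub-spaces and \(k = 256\) sub-codewords per sub-codebook, which are typical values assumed here for the example rather than fixed by [16]), a 128-dimensional SIFT vector stored as 4-byte floats occupies 512 bytes, whereas its PQ code consists of m sub-codeword indices of \(\log _{2}k = 8\) bits each:

$$\begin{aligned} \underbrace{128 \times 4}_{\text {raw vector}} = 512 \text { bytes} \quad \longrightarrow \quad \underbrace{8 \times 1}_{\text {PQ code}} = 8 \text { bytes}, \end{aligned}$$

i.e., a compression factor of 64.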

Optimized product quantization There have been several attempts to improve the coding quality of the quantization process, as done in [2, 9, 17, 22, 32]. All these works propose optimizing the quantization step to better fit the underlying data (e.g., by reducing the quantization errors). In [32] (ck-means and ok-means), the data is adaptively transformed (i.e., in terms of rotation, scaling, and translation) jointly with the quantization task to capture the intrinsic variances of the underlying feature vectors. In [9] (OPQ), the data dimensions are globally pre-aligned by using PCA and then re-ordered to jointly achieve independence and balance (in terms of the variances of the data dimensions) among the sub-spaces. Similarly, locally optimized PQ (LOPQ) [17] also exploits the idea of OPQ, but in a local manner: OPQ is not applied to the whole dataset but to the sub-clusters obtained from a coarse vector quantizer. Intuitively, LOPQ should give a better fit and reduce the quantization errors. However, the joint optimization of the coarse vector quantization and the local product quantization has not been addressed yet.

Differing from the aforementioned methods, distribution sensitive product quantization (DSPQ) [22] proposes to dynamically balance the data distribution and the bit budget allocated to each sub-space. Previous PQ-based methods use a uniform bit allocation for each sub-space and thus may produce over- or under-quantized results. DSPQ addresses this issue by introducing a novel measurement, called the aggregation degree (AD), that indicates how compact a sub-space is. A uniform sub-space corresponds to a low AD value and vice versa. A bit arrangement strategy is then applied in the sense that a larger bit budget is allocated to uniform distributions, while a smaller bit budget is assigned to clustered sub-spaces. DSPQ has been shown to be effective compared to other PQ methods, but it still adopts the basic assumption of PQ that the sub-spaces are mutually independent and thus applies a uniform data projection.

Alternatively, the mutual independence assumption of PQ-based methods has also been addressed in recent works [2, 4, 53, 54]. In [2], the authors introduced an additive quantization (AQ) method that encodes each input vector as the sum of R codewords coming from R codebooks. Moreover, the codewords in AQ have the same length as the input vectors. As the sub-space independence assumption is dropped, the AQ method gives better coding accuracy but suffers from efficiency issues. Similarly, the work in [4] advances the PQ spirit to create a new encoding tree, the so-called tree quantization (TQ), where the vertices represent the codebooks and each edge corresponds to a dimension. Each codebook (i.e., vertex) encodes only the dimensions whose edges are directly connected to the vertex. The other dimensions are set to zero, and thus the codewords in TQ have full length D but contain many zero-valued elements. As in AQ, an input vector is encoded as the sum of m codewords coming from m codebooks. In addition, as in the case of OPQ, an optimized version of TQ, called optimized tree quantization (OTQ), applies a global rotation of the data to reduce the quantization errors. Experiments showed that TQ and OTQ give better quantization quality than PQ, but they are not superior to IVFADC [16] in search speedups [4].

In the same line of using full-length codewords, the authors in [53] proposed a technique to approximate an input vector as a composition of R codewords coming from R dictionaries, under the constraint that the product of inter-dictionary elements is a constant. This constraint helps reduce the cost of computing the distances between a query and the codewords. However, the obtained efficiency is still not satisfactory. An extension of this work is presented in [54], where an additional sparsity constraint is imposed to make the codewords very sparse. As a result, the online distance computation cost is significantly reduced.

Deep learning methods In recent years, deep learning techniques have actively contributed to the remarkable success of computer vision applications. The most successful application domains of this approach are probably object detection and recognition [47, 48] (typically face detection [14, 52] and face representation/verification [40, 51]) and speech recognition [11, 50]. Good surveys of deep learning techniques for salient object detection can be found in [12, 24]. In contrast, the adaptation of deep learning techniques to ANN search has been less discussed. In this part, we discuss several recent works with a special focus on this direction. Deep product quantization (DPQ) [19] is perhaps the leading technique that integrates the PQ spirit into a deep learning framework. While a standard PQ technique provides many desirable properties for ANN search, it is commonly trained in an unsupervised fashion, which prevents it from benefiting from task-related supervised signals when they are available. To alleviate this situation, DPQ is the first attempt to learn compact quantization codes in an end-to-end, supervised manner within a deep learning network.

DPQ works on top of input feature vectors (i.e., embeddings) and is composed of two small multilayer perceptron (MLP) networks and a product quantizer, connected to a joint central loss function at the final layer. Like PQ, DPQ creates m sub-codebooks in the product quantization learning, each of which is trained on the sub-vectors of one sub-space and is composed of k sub-clusters. The MLP networks learn the probability distribution of each sub-vector over the k sub-clusters. DPQ outputs a soft and a hard representation for each input vector, where the soft code is created as a combination of the probabilities and sub-clusters in each sub-space, while the hard code is chosen as the sub-cluster with the highest probability. The joint central loss function plays two important roles: encouraging features of the same class to come together in a cluster and reducing the discrepancy between the soft and hard representations of the points in the same class. DPQ was shown to outperform PQ and many other CNN-based methods for both image retrieval and classification tasks. The limitation of DPQ is perhaps the case where it is applied to test data that is significantly different from the training data (i.e., cross-domain).

Apart from DPQ, there also exist many deep learning-based methods for image retrieval, such as deep autoencoders [20], the region proposal network (RPN) [10], and deep supervised hashing (DSH) [23]. However, all these methods are applied directly to the images, not to feature embeddings as in the case of DPQ or the indexing schemes discussed in this section. Interested readers are referred to the review studies in [45, 46].

Hashing methods The crucial idea of a hashing method is to project the data into sub-spaces (e.g., \(R^{1}\) in [15] or \(R^{t}\) with \(t >1\) in [1]) such that similar points have a high probability of collision (i.e., of being hashed into the same bucket). In the Locality-Sensitive Hashing (LSH) family [15], a hashing function is formulated by using a similarity measure D(u, v), defined for two given points u and v, as follows: \(Pr_{h \in H}[h(u) = h(v)] = D(u,v)\), where D(u, v) is assumed to be in the range [0, 1]. The similarity function can also be extended to an arbitrary kernel function, as in [13, 21]. However, these methods require a large number of hash tables, making them impractical for memory-constrained configurations. Entropy LSH [34] handles this matter by creating several synthetic neighbors of the query and using them to access the corresponding buckets. Generally speaking, the hashing methods have formed a solid theoretical foundation for the problem of approximate nearest-neighbor search, but several studies have shown that they perform less effectively in practice [4, 29, 30].
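To make the collision property concrete, here is a minimal C++ sketch of one classical member of the family, sign random projections for the angular (cosine) similarity, in which the collision probability of two points decreases with the angle between them. It is an illustration of the general idea rather than the specific schemes of [1, 13, 15, 21, 34], and the class name SignLSH and its layout are our own.

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// One sign-random-projection hash: t random hyperplanes map a vector to a t-bit
// bucket key (t <= 64 here). Vectors with a small angle agree on most signs and
// therefore collide with high probability.
struct SignLSH {
    std::vector<std::vector<float>> planes;   // t x D Gaussian directions
    SignLSH(int t, int D, unsigned seed = 42) : planes(t, std::vector<float>(D)) {
        std::mt19937 gen(seed);
        std::normal_distribution<float> g(0.f, 1.f);
        for (auto& w : planes)
            for (auto& v : w) v = g(gen);
    }
    std::uint64_t hash(const std::vector<float>& x) const {
        std::uint64_t key = 0;
        for (std::size_t i = 0; i < planes.size(); ++i) {
            float dot = 0.f;
            for (std::size_t d = 0; d < x.size(); ++d) dot += planes[i][d] * x[d];
            key |= (dot >= 0.f ? 1ull : 0ull) << i;   // one bit per hyperplane
        }
        return key;                                    // bucket identifier in one hash table
    }
};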

Fig. 1

Demonstration of EPQ [37] and the proposed tree index. Left: EPQ divides the input vectors into m disjoint sub-spaces, each of which is indexed by a separate hierarchical clustering tree. The quantization \(\mathbf q (x)\) of an input vector \(x \in \mathbb {R}^{D}\) is obtained by concatenating the sub-codes from the m sub-quantizers (SQ). EPQ requires m copies of \(\mathbf q (x)\) to be stored at the leaf nodes over all the trees, hence raising an additional memory overhead and extra computational cost. In addition, EPQ employs symmetric distance computation (SDC) to re-rank the shortlist at the cost of increased memory consumption. Right: our new indexing scheme uses a single fine tree to represent all the data points, avoiding redundancy in storing the compression codes. The sub-quantizers are obtained via a standard codebook learning algorithm (as in PQ) and are combined with asymmetric distance computation (ADC) to re-rank the answers. The outputs of PQ are used only at the leaf nodes

3 The proposed indexing algorithm

3.1 Background and formulation

In the present work, we investigate a novel usage of a hierarchical clustering tree in combination with the product quantization idea. In the EPQ method, the data space is divided into m disjoint sub-spaces and a sub-quantizer (SQ) is learned for each sub-space. The difference of EPQ from PQ is that codebook learning is performed by creating a hierarchical clustering tree, as illustrated in Fig. 1 (left). As a result, there are m clustering trees in total, each of which plays the role of a sub-quantizer for the corresponding sub-space. For clarity of presentation, denote by \(T_{j}\) the jth clustering tree (\(j = 1, 2, \ldots , m\)) and by \(a_{j}(x)\) the jth sub-vector of an input vector \(x \in \mathbb {R}^{D}\).

Quantization of a sub-vector \(a_{j}(x)\) by the tree \(T_{j}\) is denoted as:

$$\begin{aligned} T_{j}(a_{j}(x)) = k^{*}_{j} \end{aligned}$$
(1)

where \(k^{*}_{j}\) is the ID of the leaf node reached by a specific path in \(T_{j}\): the path starts from the root and always chooses the child whose centroid is closest to \(a_{j}(x)\) in terms of the Euclidean distance. By concatenating the quantization results from the m clustering trees, we obtain the complete EPQ-based code of the input vector x as follows:

$$\begin{aligned} \mathbf q (x) = [T_{1}(a_{1}(x)),T_{2}(a_{2}(x)), \dots ,T_{m}(a_{m}(x))]. \end{aligned}$$
(2)

To support quick access during the query phase, EPQ duplicates \(\mathbf q (x)\) and stores a copy at the leaf nodes over all the trees. For instance, if the sub-vector \(a_{j}(x)\) is assigned to a specific leaf node of \(T_{j}\), then a copy of \(\mathbf q (x)\) is stored at that node. The side effect of keeping duplicated EPQ codes is that an additional checking step is needed to avoid re-examining data points that have already been visited. This extra computation degrades the search performance of EPQ to some degree.

Unlike EPQ, the current work employs a standard clustering algorithm to train the sub-quantizers, exactly as PQ does [16]. This means that the original space is partitioned into m sub-spaces, from each of which a separate sub-codebook is learned. For the sake of presentation, let \(\mathcal {L}_{1}, \mathcal {L}_{2}, \ldots , \mathcal {L}_{m}\) be the m sub-codebooks, each of which contains k sub-codewords. We also define two sub-quantization operators, \(q_{j}(s)\) and \(\hat{q}_{j}(s)\) with \(1 \le j \le m\), for a sub-vector \(s \in \mathbb {R}^{D/m}\) as follows:

$$\begin{aligned} q_{j}(s) = \mathop {\hbox {arg min}}\limits _{1 \le i \le k}\, \mathbf {d}(s, \mathcal {L}_{j}^{i}) \end{aligned}$$
(3)

and

$$\begin{aligned} \hat{q}_{j}(s) = \mathop {\hbox {arg min}}\limits _{\mathcal {L}_{j}^{i}}\, \mathbf {d}(s, \mathcal {L}_{j}^{i}) \end{aligned}$$
(4)

where \(\mathcal {L}_{j}^{i}\) is the ith sub-codeword of \(\mathcal {L}_{j}\). In other words, \(q_{j}(s)\) gives the index of a sub-codeword in \(\mathcal {L}_{j}\) that is closest to the sub-vector s, while \(\hat{q}_{j}(s)\) gives the sub-codeword itself.

Given an input vector \(x \in \mathbb {R}^{D}\), its complete quantization code \(\mathbf {q}(x)\) is formed by concatenating m sub-quantizations in the following way:

$$\begin{aligned} \mathbf {q}(x) = [q_{1}(a_{1}(x)),q_{2}(a_{2}(x)),\ldots , q_{m}(a_{m}(x))]. \end{aligned}$$
(5)

The details of this step are outlined in Algorithm 1 (a code sketch is given after it). We first divide the input vector into m sub-vectors. For each sub-vector, we compute the Euclidean distances from it to all the sub-codewords of the corresponding sub-codebook. The sub-codeword with the smallest distance is selected as the sub-quantization result. Finally, all the sub-quantizations are concatenated to form the complete quantized code of x.

Algorithm 1
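For reference, the following minimal C++ sketch of this encoding step is consistent with the definitions above; the function name computePQCode, the flat row-major codebook layout, and the use of one byte per sub-index (valid for \(k \le 256\)) are our own choices, not part of the original implementation.

#include <cstdint>
#include <limits>
#include <vector>

// codebooks[j] holds the k sub-codewords of L_j, each of length D/m, stored contiguously.
using SubCodebook = std::vector<float>;

// Concatenate, for each sub-vector a_j(x), the index of its nearest sub-codeword in L_j
// (squared Euclidean distance), yielding the complete quantization code q(x).
std::vector<std::uint8_t> computePQCode(const std::vector<float>& x,
                                        const std::vector<SubCodebook>& codebooks,
                                        int m, int k) {
    const int subDim = static_cast<int>(x.size()) / m;
    std::vector<std::uint8_t> code(m);
    for (int j = 0; j < m; ++j) {
        const float* a_j = x.data() + j * subDim;          // j-th sub-vector of x
        float best = std::numeric_limits<float>::max();
        int bestIdx = 0;
        for (int i = 0; i < k; ++i) {                      // scan all sub-codewords of L_j
            const float* c = codebooks[j].data() + i * subDim;
            float dist = 0.f;
            for (int d = 0; d < subDim; ++d) {
                float diff = a_j[d] - c[d];
                dist += diff * diff;
            }
            if (dist < best) { best = dist; bestIdx = i; }
        }
        code[j] = static_cast<std::uint8_t>(bestIdx);      // q_j(a_j(x)), assumes k <= 256
    }
    return code;
}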

In addition to building the m sub-quantizers, we also train a fine hierarchical clustering tree to globally represent all the data points in the original space. Algorithm 2 provides an outline of this process. Specifically, the algorithm starts by creating a root node corresponding to the whole dataset. Next, we iteratively apply a standard clustering algorithm (K-means in our implementation) to cluster the underlying data into K sub-clusters, each of which is represented by an internal node along with the centroid of its underlying data.

Algorithm 2

Note that all the centroids have the same full length as the original vectors. We then repeat this process for every sub-cluster until its size is sufficiently small. Each of these small sub-clusters is represented by a leaf node, also accompanied by the centroid of its underlying data. In addition, for all the data points assigned to a leaf node, their quantization codes are stored at that leaf node for efficient access during the query phase (a code sketch of this construction is given below). Here, as in the K-means method, the centroid of a given subset S is computed by taking the mean of all data points belonging to S, as follows:

$$\begin{aligned} c(S) = \frac{1}{|S|}\sum _{x \in S}{ x}. \end{aligned}$$
(6)
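A simplified C++ sketch of this construction is given below. It assumes an external helper kmeansPartition (declared but not implemented here) that splits a set of point IDs into K groups with K-means, and the node layout is our own simplification of Algorithm 2.

#include <cstddef>
#include <memory>
#include <vector>

struct Node {
    std::vector<float> centroid;                 // full-length mean of the points below
    std::vector<std::unique_ptr<Node>> children; // empty for leaf nodes
    std::vector<int> pointIds;                   // at leaves: IDs of the points stored here
    // The PQ codes of those points are kept alongside (omitted here for brevity).
};

// Assumed helper: partitions `ids` into K groups by K-means and returns the groups.
std::vector<std::vector<int>> kmeansPartition(const std::vector<std::vector<float>>& data,
                                              const std::vector<int>& ids, int K);

// Recursively builds the tree; `ids` is assumed non-empty at every call.
std::unique_ptr<Node> buildTree(const std::vector<std::vector<float>>& data,
                                const std::vector<int>& ids, int K, int C) {
    auto node = std::make_unique<Node>();
    node->centroid.assign(data[ids[0]].size(), 0.f);       // c(S) = mean of the points in S
    for (int id : ids)
        for (std::size_t d = 0; d < node->centroid.size(); ++d)
            node->centroid[d] += data[id][d];
    for (float& v : node->centroid) v /= static_cast<float>(ids.size());

    if (static_cast<int>(ids.size()) <= C) {               // small enough: make a leaf
        node->pointIds = ids;
        return node;
    }
    for (auto& group : kmeansPartition(data, ids, K))      // recurse on each K-means cluster
        if (!group.empty())
            node->children.push_back(buildTree(data, group, K, C));
    return node;
}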

Searching for the approximate answers to a given query is a non-exhaustive procedure and proceeds as follows. Starting from the root node, we select the child whose centroid is closest to the query in terms of the Euclidean distance. We then repeat this process until a leaf node is reached. Algorithm 3 summarizes these steps; they are essentially the same as in EPQ, although they serve a different purpose here.

Algorithm 3
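The descent itself can be sketched in a few lines of C++, reusing the Node structure from the previous sketch; again, this is only an illustration of Algorithm 3, not the exact implementation.

#include <cstddef>
#include <limits>
#include <vector>

float squaredL2(const std::vector<float>& a, const std::vector<float>& b) {
    float s = 0.f;
    for (std::size_t d = 0; d < a.size(); ++d) { float diff = a[d] - b[d]; s += diff * diff; }
    return s;
}

// Greedy traversal: from the root, repeatedly move to the child whose (full-length)
// centroid is closest to the query, until a leaf node u is reached.
const Node* findNearestLeaf(const Node* root, const std::vector<float>& query) {
    const Node* u = root;
    while (!u->children.empty()) {
        const Node* best = nullptr;
        float bestDist = std::numeric_limits<float>::max();
        for (const auto& child : u->children) {
            float dist = squaredL2(query, child->centroid);
            if (dist < bestDist) { bestDist = dist; best = child.get(); }
        }
        u = best;
    }
    return u;   // the leaf whose points will be re-ranked with the ADC distance
}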

Once a leaf node is reached, all the data points associated with this node are considered as candidate answers after a re-ranking process. The goal of this re-ranking process is to select the top candidates closest to the query by using the ADC distance. By definition [16], the ADC distance between a query \( r \in \mathbb {R}^{D}\) and a data point x is approximated by the distance between r and the quantized version of x. This implies \(\mathbf {d}(r,x) \approx \hat{\mathbf {d}}(r, x)\), where \(\hat{\mathbf {d}}(r, x)\) is computed as follows:

$$\begin{aligned} \hat{\mathbf {d}}(r, x) = \sum _{j=1}^{m}{ \mathbf {d}(a_{j}(r), \hat{q}_{j}(a_{j}(x)))} \end{aligned}$$
(7)

where \(a_{j}(r)\) is the jth sub-vector of r and \(\hat{q}_{j}(a_{j}(x))\) denotes the corresponding quantized sub-vector of x (i.e., the sub-codeword closest to \(a_{j}(x)\) in the jth sub-codebook).

As the ADC distances depend on the query, which is only available in the online phase, we employ the same technique described in PQ [16] to pre-compute the distances. Specifically, we create m one-dimensional hash tables H, where \(H_{j}\) (with \(1 \le j \le m\)) stores the distances between the sub-vector \(a_{j}(r)\) and all the sub-codewords of \(\mathcal {L}_{j}\). The memory cost of storing these hash tables is just O(mk), which is significantly lower than the cost \(O(mV^{2})\) of EPQ, where \(V \gg k\) is the number of leaf nodes of each tree [37].
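For clarity, the following C++ sketch (reusing the SubCodebook alias from the earlier encoding sketch) shows how the m tables are filled for a query and how the ADC distance then reduces to m table lookups per candidate; the names buildDistanceTables and adcDistance are ours, and squared distances are accumulated as is standard for ADC lookups.

#include <cstdint>
#include <vector>

// H[j][i] = squared distance between the j-th sub-vector of the query and the
// i-th sub-codeword of L_j; filled once per query in O(kD).
std::vector<std::vector<float>> buildDistanceTables(const std::vector<float>& query,
                                                    const std::vector<SubCodebook>& codebooks,
                                                    int m, int k) {
    const int subDim = static_cast<int>(query.size()) / m;
    std::vector<std::vector<float>> H(m, std::vector<float>(k, 0.f));
    for (int j = 0; j < m; ++j) {
        const float* r_j = query.data() + j * subDim;
        for (int i = 0; i < k; ++i) {
            const float* c = codebooks[j].data() + i * subDim;
            for (int d = 0; d < subDim; ++d) {
                float diff = r_j[d] - c[d];
                H[j][i] += diff * diff;
            }
        }
    }
    return H;
}

// ADC distance of a database point: sum, over the m sub-spaces, of the pre-computed
// distance between the query sub-vector and the point's quantized sub-vector.
float adcDistance(const std::vector<std::vector<float>>& H,
                  const std::vector<std::uint8_t>& code) {
    float dist = 0.f;
    for (std::size_t j = 0; j < code.size(); ++j) dist += H[j][code[j]];
    return dist;
}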

It has been commonly observed that looking at the data points in a single leaf node is not sufficient to achieve the desired ANN accuracy. Hence, a simple solution [37] is to examine other leaf nodes in increasing order of distance from the current one. We adopt this technique in our work by pre-computing, during the tree construction, a list of the t leaf nodes whose centroids are closest to a given leaf node. This procedure is performed for every leaf node. As a result, each leaf node keeps a pointer list referring to its t closest leaf nodes in the tree. The complete ANN search procedure is sketched in Algorithm 4 (a corresponding code sketch is given below). For simplicity, the two processes of computing the ADC distances and re-ranking the shortlist (i.e., by using a max-heap structure) are not detailed there.

Algorithm 4
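Putting the pieces together, the core of Algorithm 4 can be summarized by the following C++ sketch, which scans the reached leaf and its t pre-computed neighbor leaves and keeps the N best candidates in a max-heap; the Leaf structure, the neighborLeaves field, and the function annSearch are our own simplifications, and adcDistance comes from the previous sketch.

#include <algorithm>
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

struct Leaf {
    std::vector<int> pointIds;                        // IDs of the points stored at this leaf
    std::vector<std::vector<std::uint8_t>> codes;     // their PQ codes (one per point)
    std::vector<const Leaf*> neighborLeaves;          // t closest leaves, pre-computed offline
};

// Returns the IDs of the (at most) N candidates with the smallest ADC distance to the query.
std::vector<int> annSearch(const Leaf* start,
                           const std::vector<std::vector<float>>& H,  // from buildDistanceTables
                           int N) {
    using Cand = std::pair<float, int>;               // (ADC distance, point ID)
    std::priority_queue<Cand> heap;                   // max-heap on distance, keeps the N best
    std::vector<const Leaf*> toVisit = {start};
    toVisit.insert(toVisit.end(), start->neighborLeaves.begin(), start->neighborLeaves.end());
    for (const Leaf* leaf : toVisit) {
        for (std::size_t i = 0; i < leaf->pointIds.size(); ++i) {
            float dist = adcDistance(H, leaf->codes[i]);
            if (static_cast<int>(heap.size()) < N) heap.push({dist, leaf->pointIds[i]});
            else if (dist < heap.top().first) { heap.pop(); heap.push({dist, leaf->pointIds[i]}); }
        }
    }
    std::vector<int> shortlist;
    while (!heap.empty()) { shortlist.push_back(heap.top().second); heap.pop(); }
    std::reverse(shortlist.begin(), shortlist.end()); // closest candidates first
    return shortlist;
}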

3.2 Memory and computation complexity

In this section, we carry out a detailed analysis of the memory requirements and computational complexity of the proposed method. For simplicity, let M denote the size of the input dataset.

Regarding the memory requirements, the main tasks during the offline phase of index creation are the construction of the hierarchical clustering tree and of the PQ sub-codebooks. Let a and b be the numbers of internal nodes and leaf nodes of the tree, respectively. The memory cost of storing the internal nodes is O(aD), as we store the centroid of the underlying data points at each node (see Algorithm 2). At the leaf nodes, the memory overhead is \(O(b(D+Cm))\), since we additionally store at most C quantization codes per leaf, each of which is a one-dimensional integer vector of length m. In the average case, the tree is expected to be balanced; the number of internal nodes is then \(a = \frac{K^{h}-1}{K-1}\) and the number of leaf nodes is \(b = K^{h}\), where h is the height of the tree, which can be computed as \(h = \log _{K}(\frac{M}{C})\).

In addition, the PQ sub-codebooks also need to be stored, but their memory requirement is just O(mkD), where k is the number of sub-codewords of each sub-codebook. In short, the overall memory overhead of the indexing structure is about \(O(aD + b(D+Cm) + mkD)\), which is small compared to the size of the dataset.
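For illustration only (a back-of-the-envelope estimate of ours, using the SIFT1M settings reported in Sect. 4: \(M = 10^{6}\), \(D = 128\), \(K = 16\), \(C = 100\), and \(m = D/d = 8\), \(k = 256\) as in Sect. 4.3): a roughly balanced tree has on the order of \(b \approx M/C = 10^{4}\) leaf nodes and \(a \approx b/(K-1) \approx 700\) internal nodes, so the stored centroids amount to about \((a+b)D \approx 1.4 \times 10^{6}\) floats (roughly 5.5 MB), the quantization codes kept at the leaves amount to \(Mm = 8 \times 10^{6}\) bytes (8 MB, one byte per sub-code since \(k = 256\)), and the sub-codebooks add only a few hundred kilobytes, whereas the raw descriptors occupy \(4MD = 512\) MB.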

Apart from the aforementioned storage cost, the proposed algorithm needs to maintain m one-dimensional hash tables \(\{H_{j}\}\) (with \(1 \le j \le m\)), which are used during the online querying phase. Each hash table stores the pre-computed distances between the query and the sub-codewords of one sub-codebook. The memory cost of storing these hash tables is just O(mk), as mentioned previously.

Concerning the computational complexity, we study the cost of tree training and query-time processing. First, the procedure of creating the sub-codebooks is the same as in the PQ technique [16]; the cost therefore consists of running a clustering algorithm (e.g., K-means) to build the m sub-codebooks. Next, the hierarchical clustering tree is built by recursively applying the K-means algorithm to the underlying data. It is worth noting that the dataset is iteratively divided into K sub-sets until the size of each sub-set is sufficiently small; hence, the running time of K-means drops significantly at the deeper levels, where the sub-sets are smaller. Nevertheless, when compared to the PQ technique [16], tree training remains more expensive computationally because K-means must be applied as many times as the dataset is recursively decomposed.

Regarding the query time (i.e., Algorithm 4), the computational cost of finding the nearest centroid of the query (namely, u in Algorithm 3) is about O(hD), where h is the height of the tree. Next, the m hash tables must be computed, each of which stores the pre-computed distances from the query to the sub-codewords of the corresponding sub-codebook; the complexity of this task is O(kD), with k the maximum number of sub-codewords. After that, the re-ranking step maintains the top N best answers to the query in the sense of the ADC distance. This process can be efficiently implemented by using a max-heap structure with a computational cost of \(O(\log (N))\) [16]. In a nutshell, the query time of the proposed method is \(O(hD + kD + t\log (N)+\log (N))\), where t is a user-defined parameter that controls the size of the list P (see Algorithm 3). In some circumstances, one may wish to add a verification step to filter the best answer out of the shortlist of N candidates; this adds O(ND) to the search time. We report the search performance of this case in detail in the next section.

4 Experimental results

4.1 Experimental settings

We conduct a number of experiments to assess the performance of our method in comparison with state-of-the-art techniques. These methods include randomized KD-trees [30], randomized K-medoids [29], the K-means tree [28], POC-trees [35], and EPQ [37]. All the methods are implemented in C++ without parallelization. In particular, we use the optimized implementations of the first three methods from the open-source FLANN library. The source code of the proposed method is also made publicly available online. All the tests are carried out on a standard computer with the following configuration: Windows 7, 16 GB RAM, Intel Core i7 (dual-core) 2.1 GHz.

Table 1 The datasets used in our experiments
Fig. 2

ANN search performance (speedups vs. precisions) for GIST (a) and VGG (b) features

All the methods are evaluated on public datasets, including ANN\(\_\)SIFT1M [16], ANN\(\_\)GIST1M [16], and VGG [43]. These datasets are commonly used in the literature for ANN search evaluation. In these datasets, we are interested in the database and test sets only: the database sets are used to build the clustering trees of all the studied algorithms, whereas the test sets are used for running the queries. Basic information about the sizes and features is presented in Table 1. The SIFT dataset is composed of one million 128-dimensional SIFT database vectors and 10,000 query vectors, whereas the GIST dataset contains one million 960-dimensional GIST vectors and 1000 queries. We also test the proposed method on a very high-dimensional feature space, hence the choice of the VGG dataset. This dataset is composed of 500K 4096-dimensional VGG vectors and 1000 queries. The VGG vectors have been extracted from the images of the Yahoo Flickr Creative Commons (YFCC) dataset by using the 16-layer CNN model proposed in [43].

As for the evaluation metrics, we follow the standard protocol of speedups versus precisions, as commonly used in the literature [28,29,30, 35]. The speedups are computed relative to brute-force search to avoid the impact of the computer configuration. For a given method \(\mathcal {A}\), the speedup \(S_{\mathcal {A}}\) is computed as follows:

$$\begin{aligned} S_{\mathcal {A}} = \frac{t_{Seq}}{t_{\mathcal {A}}} \end{aligned}$$
(8)

where \(t_{\mathcal {A}}\) is the response time of method \(\mathcal {A}\) for a given query and \(t_{Seq}\) is the response time of sequential (brute-force) search for the same query.

The search precision measures the fraction of queries for which a search method correctly returns the true nearest neighbor. Both the search speedups and precisions are averaged over n queries, where \(n=10{,}000\) for SIFT and \(n=1000\) for the GIST and VGG datasets.

For the parameter settings, all the baseline methods use the optimal settings presented in their original papers. The POC method [35] is evaluated both with a single tree and with 8 trees. Using multiple trees (i.e., 8 trees) is also applied to the randomized KD-trees [30] (namely flann-KD-8trees), the randomized K-medoids [29] (namely flann-RC-8trees), and the EPQ-trees [37]. Finally, both the K-means tree [28] (namely flann-kmeans-1tree) and the proposed method use a single tree. Specifically, two parameters are used by the tree in our method (K, the branching factor, and C, the maximum number of points at a leaf node), and their values (\(K = 16, C = 100\)) are fixed for all of our tests. In addition, we use two parameters t and N (see Algorithm 4) to control the operating points of speedups versus precisions. The shortlist returned by Algorithm 4 contains the best N answers for a given query. To compute the precision, a verification step is added to select the best candidate, as commonly done in previous works [16, 37]. For the other parameters of the proposed method, we provide further details on their choice and impact in the following sub-sections.

4.2 Results and discussions

We report the ANN search performance of all the methods on the three datasets in Figs. 2 and 3. To build the sub-quantizers, we set the two parameters m (i.e., the number of sub-codebooks) and k (i.e., the number of sub-codewords) adaptively for each dataset, as presented in Table 2. As a general rule, the value of m should increase with the dimensionality of the features. It is worth noting that in this work we use the natural order for partitioning the datasets into small sub-spaces; this splitting strategy is less effective than the structured order employed in [16].

Figure 2a shows the gains in speedup of all methods for search precisions ranging from 80 to 95%. As can be seen, the proposed method significantly outperforms the other methods at all operating points of search precision. On average, our method is about \(2\times \) faster than the recent EPQ method and roughly \(7\times \) faster than the best technique of FLANN (i.e., flann-RC-8trees). Especially for very high search precisions (> 90%), the proposed method still achieves far superior performance compared with the others. To the best of our knowledge, this is the first time that such excellent results on GIST features have been reported.

Table 2 Parameters used for building the sub-quantizers

The same behavior can be observed in Fig. 2b for the very high-dimensional VGG features. Although the EPQ technique works quite efficiently in this case, it is still outperformed by the proposed method, with a gap in speedup from \(1.3\times \) to \(2\times \). The reason why ours is superior to EPQ can be attributed to the omission of the checking step used in EPQ to discard the points that have already been visited (i.e., due to the duplicates of the EPQ codes \(\mathbf {q}(x)\)). In addition, because all the centroids of our clustering tree have the same full length as the original vectors, the tree traversal is driven more accurately than when sub-vectors are used as in EPQ.

Fig. 3

ANN search performance (speedup/precision) for 1M SIFT dataset

It is also worth mentioning that the number of sub-codewords (\(k=128\)) used for the VGG features is relatively low compared with the GIST features. Although setting a higher value of k would give better coding quality, it is important not to increase the computational overhead of computing the ADC distances in high-dimensional spaces. We come back to the analysis of these parameter settings later in this section.

Figure 3 reports the search performance for the SIFT dataset. As can be seen, the proposed method is more or less on the same level of speedup as EPQ, but is again far superior to the other methods. To explain the comparable performance of EPQ and our method, we have experimentally observed that the number of data points to explore in SIFT is much smaller than for high-dimensional features like GIST and VGG. As a result, the computational overhead of computing the ADC distances in Algorithm 4 (i.e., ANNSearch) partially accounts for the drop in search speedup. This result is interesting in that it shows that our method is particularly well suited to high-dimensional feature spaces.

Comparison with PQ-based methods The last experiment in this part reports a short analysis of the search efficiency of PQ-based methods, including PQ integrated with an inverted file structure (IVFADC) [16], the product quantization tree (PQT) [49], the product split tree (PS-tree) [5], and the proposed method. For this purpose, we choose FLANN as the baseline method, and all other methods are compared relative to FLANN on the SIFT dataset. The benefit of this protocol is to reduce the impact of the computer configurations on which the methods have been run. The results reported here have mainly been reproduced from the corresponding works. It is worth noting that all the methods have been implemented in C/C++, and thus the evaluation should be reasonably objective.

Table 3 Search performance evaluation with PQ-based methods (SIFT1M)

Table 3 shows that the proposed method performs best, with speedups in the range of \(2.5\times \) to \(3.0\times \) over FLANN for search precisions of 80–95%. The PS-tree and IVFADC come in second place and are quite competitive with each other. These results confirm the efficiency of our method, even though we have not yet taken the optimization of the PQ technique into account.

4.3 Impact of parameter settings

In this part, we empirically analyze the effect of the parameters concerning the construction of the sub-quantizers or sub-codebooks. The first parameter (d) is the dimensionality of each sub-space, and the second one (k) is the number of sub-codewords in each sub-codebook. Naturally, these parameters behave differently for different data distributions due to the intrinsic variation and characteristics of each dataset. Hence, we have studied various settings of these two parameters on the three datasets GIST, VGG, and SIFT.

Figure 4 shows the variation of the search performance on GIST features for three pairs of (d, k). One can see that the three curves behave consistently, and the curve on top corresponds to the pair \((d=48, k=256)\). These plots also suggest that using a lower value of either d or k does not give better results. The main reason is probably the increase in the number of sub-codebooks, which adds extra computational cost to the re-ranking step.

Fig. 4

The sensitivity of the parameter settings (d, k) for the GIST dataset

Fig. 5

The effect of the two parameters (d, k) for the VGG dataset

Figure 5 shows the sensitivity of (d, k) on the VGG features. This time the three curves behave a bit differently in the intermediate range of precisions (< 82.5%), but they are quite close to each other at higher precisions. The best performance is obtained with \(d=64, k = 128\). Setting higher values of either d or k gives worse results. Noticeably, using a higher number of sub-codewords \((k=256)\) actually hurts the search speedups in this very high-dimensional space because the complexity of computing the ADC distances is doubled.

Lastly, Fig. 6 shows the impact of (d, k) on the SIFT features. It can be observed that all the curves are very similar in behavior at all operating points of search precision. The best results have been achieved with \(d = 16, k = 256\). As the dimensionality of the SIFT features is relatively low, using many sub-codebooks (\(d=8\)) seems not to be a good choice. Indeed, several works [16, 17] have also reported that using 8 sub-codebooks (\(d=16\)) is a reasonably good trade-off.

Fig. 6

Impact of parameter settings for SIFT dataset

In conclusion, the number of sub-quantizers should be set adaptively to the scale and dimensionality of the data distribution at hand. Generally, higher-dimensional feature spaces require a larger number of sub-codebooks to reduce the quantization errors. There are also several attempts [9, 17, 22] that try to dynamically fit the allocated bit budget to the underlying data distribution. Hence, one could also integrate these optimized PQ variants (e.g., OPQ, LOPQ, DSPQ) with the hierarchical clustering tree to push the search performance even further. In addition, setting a proper number of sub-codewords is also important for good performance. However, it is not trivial to automatically tune this parameter for a given data distribution.

4.4 Timings report

Table 4 reports the running times of the proposed method for the different datasets. The size of each dataset is as described previously. The timings were averaged over n queries and 10 iterations, where \(n = 10{,}000\) for the SIFT features and \(n = 1000\) for the GIST and VGG datasets. The goal is to measure how fast a query can be answered provided that the search precision is sufficiently high (e.g., 90%). As can be seen in Table 4, the proposed method takes about 0.411 ms to answer such a query on the SIFT dataset. The timings are higher for the GIST and VGG features because of the increase in dimensionality, but they are still quite efficient.

Table 4 A report of running times of the proposed method

5 Conclusion

In this work, we have presented an optimized extension of the EPQ indexing method. Like EPQ, we exploit the strengths of the hierarchical clustering tree and product quantization. The novel contribution of our method is two-fold. First, instead of training multiple clustering trees (i.e., one tree for each sub-space), we create only a single, finer tree to globally represent all the data points. This tree plays the role of an inverted index and drives the search more accurately, as the centroids are of full length at each level of the decomposition. In addition, the quantized codes (i.e., the PQ short codes) are not duplicated at the leaf nodes as in EPQ; hence, the search is performed more efficiently, without the need to check for items that have already been visited. Second, the proposed method explores the benefits of using ADC distances to build the shortlist of answers for a given query. Compared with SDC distances, the use of ADC distances gives higher search precision and requires much less memory, at the cost of increased computational overhead.

The proposed method has been validated in a number of experiments on several well-known datasets. The experimental results demonstrate a significant improvement in search performance in comparison with other state-of-the-art methods. Nonetheless, the current work has not yet investigated the impact of the mutual independence assumption among the sub-spaces. Work in this direction would be interesting to study in follow-up research.