1 Introduction

Similarity searching consists of retrieving, from a database, the objects most similar to a given query. This problem is also known as nearest neighbor searching, a crucial task in several areas such as multimedia retrieval (e.g., images), computational biology, and pattern recognition. The similarity between objects can be measured with a distance function, usually considered expensive to compute, defined by experts in a specific data domain. Thus, the main objective of several proposed indexes is to reduce the number of distance evaluations needed to retrieve the objects in a database most similar to a given query.

Similarity searching can be mapped into a metric space problem. A metric space is a pair \((\mathbb {X}, d)\), where \(\mathbb {X}\) is the universe of objects and d is the distance function \(d:\mathbb {X} \times \mathbb {X} \rightarrow \mathbb {R}^{+} \cup \{0\}\). The distance satisfies, for all \(x,y,z \in \mathbb {X}\), the following properties: reflexivity \(d(x,y)=0\) iff \(x=y\), symmetry \(d(x,y)=d(y,x)\), and the triangle inequality \(d(x,y)\le d(x,z)+d(z,y)\). In practical applications, we work with a finite database \(\mathbb {U} \subseteq \mathbb {X}\) of size \(|\mathbb {U}|=n\).
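As an illustration (ours, not part of the original formulation), the familiar Euclidean distance satisfies these axioms; a quick Python sketch spot-checks them on a few sample points:

```python
import itertools
import math

def euclidean(x, y):
    """Euclidean distance, a classic metric on R^k."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Spot-check the metric axioms on a handful of 2-D points.
pts = [(0.0, 0.0), (1.0, 0.0), (0.5, 2.0)]
assert all(euclidean(p, p) == 0.0 for p in pts)            # reflexivity
for x, y, z in itertools.permutations(pts, 3):
    assert euclidean(x, y) == euclidean(y, x)              # symmetry
    # triangle inequality (small epsilon guards float rounding)
    assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z) + 1e-12
```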

Basically, there are two kinds of queries: the range query \((q,r)\) and the k-nearest neighbor query \(kN\!N(q)\). The first one retrieves all the objects within a given radius r measured from q, that is, \((q,r) = \{u \in \mathbb {U}, d(q,u) \le r\}\). The other one retrieves the k objects in \(\mathbb {U}\) that are the closest to q. Formally, \(|kN\!N(q)|=k\), and \(\forall ~u \in kN\!N(q), v \in \mathbb {U} \setminus kN\!N(q)\), \(d(u,q)\le d(v,q)\).
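For reference, both query types can be answered by a brute-force linear scan; the following baseline sketch (function names are ours, not from the paper) makes the definitions concrete:

```python
import heapq

def range_query(U, d, q, r):
    """Range query (q, r): every u in U with d(q, u) <= r."""
    return [u for u in U if d(q, u) <= r]

def knn_query(U, d, q, k):
    """kNN(q): the k objects of U closest to q."""
    return heapq.nsmallest(k, U, key=lambda u: d(q, u))

# Toy 1-D example under the absolute-difference metric.
U = [1, 4, 7, 10, 13]
d = lambda x, y: abs(x - y)
assert range_query(U, d, 6, 2) == [4, 7]
assert knn_query(U, d, 6, 2) == [7, 4]
```

Both scans cost n distance computations; the indexes discussed in this paper aim to do better.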

Several algorithms [8, 10, 11] have been proposed to answer these kinds of queries. Examples of indexes that are relevant to this work are the Fixed Queries Tree (FQT) [2] and the Permutation-Based Algorithm (PBA) [5]. Some indexes suffer from the well-known curse of dimensionality (that is, the searching effort increases as the intrinsic dimensionality of the dataset grows) [8]. In practice, some indexes are better suited to certain dimensionalities. For instance, the FQT is well suited for low dimensionality and the PBA for medium to high dimensionality. As the distance is considered expensive to compute, in this work we measure any cost in terms of the number of distance computations needed.

In this paper, we introduce an improvement on top of two classic metric space searching algorithms: a variant of the FQT (the Fixed Height Queries Tree, FHQT) and the PBA. The new index works well in both low and high dimensionality because it adequately combines the best of both.

The organization of this paper is as follows. Section 2 presents the basic concepts of metric space algorithms. Section 3 describes in detail the FHQT and PBA indexes, which are the basis of our work. Section 4 introduces our proposal in detail (indexing and searching). In Sect. 5, we experimentally validate our claims using both synthetic and real-world datasets; the former help us understand the effect of the parameters. In the last section we draw some conclusions and discuss future work.

2 Basic Concepts

Firstly, an algorithm builds some structure or index over the database \(\mathbb {U}\); then, when a query is given, the algorithm uses this structure to speed up the response time. Of course, traversing the index requires some distance computations. This process yields a set of non-discarded objects, and the query is then compared with all of these objects to answer the similarity query.

The answer to a similarity query can be exact or approximate. Approximate similarity searching is appealing when we require efficiency and accept losing some accuracy in exchange. This is especially relevant when we work on spaces of high intrinsic dimensionality. Algorithms that obtain exact answers can be classified into two groups, namely pivot-based algorithms (PB) and compact-partition algorithms (CP). While PB algorithms work well in low dimensions, CP algorithms work better in high dimensions. On the other hand, a good approach to solve similarity queries in an approximate fashion is to use PBAs, which are unbeatable in high and very high dimensions.

Pivot-Based Algorithms. A pivot-based algorithm chooses a set of pivots \(\mathbb {P}=\{p_1,p_2,\dots , p_j\} \subseteq \mathbb {U}, j=|\mathbb {P}|\). For each database element \(u \in \mathbb {U}\), the PB algorithm computes and stores all the distances between u and the members of \(\mathbb {P}\). The resulting set of distances \(\{d(p_1,u),d(p_2,u),\dots ,d(p_j,u)\}, \forall u \in \mathbb {U}\) is used for building the index.

For a range query \((q,r)\), the distances \(d(p_i,q) ~\forall p_i \in \mathbb {P}\) are computed. By virtue of the triangle inequality, for each \(u \in \mathbb {U}\), it holds that \(\max _{1\le i \le j}|d(q,p_i)-d(p_i, u)| \le d(q,u)\). Therefore, if an element u satisfies \(r < \max _{1\le i \le j}|d(q,p_i)-d(p_i, u)|\), then u can be safely discarded. Finally, every non-discarded element is directly compared with q and reported if it fulfills the range criterion.
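A minimal sketch of this filtering scheme (our own illustration; in a real index the table of object-pivot distances is computed once at construction time, not per query):

```python
def pivot_filter(U, d, pivots, q, r):
    """Pivot-based range query (q, r).

    Only len(pivots) distances to q are computed up front; the triangle
    inequality discards u whenever max_i |d(q, p_i) - d(p_i, u)| > r.
    """
    dq = [d(q, p) for p in pivots]                      # query-to-pivot distances
    table = {u: [d(p, u) for p in pivots] for u in U}   # normally precomputed
    candidates = [
        u for u in U
        if max(abs(dq[i] - table[u][i]) for i in range(len(pivots))) <= r
    ]
    # Every non-discarded element is verified with the real distance.
    return [u for u in candidates if d(q, u) <= r]

# Toy 1-D example: pivots at the extremes discard 0 and 9
# without computing their real distance to q = 4.
assert pivot_filter([0, 3, 5, 9], lambda x, y: abs(x - y), [0, 9], 4, 1) == [3, 5]
```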

Compact Partition Algorithms. A compact partition algorithm exploits the idea of dividing the space into compact zones, usually in a recursive manner, and storing a representative object (a “center”) \(c_i\) for each zone plus a few extra data that permit quickly discarding the zone at query time. During search, entire zones can be discarded depending on the distance from their cluster center \(c_i\) to the query q. Two criteria can be used to delimit a zone, namely hyperplane and covering radius.

3 Related Works

In this section we describe the two algorithms that we use as the basis of our work: the Fixed Height Queries Tree and the Permutation-Based Algorithm.

3.1 Fixed Height Queries Tree

The FHQT belongs to a family of indexes, all of them with about the same efficiency in terms of the number of distances computed. These indexes are known as the Fixed-Queries Tree (FQT) [2], Fixed Height FQT (FHQT) [1, 2], Fixed-Queries Array (FQA) [7], and Fixed Queries Trie (FQTrie) [4]. All of them partition \(\mathbb {U}\) into classes according to a distance, or range of distances, to a pivot p. In particular, the FHQT is built as follows. Firstly, a set of j pivots is chosen, the ranges of distances are established, and we associate a label l with each range. We start with pivot \(p_1\) and, for every distance range l, we pick the objects whose distance to \(p_1\) belongs to the range labeled l. For every non-empty subset, a tree branch with label l is generated. Next, we repeat the process recursively using the next pivot. Each pivot is used for all subtrees in the same level; therefore, pivot \(p_2\) is used for all subtrees in the second level, \(p_3\) for the third level, and so on. The height of the tree is j.

For a given query q and radius r, \(d(q,p_1)\) is computed and all branches whose distance range does not intersect \([d(q,p_1)-r,d(q,p_1)+r]\) are discarded. The process is repeated recursively for the branches not yet discarded, using the next pivot. Hence, a set of candidate elements is obtained. Finally, all the candidates are directly compared with q to answer the similarity query.
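The construction and search just described can be sketched as follows (our simplification: integer-valued distances are used directly as branch labels, i.e. each label is a single distance value rather than a range):

```python
def build_fhqt(objects, d, pivots, level=0):
    """FHQT construction: every subtree at the same depth uses the same
    pivot, so the height of the tree equals len(pivots)."""
    if level == len(pivots):
        return objects                                 # leaf: bucket of objects
    branches = {}
    for u in objects:
        branches.setdefault(d(pivots[level], u), []).append(u)
    return {label: build_fhqt(group, d, pivots, level + 1)
            for label, group in branches.items()}

def search_fhqt(node, d, pivots, q, r, level=0, out=None):
    """Range query (q, r): prune every branch whose label falls outside
    [d(q, p) - r, d(q, p) + r], then verify the surviving candidates."""
    if out is None:
        out = []
    if level == len(pivots):
        out.extend(u for u in node if d(q, u) <= r)    # compare candidates
        return out
    dq = d(q, pivots[level])
    for label, child in node.items():
        if dq - r <= label <= dq + r:                  # branch intersects query
            search_fhqt(child, d, pivots, q, r, level + 1, out)
    return out

# Toy 1-D example with pivots 0 and 9.
tree = build_fhqt(list(range(10)), lambda x, y: abs(x - y), [0, 9])
assert sorted(search_fhqt(tree, lambda x, y: abs(x - y), [0, 9], 4, 1)) == [3, 4, 5]
```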

3.2 Permutation-Based Algorithm (PBA)

A permutation-based algorithm can be described as follows: Let \(\mathbb {P} \subset \mathbb {U}\) be a set of permutants with m members. Each element \(u \in \mathbb {U}\) induces a preorder \(\le _u\) given by the distance from u towards each permutant, defined as \(y \le _u z \Leftrightarrow d(u,y)\le d(u,z)\), for any pair \(y,z \in \mathbb {P}\).

Let \(\varPi _u = i_1, i_2, \dots , i_{m}\) be the permutation of u, where permutant \(p_{i_{j}} \le _u p_{i_{j+1}}\). Permutants at the same distance take an arbitrary but consistent order. For every object in \(\mathbb {U}\), its preorder of \(\mathbb {P}\) is computed and encoded as a permutation. The resulting set of permutations constitutes the index, since a PBA does not store any distance.

Given a query \(q \in \mathbb {X}\), the PBA search algorithm computes its permutation \(\varPi _q\) and compares it with all the permutations stored in the index. Then, the dataset \(\mathbb {U}\) is traversed in increasing order of permutation dissimilarity, directly comparing the objects in \(\mathbb {U}\) with the query using the distance d of the particular metric space. There are many similarity measures between permutations. One of them is the \(L_s\) family of distances, which obeys Eq. (1), where \(\varPi ^{-1}(i_{j})\) denotes the position of permutant \(p_{i_{j}}\) within permutation \(\varPi \).

$$\begin{aligned} L_s(\varPi _u,\varPi _q) = \sum _{1\le j \le |\mathbb {P}|}|\varPi _u^{-1}(i_j) - \varPi _q^{-1}(i_j)|^s \end{aligned}$$
(1)

There are some special values for s. If \(s=2\), the distance is known as Spearman Rho (\(S_{\!\rho }\)); for \(s=1\), it is called Spearman Footrule (\(S_{\!f}\)).
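The permutation machinery and the \(L_s\) family of Eq. (1) can be sketched as follows (our own Python rendering; ties are broken by permutant index to keep the order arbitrary but consistent):

```python
def permutation(u, permutants, d):
    """Permutation of u: permutant indices sorted by distance to u,
    with ties broken by index (arbitrary but consistent)."""
    return sorted(range(len(permutants)), key=lambda i: (d(u, permutants[i]), i))

def inverse(pi):
    """inv[i] = position of permutant i within permutation pi."""
    inv = [0] * len(pi)
    for pos, i in enumerate(pi):
        inv[i] = pos
    return inv

def l_s(pi_u, pi_q, s=1):
    """L_s distance of Eq. (1): s=1 is Spearman Footrule,
    s=2 is Spearman Rho."""
    return sum(abs(a - b) ** s for a, b in zip(inverse(pi_u), inverse(pi_q)))

# Toy 1-D example with permutants located at 0, 6, and 10.
d = lambda x, y: abs(x - y)
assert permutation(1, [0, 6, 10], d) == [0, 1, 2]
assert permutation(7, [0, 6, 10], d) == [1, 2, 0]
assert l_s([0, 1, 2], [1, 2, 0], s=1) == 4   # Spearman Footrule
```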

4 Fixed Height Queries Tree Permutation (FHQTP)

We propose a new efficient metric index as a result of a clever combination of the best features of the two aforementioned indexes. This way, at search time we can take advantage of both the search pruning of the FHQT and the prediction capability of the PBA. We use the same pivots of the FHQT as the permutants of our PBA, so we produce the PBA index with no extra distance computations.

4.1 Index Construction

The first stage of the FHQTP consists of building the classic FHQT, keeping all the distances computed during the process. Since we have computed the real distances between every object \(u \in \mathbb {U}\) and each pivot \(p \in \mathbb {P}\), we use them to compute the permutation of each u. The FHQTP stores all these permutations in the index and, after computing them, discards the distances. Finally, we have a complete FHQT where each element has its permutation. The construction cost of the FHQTP is the same as that of the classic FHQT.

If we have memory restrictions, we can save some space by not storing the central part of the permutations, because these permutants have a lesser contribution to the PBA prediction power. This was previously shown in [9].

4.2 Searching

To solve a range query \((q,r)\), we divide the process into two steps. In the first one, we use the FHQTP as a standard FHQT in order to discard non-relevant objects. However, once a leaf is reached, instead of computing the distance d between the query and all the objects stored in the leaf, we put them in a candidate list (since they were not discarded by the FHQT). This process continues, adding every non-discarded object to the candidate list.

In the second step, we compute \(\varPi _{q}\) and sort the candidate list in increasing order of permutation distance (e.g., Spearman Rho or Footrule), in a similar way to the standard PBA, but considering only the resulting list of candidates. Thanks to the PBA's prediction property, it suffices to review just a small percentage of the top of the list.
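The second step can be sketched as follows (our illustration; `fraction` is a tunable parameter of the sketch, not a value fixed by the paper):

```python
def footrule(pi_u, pi_q):
    """Spearman Footrule: L_1 over permutant positions."""
    return sum(abs(pi_u.index(i) - pi_q.index(i)) for i in pi_u)

def fhqtp_step2(candidates, perms, pi_q, d, q, r, fraction=0.5):
    """Order the FHQT candidates by permutation distance to the query
    and verify only the top fraction with the real metric d."""
    order = sorted(candidates, key=lambda u: footrule(perms[u], pi_q))
    top = order[:max(1, int(len(order) * fraction))]
    return [u for u in top if d(q, u) <= r]

# Toy 1-D example: candidate 8 is ranked last and never verified.
perms = {3: [0, 1, 2], 5: [0, 1, 2], 8: [2, 1, 0]}
result = fhqtp_step2([3, 5, 8], perms, [0, 1, 2],
                     lambda x, y: abs(x - y), 4, 1, fraction=0.7)
assert result == [3, 5]
```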

4.3 Example

Figure 1 depicts an example of our technique. The database consists of the small circles, and we use the Euclidean distance. Pivots/permutants are the black filled circles. The bold circle centered at q represents a range query. Each dashed circle, centered around a pivot/permutant, represents a branch, and the elements inside it are part of the same branch. Note that, for all pivots/permutants, we consider that every element beyond the third branch belongs to branch four. For example, object \(u_1\) is placed in the first root branch as it is located inside the first concentric circle around \(p_1\) (notice that \(u_1\) is the only one), and \(u_1\) is in the third ring of both \(p_2\) and \(p_3\). The remaining elements are subject to the same process. Finally, the resulting FHQTP is composed of the FHQT shown in Fig. 1(b) and the corresponding permutations in Fig. 1(c).

Fig. 1. Sketch of our proposal.

In Fig. 1(a), the query q is located inside rings 2, 4, and 2 of pivots/permutants \(p_1, p_2\), and \(p_3\), respectively; its permutation is \(\varPi _q = 1~3~2\). Thin lines in Fig. 1(b) show the excluded branches. This means that elements \(u_{1}, u_{5}, u_{6}, u_{7}, u_{8}\), and \(u_{11}\) are not discarded when solving the range query \((q,r)\). According to their permutations (Fig. 1(c)), the permutation distances are \(S_{\!f}(q,u_{1}) = 2, S_{\!f}(q, u_5)= 0, S_{\!f}(q, u_6)= 0, S_{\!f}(q, u_7)= 2, S_{\!f}(q, u_8)= 2\), and \(S_{\!f}(q, u_{11})= 4\). Therefore, the possibly relevant objects can be reviewed in the following order: \(u_{5}, u_{6}, u_{1}, u_{7}, u_{8}\), and \(u_{11}\). Note that if we review a fraction (as the authors describe in [6]) from the top of this list, we have a good chance of retrieving the complete answer, and the chances increase as we review more non-discarded elements.

4.4 Partial vs Full Permutation

Since we process the candidate list with the PBA, we propose to evaluate (and to keep) just a small part of each permutation. According to [9], the most important parts of a permutation are the first and last permutants. For example, if we divide each permutation into 3 parts and dismiss the central one, we use just 2/3 of each permutation and can still obtain a good order in which to review the candidate list. In our example, in Fig. 1(c), we have \(u_1=1~3, u_2=1~3, u_3=1~3, u_4=1~3, u_5=1~2\), and so on.
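A sketch of the partial variant (our illustration; with fewer than three permutants the permutation is kept whole):

```python
def partial_permutation(pi):
    """Keep the first and last thirds of a permutation, dropping the
    central part, which contributes least to the prediction power [9]."""
    third = len(pi) // 3
    return pi[:third] + pi[-third:] if third else pi

# With 3 permutants, 1 3 2 becomes 1 2 (as for u_5 in Fig. 1(c)).
assert partial_permutation([1, 3, 2]) == [1, 2]
assert partial_permutation([1, 2, 3, 4, 5, 6]) == [1, 2, 5, 6]
```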

5 Experiment Results

We tested our proposal using two kinds of metric databases. The first one is composed of several synthetic datasets that consist of uniformly distributed vectors in the unit hypercube, with dimensions from 6 to 16, using the Euclidean distance. The second one consists of three real-world datasets, namely, a set of English words using the Levenshtein distance, a set of images from NASA, and a set of images from CoPhIR (Content-based Photo Image Retrieval). For the two image datasets we also use the Euclidean distance.

Since \(kN\!N\) queries are more appealing than range queries in practical applications, during the experimental evaluation, we simulate \(kN\!N\) queries with range queries using a radius that retrieves exactly the number of neighbors we need.
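This simulation can be sketched as a linear scan (an illustration of the evaluation protocol only; in the experiments the radius plays the same role inside the index):

```python
def knn_via_range(U, d, q, k):
    """Simulate kNN(q) with a range query whose radius is the distance
    from q to its k-th nearest neighbor."""
    radius = sorted(d(q, u) for u in U)[k - 1]
    return [u for u in U if d(q, u) <= radius]

# Toy 1-D example: the 2-NN of q = 5 are 4 and 7.
assert knn_via_range([1, 4, 7, 10], lambda x, y: abs(x - y), 5, 2) == [4, 7]
```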

5.1 Synthetic Datasets

Each synthetic dataset is composed of 100,000 uniformly distributed vectors. These datasets allow us to control the intrinsic dimensionality of the space and analyze how the curse of dimensionality affects our proposal. We tested \(kN\!N(q)\) queries with a query set of size 100 in dimensions varying from 6 to 16.

The performance of our technique is shown in Fig. 2. Notice that our proposal computes fewer distances than the original FHQT technique. In high dimensions, where the FHQT is affected by the curse of dimensionality, our proposal still retrieves the most similar objects quickly. We repeated these experiments using a different number of pivots per dimension: for \(dim=6\) we use \(m=6\), and for \(dim=12\) we use \(m=12\). For each dimension there are 4 lines (FHQT, full FHQTP, partial FHQTP, and the original PBA). Our proposal makes fewer distance computations than the FHQT and the PBA, and the partial variant is better than the FHQT, though not better than the full FHQTP.

Fig. 2. Comparison of search performance in the unitary cube synthetic spaces, 100,000 objects.

5.2 Real World Datasets

We use three real datasets. The first one is an English dictionary consisting of 69,069 words under the Levenshtein distance, also known as the edit distance. This distance is the minimum number of single-character edit operations (character insertion, deletion, or substitution) needed to convert one word into the other.

The second dataset contains 40,150 20-dimensional feature vectors generated from images downloaded from NASA, where duplicate vectors were eliminated. Any quadratic form can be used as a distance, so we chose the Euclidean distance as the simplest meaningful alternative.

Finally, the third dataset is a subset of 100,000 images from CoPhIR [3]. For each image, the standard MPEG-7 image features have been extracted. For the aforementioned reason, in this space we also use the Euclidean distance.

Fig. 3. Performance of FHQTP using an English dictionary.

Fig. 4. Performance of FHQTP on real databases.

In Fig. 3, we show the performance of our technique for the English dictionary. Notice that our technique has an excellent performance for 1, 2, and 3 \(N\!N\). Figure 4 shows that the FHQT is not competitive on real databases: Fig. 4(a), on the left, is for the NASA database, and Fig. 4(b), on the right, is for the CoPhIR dataset. Both figures show the performance of our proposal for different values of k in \(kN\!N\) searches. We can get the answers up to 10 times faster. In these cases we use \(m=12\).

Fig. 5. Performance of FHQTP on real databases, \(1N\!N\) queries.

Finally, as shown in Fig. 5, we obtain the nearest neighbors faster than the PBA when reviewing an incremental fraction of the list, both for the NASA database (Fig. 5(a)) and for the CoPhIR dataset (Fig. 5(b)).

6 Conclusions and Future Work

In this paper a new index is proposed. It consists of merging the best features of two well-known algorithms for metric spaces: the FHQT (Fixed Height Queries Tree) and the Permutation-Based Algorithm (PBA). In the plain FHQT, once a query has decided which branches to follow and arrives at a leaf, all the objects in that leaf must be compared with the query. Instead, we reuse the distance computations performed at construction time to build the permutation of each object. At query time, we use the FHQT to build a candidate list and then, using the PBA technique, review this list in a new order. This combination is a winning approach: it has an excellent performance because it works well in all dimensions.

For future work, we are interested in designing an algorithm to solve the k-nearest neighbor query directly with our proposal, and in extending this technique to other algorithms such as the FQA (Fixed Queries Array).