Query filtering using two-dimensional local embeddings
Introduction
In general, metric indexes partition the data on the basis of the distances to one or more reference objects (pivots) so that, at query time, some partitions can be included or excluded from the search without the need to calculate the actual distances between the query and the data objects within that partition. The triangle inequality used together with the knowledge of the distances between the pivots and the data/query objects allows computing upper and lower bounds for these distances.
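The classic pivot-based exclusion step can be sketched as follows. This is a minimal illustration, not the paper's code; the function names are ours. By the triangle inequality, |d(q, p) − d(s, p)| is a lower bound for d(q, s), so any object s for which this bound exceeds the query radius t can be excluded without computing d(q, s):

```python
import math

def can_exclude(d_q_p: float, d_s_p: float, t: float) -> bool:
    """True if the triangle inequality proves d(q, s) > t,
    using only the pre-computed pivot distances."""
    return abs(d_q_p - d_s_p) > t

# Toy 2D Euclidean example: pivot p, object s, query q, radius t = 1.0
p, s, q = (0.0, 0.0), (10.0, 0.0), (2.0, 0.0)
d_q_p = math.dist(q, p)   # 2.0
d_s_p = math.dist(s, p)   # 10.0
assert can_exclude(d_q_p, d_s_p, 1.0)  # |2 - 10| = 8 > 1, so s is excluded
```

The stored pivot distances thus act as a cheap surrogate: only objects whose bound does not exceed t require a real distance computation.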
The concept of local pivoting is to partition a metric space so that each element in the space is associated with precisely one of a fixed set of pivots. The idea is that each object of the data set is associated with the pivot that is best suited to filter that particular object if it is not relevant to a query, maximising the probability of excluding it from a search. The notion does not in itself lead to a scalable search mechanism, but instead gives a good chance of exclusion based on a tiny memory footprint and a fast calculation. It is therefore most useful in contexts where main memory is at a premium, or in conjunction with another, scalable, mechanism.
In this paper we apply similar reasoning to metric spaces that possess the four-point property [1], which notably include Euclidean, Cosine, Triangular, Jensen–Shannon, and Quadratic Form spaces. This property allows computing bounds for the actual distance that are tighter than those obtained using the triangle inequality [2]. We show a novel way of exploiting this situation: each element of the space can be associated with two reference objects, and a four-point lower-bound property is used instead of the simple triangle inequality. The probability of exclusion is strictly greater than with simple local pivoting; the space required per object and the calculation are again tiny in relative terms. Specifically, we store each object using a tuple of four values, as follows. From a finite metric space (S, d), a relatively small set P of reference objects is selected. For all p_i, p_j ∈ P, the distance d(p_i, p_j) is calculated and stored. For each element s in S, a single pair of reference objects (p_i, p_j) is selected. The distances d(s, p_i) and d(s, p_j) are calculated and used together with d(p_i, p_j) to isometrically project the objects s, p_i, p_j into the 2D Euclidean vectors s', p_i', p_j', respectively. The triangle inequality guarantees that such an isometric embedding exists; moreover, the coordinates of the vector s' can be easily computed by exploiting the distances to the selected pivots (see Section 3 for the details). Thus the space is represented as a set of tuples (i, j, s'), one per object, therefore requiring only a few bytes per object.
When a query q is executed, the distances d(q, p) for each p ∈ P are first calculated. At this point, considering any s ∈ S and its associated reference objects p_i, p_j, it is possible to compute a lower bound for the unknown distance d(q, s) with a cheap geometric calculation, without any requirement to access the original value s. By exploiting the knowledge of the distances to the pivots it is possible to compute the 2D projection q' of the point q with respect to the pivots p_i, p_j. The four-point property guarantees that the Euclidean distance between q' and s' is a lower bound for the actual distance d(q, s).
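The projection and the resulting lower bound can be sketched as follows. This is a hedged illustration under our own naming, not the paper's code: placing p_i at the origin and p_j at (d(p_i, p_j), 0), the triangle inequality guarantees a planar triangle with the three measured side lengths exists, and the apex coordinates of an object in the upper half-plane follow from the law of cosines.

```python
import math

def project(d_i: float, d_j: float, d_ij: float):
    """Map an object to the 2D plane from its distances d_i = d(s, p_i),
    d_j = d(s, p_j) and the inter-pivot distance d_ij = d(p_i, p_j)."""
    x = (d_i**2 + d_ij**2 - d_j**2) / (2.0 * d_ij)
    y = math.sqrt(max(0.0, d_i**2 - x**2))  # clamp guards rounding error
    return (x, y)

def four_point_lower_bound(q_xy, s_xy) -> float:
    # Distance between same-half-plane projections never exceeds the
    # true distance d(q, s) in spaces with the four-point property.
    return math.dist(q_xy, s_xy)

# Toy check in 3D Euclidean space (which has the four-point property):
d = math.dist
p_i, p_j = (0, 0, 0), (4, 0, 0)
s, q = (1, 2, 2), (3, 1, 0)
s_xy = project(d(s, p_i), d(s, p_j), d(p_i, p_j))
q_xy = project(d(q, p_i), d(q, p_j), d(p_i, p_j))
assert four_point_lower_bound(q_xy, s_xy) <= d(q, s)
```

At query time, an object s can be excluded from a range query with threshold t whenever `four_point_lower_bound(q_xy, s_xy) > t`, using only the stored tuple (i, j, s') and the n query–pivot distances.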
We show that the resulting mechanism can be very effective. However, for a selection of n reference points, there exist n(n−1)/2 pairs from which the representation of each data point can be selected. This number, of course, becomes rapidly very large even with modest increases in n. If for each element s of S we can find a particularly effective pair (p_i, p_j) within this large space, then this tiny representation of s can be used as a powerful threshold query filter. This exclusion mechanism leads to a sequential scan, which is virtually unavoidable in light of a recent conditional hardness result [3] for nearest-neighbour search: even in the approximate setup, for every δ > 0 there exists ε > 0 such that, even allowing polynomial preprocessing time, computing a (1 + ε)-approximation to the nearest neighbour requires time almost linear in n, the size of the database.
The above hardness result had long been suspected by the indexing community, where it is known as the curse of dimensionality. It is known, for example, that a metric inverted index [4] achieves high recall only if a substantial part of the candidate results is re-evaluated. We aim our approach at this final part of query filtering, or re-ranking.
The contributions of this paper are as follows:
- 1.
We show that the outlined mechanism is viable. For SISAP benchmark data sets [5] we show that exclusion rates of over 98% can be achieved using our small memory footprint and cheap calculations.
- 2.
An observation of the mechanism's behaviour in much higher-dimensional spaces leads to two different approximate mechanisms, applicable to range and nearest-neighbour search respectively. For both mechanisms, on a space which is completely intractable for metric indexing methods, we can reduce search cost by around 90% while returning around 90% of the correct results.
- 3.
We examine the problem of finding the best pair of reference points per datum. This can be done well, but expensively, by an exhaustive search of the pair space, whose cost is quadratic in the number of reference objects selected. We show that much cheaper heuristics are also effective.
- 4.
Finally, we show an example of how the mechanism can be used as a post-filter adjunct to another mechanism, describing its incorporation with the List of Clusters index. With a pragmatic selection of reference objects, no new distances need be measured at either construction or query time, yet the overall query cost can be roughly halved.
A preliminary version of this work appeared in [6]. The present contribution gives a more detailed description of the proposed approach and a substantially extended experimental evaluation. Moreover, it also investigates the use of our approach for approximate pre-filtering and for ranked nearest-neighbour queries.
The rest of the paper is structured as follows. Section 2 reviews related work and gives background information on the four-point property and lower bound. Section 3 discusses properties of the planar projection used in this work to map metric data to 2D Euclidean space. Section 4 presents our proposed search strategies that rely on local embedding of the data into 2D coordinate space. Section 5 presents results of a thorough experimental analysis of the proposed approaches. Section 6 draws conclusions. Table 1 summarises the notation used in this paper.
Section snippets
Background and related work
Pivot-based indexes have populated the metric indexing scene for a long time. A standard approach is to create a pivot table, obtained by pre-computing and storing the distances between data objects and some pivots (reference objects). Each object is then represented as the vector of its distances to the pivots. Therefore, a pivot table is just the direct product of one-dimensional projections, each obtained from a single pivot at a time. The object-pivot distance constraint [7], which is a direct
Planar projection and distribution of values in the 2D plane
In this work, we use the planar projection based on a pair of pivots; that is, we project all the points into the same 2D plane, since this choice guarantees that the Euclidean distance between two projected objects is a lower bound of the actual distance, and thus can be safely used for filtering purposes. The value of this is that, independently of the size of individual data values and the cost of the distance metric, any value can be represented, for a fixed choice of reference points, as a small 2D
Local embedding and search strategies
By exploiting the planar projection and the four-point planar lower bound, we propose a mechanism with the following properties:
- •
at pre-processing time:
- –
for a metric space (U, d) and a finite search space S ⊆ U, we first select a distinguished set P of n reference points. The value of n is chosen according to properties of the space, as discussed in Section 5.
- –
for each of the n(n−1)/2 pairs of points (p_i, p_j) ∈ P × P, the distance d(p_i, p_j) is calculated and stored in a lookup table
- –
for each s ∈ S, we select a
- –
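The pre-processing steps above can be sketched end to end. This is an illustrative pipeline under our own naming, not the paper's implementation; in particular, the pair-scoring criterion used here (largest mean lower bound against a small sample of training queries) is a stand-in assumption, since the actual selection strategies are those discussed in the paper, and the exhaustive pair scan shown is the expensive quadratic option for which cheaper heuristics exist.

```python
import itertools
import math

def project(d_i, d_j, d_ij):
    """Planar projection from distances to the two pivots."""
    x = (d_i**2 + d_ij**2 - d_j**2) / (2.0 * d_ij)
    return (x, math.sqrt(max(0.0, d_i**2 - x**2)))

def preprocess(data, pivots, dist, train_queries):
    n = len(pivots)
    # lookup table of inter-pivot distances
    d_pp = [[dist(pivots[i], pivots[j]) for j in range(n)] for i in range(n)]
    # query-to-pivot distances for the training sample
    d_qp = [[dist(q, p) for p in pivots] for q in train_queries]
    table = []
    for s in data:
        d_sp = [dist(s, p) for p in pivots]
        best = None
        # exhaustive scan over all n(n-1)/2 pivot pairs
        for i, j in itertools.combinations(range(n), 2):
            s_xy = project(d_sp[i], d_sp[j], d_pp[i][j])
            # stand-in score: mean lower bound against the training queries
            score = sum(
                math.dist(s_xy, project(dq[i], dq[j], d_pp[i][j]))
                for dq in d_qp
            )
            if best is None or score > best[0]:
                best = (score, i, j, *s_xy)
        table.append(best[1:])  # tuple (i, j, x, y): a few bytes per object
    return table
```

At query time only the n query–pivot distances are computed; every stored tuple then yields its lower bound from the 2D coordinates alone.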
Experimental evaluation
To evaluate the potential of the proposed mechanism, experimental evaluation is performed on the following metric spaces:
- nasa, colors
The SISAP nasa and colors data sets [5] are two benchmarks for metric indexing and searching approaches. The nasa set contains 40,150 real vectors of dimension 20, each obtained from images downloaded from the NASA photo and video archive site. The colors set contains 112,682 feature vectors of dimension 112. Each vector is a colour histogram of a medical image. These data
Conclusions
We presented a method to obtain good distance bounds between a query and all the database elements using a minimally-sized representation comprising only two reference-object identifiers and two floating-point values per database object. The two floating-point values are the coordinates in a two-dimensional Euclidean space where a lower bound for the actual distance to a query can be efficiently computed. The combination of the very large space of object pairs available from a relatively
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was partially supported by the VISECH project (ARCO-CNR, CUP B56J17001330004), co-funded by the Tuscany region.
References (36)
- et al., Supermetric search, Inf. Syst. (2019)
- New formulation and improvements of the nearest-neighbour approximating and eliminating search algorithm (AESA), Pattern Recognit. Lett. (1994)
- et al., A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements, Pattern Recognit. Lett. (1994)
- et al., Pivot selection techniques for proximity searching in metric spaces, Pattern Recognit. Lett. (2003)
- et al., A compact space decomposition for effective metric indexing, Pattern Recognit. Lett. (2005)
- Theory and Applications of Distance Geometry (1953)
- Hardness of approximate nearest neighbor search
- et al., MI-File: using inverted files for scalable approximate similarity search, Multimedia Tools Appl. (2014)
- et al., Metric spaces library (2007)
- et al., Query filtering with low-dimensional local embeddings
- Similarity Search: The Metric Space Approach, Vol. 32
- Speeding up spatial approximation search in metric spaces, J. Exp. Algorithmics
- Some approaches to best-match file searching, Commun. ACM
- Proximity matching using fixed-queries trees
- Priority vantage points structures for similarity queries in metric spaces
- Extreme pivots for faster metric indexes
- Nearest neighbours search using the PM-tree
- Spaghettis: an array based algorithm for similarity queries in metric spaces