Elsevier

Information Systems

Volume 101, November 2021, 101808
Query filtering using two-dimensional local embeddings

https://doi.org/10.1016/j.is.2021.101808

Highlights

  • The four-point property possessed by many metric spaces is used to create very small object proxies

  • Distances between these proxies are a lower bound of the true distance

  • Fast approximate kNN query with high precision and recall can be achieved

  • The mechanism can be added to other search techniques with negligible extra cost

  • The mechanisms described are well suited to a GPU implementation

Abstract

In high dimensional data sets, exact indexes are ineffective for proximity queries, and a sequential scan over the entire data set is unavoidable. Accepting this, here we present a new approach employing two-dimensional embeddings. Each database element is mapped to the XY plane using the four-point property. The caveat is that the mapping is local: in other words, each object is mapped using a different mapping.

The idea is that each element of the data is associated with a pair of reference objects that is well suited to filter that particular object in cases where it is not relevant to a query. This maximises the probability of excluding that object from a search. At query time, the query is compared with a pool of reference objects, which allows it to be mapped to all the planes used by the data objects. Then, for each query/object pair, a lower bound of the actual distance is obtained. The technique can be applied to any metric space that possesses the four-point property, therefore including Euclidean, Cosine, Triangular, Jensen–Shannon, and Quadratic Form distances.

Our experiments show that for all the data sets tested, of varying dimensionality, our approach can filter more objects than a standard metric indexing approach. For low dimensional data this does not make a good search mechanism in its own right, as it does not scale with the size of the data: that is, its cost is linear with respect to the data size. However, we also show that it can be added as a post-filter to other mechanisms, increasing efficiency with little extra cost in space or time. For high-dimensional data, we show related approximate techniques which, we believe, give the best known compromise for speeding up the essential sequential scan. The potential uses of our filtering technique include pure GPU searching, taking advantage of the tiny memory footprint of the mapping.

Introduction

In general, metric indexes partition the data on the basis of the distances to one or more reference objects (pivots) so that, at query time, some partitions can be included or excluded from the search without the need to calculate the actual distances between the query and the data objects within that partition. The triangle inequality used together with the knowledge of the distances between the pivots and the data/query objects allows computing upper and lower bounds for these distances.
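As a minimal illustration (our own sketch, not code from the paper), the triangle inequality yields the following bounds on the unknown query–object distance from precomputed pivot distances:

```python
def pivot_bounds(d_q_p, d_s_p):
    """Triangle-inequality bounds on the unknown distance d(q, s),
    given the distances of query q and object s to a shared pivot p:
    |d(q,p) - d(s,p)| <= d(q,s) <= d(q,p) + d(s,p)."""
    lower = abs(d_q_p - d_s_p)
    upper = d_q_p + d_s_p
    return lower, upper

# For a range query with radius r, an object s can be excluded
# whenever lower > r, without ever computing d(q, s).
lo, hi = pivot_bounds(5.0, 1.0)  # lo = 4.0, hi = 6.0
```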

The concept of local pivoting is to partition a metric space so that each element in the space is associated with precisely one of a fixed set of pivots. The idea is that each object of the data set is associated with the pivot that is best suited to filter that particular object if it is not relevant to a query, maximising the probability of excluding it from a search. The notion does not in itself lead to a scalable search mechanism, but instead gives a good chance of exclusion based on a tiny memory footprint and a fast calculation. It is therefore most useful in contexts where main memory is at a premium, or in conjunction with another, scalable, mechanism.

In this paper we apply similar reasoning to metric spaces that possess the four-point property [1], which notably include Euclidean, Cosine, Triangular, Jensen–Shannon, and Quadratic Form spaces. This property allows computing bounds for the actual distance that are tighter than those obtained using the triangle inequality [2]. We show a novel way of exploiting this situation: each element of the space is associated with two reference objects, and a four-point lower-bound property is used instead of the simple triangle inequality. The probability of exclusion is strictly greater than with simple local pivoting; the space required per object and the calculation are again tiny in relative terms. Specifically, we store each object using a tuple of four values, as follows. From a finite metric space S, a relatively small set of reference objects P is selected. For all p_j, p_k ∈ P, the distance d(p_j, p_k) is calculated and stored. For each element s_i in S, a single pair of reference objects p_{i1}, p_{i2} is selected. The distances d(s_i, p_{i1}) and d(s_i, p_{i2}) are calculated and used together with d(p_{i1}, p_{i2}) to isometrically project the objects p_{i1}, p_{i2}, s_i to the 2D Euclidean vectors (0, 0), (0, d(p_{i1}, p_{i2})), and (x_{s_i}, y_{s_i}) respectively. The triangle inequality guarantees that such an isometric embedding exists; moreover, the coordinates of the vector (x_{s_i}, y_{s_i}) can be easily computed from the distances to the selected pivots (see Section 3 for the details). Thus the space S is represented as a set of tuples ⟨i1, i2, x_{s_i}, y_{s_i}⟩, indexed by i, therefore requiring only a few bytes per object.
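The planar coordinates follow from the law of cosines. The sketch below (our own illustration; the function name is not from the paper) places the first pivot at the origin and the second at (0, d(p_{i1}, p_{i2})), as in the construction above:

```python
import math

def project(d_s_p1, d_s_p2, d_p1_p2):
    """Isometrically embed the triangle (p1, p2, s) into the 2D plane,
    with p1 at (0, 0) and p2 at (0, d(p1, p2)); return (x, y) for s.
    The three distances must satisfy the triangle inequality."""
    # Coordinate along the p1-p2 axis, via the law of cosines:
    # d(s,p2)^2 = d(s,p1)^2 + d(p1,p2)^2 - 2*y*d(p1,p2)
    y = (d_s_p1**2 + d_p1_p2**2 - d_s_p2**2) / (2 * d_p1_p2)
    # Remaining coordinate; clamp tiny negatives from rounding error
    x = math.sqrt(max(d_s_p1**2 - y**2, 0.0))
    return x, y
```

For example, distances d(s, p1) = 5, d(s, p2) = 3, d(p1, p2) = 4 yield the point (3, 4), which is indeed at distance 5 from (0, 0) and distance 3 from (0, 4).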

When a query is executed, the distances d(q, p_j) for each p_j ∈ P are first calculated. At this point, considering any s_i ∈ S and the objects q, p_{i1}, p_{i2}, it is possible to compute a lower bound for the unknown distance d(q, s_i) with a cheap geometric calculation, without any requirement to access the original value s_i. By exploiting the knowledge of the distances to the pivots p_{i1} and p_{i2}, it is possible to compute the 2D projection (x_{q,i}, y_{q,i}) of the point q with respect to those pivots. The four-point property guarantees that the Euclidean distance between (x_{q,i}, y_{q,i}) and (x_{s_i}, y_{s_i}) is a lower bound for the actual distance d(q, s_i).
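A hypothetical sketch of the resulting filter step (names are ours, not the paper's), assuming the planar coordinates of the query and of each stored object are already available:

```python
import math

def planar_lower_bound(q_xy, s_xy):
    """Euclidean distance between the planar projections of q and s.
    By the four-point property this never exceeds the true d(q, s)."""
    (xq, yq), (xs, ys) = q_xy, s_xy
    return math.hypot(xq - xs, yq - ys)

def can_exclude(q_xy, s_xy, radius):
    # For a range query with threshold `radius`, object s is safely
    # excluded whenever the cheap lower bound exceeds the radius.
    return planar_lower_bound(q_xy, s_xy) > radius
```

The point of the design is that this test touches only four stored floats per object, never the original (possibly very large) data value or the (possibly very expensive) metric.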

We show that the resulting mechanism can be very effective. However, for a selection of m reference points, there exist m² pairs from which the representation of each data point can be selected. This number, of course, becomes rapidly very large even with modest increases in m. If for each element of S we can find a particularly effective pair p_{i1}, p_{i2} within this large space, then this tiny representation of S can be used as a powerful threshold query filter. This exclusion mechanism leads to a sequential scan, which is virtually unavoidable in light of a recent conditional hardness result for nearest neighbour search [3]: even in the approximate setting, for every δ > 0 there exist constants ϵ, c > 0 such that, with preprocessing time O(N^c), computing a (1+ϵ)-approximation to the nearest neighbour requires O(N^{1−δ}) time, where N is the size of the database.

The above hardness result has been suspected for a long time by the indexing community, and it has been named the curse of dimensionality. It is known, for example, that a metric inverted index [4] has high recall rates only if a substantial part of the candidate results is revised. We aim our approach at this final part of query filtering or re-ranking.

The contributions of this paper are as follows:

  • 1.

    We show that the outline mechanism is viable. For SISAP benchmark data sets [5] we show that exclusion rates of over 98% can be achieved using our small memory footprint and cheap calculations.

  • 2.

An observation of the mechanism's behaviour in much higher-dimensional spaces leads to two different approximate mechanisms, which can be applied to range and nearest-neighbour search respectively. For both mechanisms, on a space which is completely intractable for metric indexing methods, we achieve a reduction in search cost of around 90% while returning around 90% of the correct results.

  • 3.

    We examine the problem of finding the best pair of reference points per datum; this can be done well, but expensively, by an exhaustive search of the pair space; however the cost of this is quadratic with respect to the number of reference objects selected. We show that much cheaper heuristics are also effective.

  • 4.

    Finally, we show an example of how the mechanism can be used as a post-filter adjunct to another mechanism. We describe its incorporation with the List of Clusters index. Using a pragmatic selection of reference objects it can be ensured that no new distances are measured at either construction or query time, which can nonetheless lead to a halving of the overall query cost.

A preliminary version of this work appeared in [6]. The present contribution gives a more detailed description of the proposed approach and a substantially extended experimental evaluation. Moreover, it also investigates the use of our approach for approximate pre-filtering and ranked-order nearest-neighbour queries.

The rest of the paper is structured as follows. Section 2 reviews related work and gives background information on the four-point property and its lower bound. Section 3 discusses properties of the planar projection used in this work to map metric data to the 2D Euclidean plane. Section 4 presents our proposed search strategies, which rely on a local embedding of the data into 2D coordinate space. Section 5 presents the results of a thorough experimental analysis of the proposed approaches. Section 6 draws conclusions. Table 1 summarises the notation used in this paper.

Section snippets

Background and related work

Pivot based indexes have populated the metric indexing scene for a long time. A standard approach is creating a pivot table, obtained by pre-computing and storing the distances between data objects and some pivots (reference objects). Each object is then represented as the vector of its distances to the pivots. Therefore, a pivot table is just the direct product of one dimensional projections obtained from a single pivot at a time. The object-pivot distance constraint [7], which is a direct

Planar projection and distribution of values in the 2D plane

In this work, we use the planar projection based on α=0, that is we project all the points in the same 2D plane, since this choice guarantees that the Euclidean distance between two projected objects is a lower-bound of the actual distance, and thus can be safely used for filtering purposes. The value of this is that, independently of the size of individual data values and the cost of the distance metric, any value can be represented, for a fixed choice of reference points, as a small 2D

Local embedding and search strategies

By exploiting the planar projection and the four-point planar lower bound, we propose a mechanism with the following properties:

  • at pre-processing time:

    • for a metric space (U, d) and a finite search space S ⊆ U, we first select a distinguished set of m reference points P ⊂ U. The value of m is chosen according to properties of the space, as discussed in Section 5.

    • for each of the m² pairs of points p_j, p_k ∈ P, the distance d(p_j, p_k) is calculated and stored in a lookup table

    • for each s_i ∈ S, we select a

Experimental evaluation

To evaluate the potential of the proposed mechanism, experimental evaluation is performed on the following metric spaces:

    nasa, colors

    The SISAP nasa and colors [5] are two benchmarks for metric indexing and searching approaches. The nasa set contains 40,150 real vectors of dimension 20, each obtained from images downloaded from the NASA photo and video archive site. The colors set contains 112,682 feature vectors of dimension 112. Each vector is a colour histogram of a medical image. These data

Conclusions

We presented a method to obtain good distance bounds between a query and all the database elements using a minimally-sized representation comprising only two reference object identifiers, and two floating point values, per database object. The two floating point values are the coordinates in a two-dimensional Euclidean space where a lower-bound for the actual distance to a query can be efficiently computed. The combination of the very large space of object pairs available from a relatively

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was partially supported by the VISECH project (ARCO-CNR, CUP B56J17001330004), co-funded by the Tuscany region.

References (36)

  • Zezula, P., et al., Similarity Search: The Metric Space Approach, Vol. 32 (2006)

  • Figueroa, K., et al., Speeding up spatial approximation search in metric spaces, J. Exp. Algorithmics (2010)

  • Burkhard, W.A., et al., Some approaches to best-match file searching, Commun. ACM (1973)

  • Baeza-Yates, R., et al., Proximity matching using fixed-queries trees

  • Celik, C., Priority vantage points structures for similarity queries in metric spaces

  • Ruiz, G., et al., Extreme pivots for faster metric indexes

  • Skopal, T., et al., Nearest neighbours search using the pm-tree

  • Chavez, E., et al., Spaghettis: an array based algorithm for similarity queries in metric spaces
