A comparison of pivot selection techniques for permutation-based indexing
Introduction
Given a set of objects $C$ from a domain $\mathcal{D}$, a distance function $d : \mathcal{D} \times \mathcal{D} \to \mathbb{R}$, and a query object $q \in \mathcal{D}$, a similarity search problem can be generally defined as the problem of finding a subset $S \subseteq C$ of the objects that are closest to $q$ with respect to $d$. Specific formulations of the problem can, for example, require finding the $k$ closest objects ($k$-nearest neighbors search, $k$-NN), i.e., $|S| = k$ and $d(q, x) \le d(q, y)$ for all $x \in S$, $y \in C \setminus S$, or all the objects within a given threshold distance $t$, i.e., $S = \{x \in C \mid d(q, x) \le t\}$. The $k$-NN formulation is the most common one.
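As an illustration of the two formulations, the following is a minimal sketch of exact k-NN and range search by linear scan. The function names `knn_search` and `range_search`, the toy two-dimensional data, and the use of the Euclidean distance are assumptions made for this example only; they are not taken from the paper.

```python
import heapq
import math

def knn_search(C, d, q, k):
    """Exact k-NN by linear scan: the k objects of C closest to q under d."""
    return heapq.nsmallest(k, C, key=lambda x: d(q, x))

def range_search(C, d, q, t):
    """Exact range search: all objects of C within distance t of q."""
    return [x for x in C if d(q, x) <= t]

# Toy data (assumption): points in the plane with the Euclidean distance.
C = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
q = (0.0, 0.0)

knn = knn_search(C, math.dist, q, 2)
in_range = range_search(C, math.dist, q, 2.0)
```

Both functions evaluate d once per object in C, which is the cost that index structures and approximate methods aim to avoid.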
Similarity search is a difficult problem, and various indexing schemes have been defined to process similarity queries efficiently. Good surveys of the approaches proposed in the literature can be found in [39], [34]. However, in most applications, for instance multimedia retrieval, an exact solution to the similarity search problem is not strictly required. In these cases, performing an approximate similarity search [40], [30] is sufficient. Accepting even a small degree of approximation in the results allows them to be obtained much more efficiently.
Permutation-based indexes have been proposed as a new approach to efficient and effective approximate similarity search [2], [12], [16], [28]. In permutation-based indexes, data objects and queries are represented as appropriate permutations of a set of $n$ pivots $P = \{p_1, \ldots, p_n\}$ taken from the domain. Formally, every object $o$ of the domain is associated with a permutation $\Pi_o$ that lists the identifiers of the pivots by their closeness to $o$, i.e., $d(o, \Pi_o(j)) \le d(o, \Pi_o(j+1))$ for all $j \in \{1, \ldots, n-1\}$, where $\Pi_o(j)$ indicates the pivot at position $j$ in the permutation associated with object $o$. For convenience, we denote the position of a pivot $p_i$ in the permutation of an object $o$ as $\Pi_o^{-1}(i)$, so that $\Pi_o(\Pi_o^{-1}(i)) = p_i$.
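The construction of a permutation and of its inverse can be sketched as follows. This is only an illustrative sketch: the helper names `permutation` and `inverse_permutation`, the 0-based pivot identifiers, the tie-breaking by pivot identifier, and the toy Euclidean data are all assumptions of the example.

```python
import math

def permutation(o, pivots, d):
    """Pivot identifiers (0-based) sorted by increasing distance from o.

    Ties are broken by pivot identifier (an assumption of this sketch).
    """
    return sorted(range(len(pivots)), key=lambda i: (d(o, pivots[i]), i))

def inverse_permutation(perm):
    """inv[i] is the position of pivot i in perm."""
    inv = [0] * len(perm)
    for position, pivot_id in enumerate(perm):
        inv[pivot_id] = position
    return inv

# Toy data (assumption): three pivots in the plane, Euclidean distance.
pivots = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
o = (3.0, 0.0)

perm = permutation(o, pivots, math.dist)  # pivot 1 is closest, then 0, then 2
inv = inverse_permutation(perm)
```

Computing a permutation requires n distance evaluations and one sort, and this cost is paid once per indexed object and once per query.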
The similarity between objects is approximated by comparing their representations in terms of permutations. The basic intuition is that if the permutations of two objects are similar, i.e., the two objects see the pivots in a similar order of distance, then the two objects are likely to be similar also with respect to the original distance function d.
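To make this intuition concrete, the sketch below compares permutations with the Spearman footrule distance, i.e., the sum over all pivots of the absolute difference between the positions of the pivot in the two permutations. The passage above does not name a specific permutation distance, so the choice of the footrule is an assumption of this example, as are the helper names and the toy Euclidean data.

```python
import math

def positions(o, pivots, d):
    """pos[i] is the position (0-based) of pivot i when pivots are sorted by distance from o."""
    order = sorted(range(len(pivots)), key=lambda i: (d(o, pivots[i]), i))
    pos = [0] * len(pivots)
    for rank, pivot_id in enumerate(order):
        pos[pivot_id] = rank
    return pos

def spearman_footrule(pos_a, pos_b):
    """Sum of absolute differences between the positions of each pivot."""
    return sum(abs(a - b) for a, b in zip(pos_a, pos_b))

# Toy data (assumption): four pivots at the corners of a rectangle.
pivots = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (4.0, 3.0)]
o1 = (1.0, 0.5)
o2 = (0.8, 0.6)   # close to o1
o3 = (3.5, 2.8)   # far from o1

near = spearman_footrule(positions(o1, pivots, math.dist),
                         positions(o2, pivots, math.dist))
far = spearman_footrule(positions(o1, pivots, math.dist),
                        positions(o3, pivots, math.dist))
```

In this toy configuration the two nearby objects see the pivots in the same order, while the distant object sees them in a very different order, so `near` is smaller than `far`.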
Once the set of pivots P is defined, it must be kept fixed for all the indexed objects and queries, because permutations derived from different sets of pivots are not comparable. The selection of a “good” set of pivots is thus an important step in the indexing process, where the “goodness” of the set is measured by the effectiveness and efficiency of the resulting index structure at search time.
Permutation-based methods share some ideas with Shared Nearest Neighbors (SNN) methods [22], [32], [14]. These methods introduce the concept of secondary similarity measures, which evaluate the similarity between two objects by considering the amount of overlap of their neighborhoods. The neighborhood of an object is determined using the original distance and all objects of the dataset. Secondary similarity measures have been shown to reduce the impact of the curse of dimensionality in cases in which the discriminative power of the primary similarity measure is reduced by the high dimensionality of the similarity space. The difference between permutation-based methods and SNN methods is that permutation-based methods encode original objects with neighbor objects taken from a very small subset of the dataset, rather than from the entire dataset. Moreover, the primary purpose of this encoding is to build efficient and scalable index structures for approximate similarity search, rather than to compute a better distance than the original one, as SNN methods do. In addition, SNN methods need no pivot selection technique, given that they use the entire dataset to determine the neighborhood of objects.
In the field of permutation-based access methods, the most commonly adopted technique for the definition of P is to randomly select the n objects from C [2], [12], [16]. Even though there is a relatively rich literature on pivot selection techniques for the general class of pivot-based access methods [39] (see Sections 2 and 3), to the best of our knowledge no rigorous comparison of the effectiveness of the various selection techniques, when used in combination with permutation-based access methods, has been performed yet. In this paper we compare five techniques for the definition of sets of pivots to be used by permutation-based access methods. One of the techniques that we compare is a novel proposal that we have designed for use with permutation-based indexes.
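The random baseline mentioned above amounts to sampling n distinct objects from C. A minimal sketch follows; the function name `select_random_pivots`, the `seed` parameter, and the synthetic dataset are assumptions of the example and do not come from the paper.

```python
import random

def select_random_pivots(C, n, seed=None):
    """Pick n distinct objects of C uniformly at random to serve as pivots."""
    rng = random.Random(seed)  # a seed makes the selection reproducible
    return rng.sample(C, n)

# Toy dataset (assumption): 100 distinct points in the plane.
C = [(float(i), float(i % 7)) for i in range(100)]
pivots = select_random_pivots(C, 10, seed=42)
```

Fixing the seed is useful in practice because the same pivot set must be used for both indexing and querying.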
In summary, the contribution of this paper is twofold.
1. We test various pivot selection techniques, including random selection, on a number of permutation-based indexes. An interesting result is that different selection methods were optimal for different indexing schemes. Indeed, different indexing schemes use permutations in fundamentally different ways, and this is reflected in the performance of the pivot selection techniques.
2. We propose and compare a new pivot selection criterion, expressly designed to be used with permutation-based indexes. This method is clearly superior when used with the MI-File index [2].
The paper is structured as follows. In Section 2 we discuss related work. Section 3 presents the techniques being compared. The tested similarity search access methods are presented in Section 4. Section 5 describes the experiments and comments on their results. Conclusions and future work are given in Section 6.
Related work
The study of pivot selection techniques for access methods usually classified as pivot-based [39] has been an active research topic in the field of similarity search in metric spaces since the 1990s. Most access methods make use of pivots to reduce the number of data objects accessed during similarity query execution. The choice of the pivots plays a relevant role in allowing the access methods to achieve their best performance. In an early work by Shapiro [37], it was noticed that good
Pivot selection techniques
Permutation-based access methods use pivots to build permutations that represent data objects. However, different permutation-based indexes make different use of the permutations. We can broadly identify two different roles played by the permutations in these indexing schemes:
1. Rank data objects according to the distance (or dissimilarity) between permutations, rather than the original distance.
2. Provide focused access in the database to identify and retrieve candidate objects, in which the
Permutation-based similarity access methods
We have compared the pivot selection techniques on three permutation-based index structures that reasonably cover the various approaches adopted in the literature by access methods based on permutations.
Datasets and groundtruth
Experiments were conducted using the CoPhIR dataset [6], which is currently the largest multimedia metadata collection available for research purposes. It consists of a crawl of 106 million images from the Flickr photo sharing website. We ran experiments using as the distance function d a linear combination of the five distance functions for the five MPEG-7 descriptors that have been extracted from each image. As weights for the linear combination we have adopted those proposed in [5],
Conclusion
In this paper we compared five pivot selection techniques on three permutation-based access methods. For all the tested access methods we found at least one technique that significantly outperforms random selection. Another interesting point is that no technique is universally the best for all the access methods.
In Section 3, we identified two roles that the permutations can play in the access methods. First, they can be used for approximating the original distance between two
Acknowledgments
This work was partially supported by EAGLE (Europeana network of Ancient Greek and Latin Epigraphy, co-funded by the European Commission, CIP-ICT-PSP.2012.2.1 – Europeana and creativity, Project Reference 325122) and Secure! (Piattaforma intelligente basata su tecnologie crowdsourcing e crowdsensing per la sicurezza e la gestione delle crisi ed emergenze, Regione Toscana POR CReO 2007-2013, Linea di Intervento 1.5.a, 1.6, Bando Unico R&S Anno 2012).
References (40)
- et al. Pivot selection techniques for proximity searching in metric spaces. Pattern Recogn. Lett. (2003)
- et al. Performance guarantees for hierarchical clustering. J. Comput. Syst. Sci. (2005)
- Use of permutation prefixes for efficient and scalable approximate similarity search. Inf. Process. Manag. (2012)
- Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci. (1985)
- et al. A new version of the nearest-neighbour approximating and eliminating search algorithm AESA with linear preprocessing time and memory requirements. Pattern Recogn. Lett. (1994)
- et al. Metric index: an efficient and scalable solution for precise and approximate similarity search. Inf. Syst. (2011)
- et al. Approximate similarity search: a multi-faceted problem. J. Discrete Algorithms (2009)
- et al. Pivot selection strategies for permutation-based similarity search
- Giuseppe Amato, Claudio Gennaro, Pasquale Savino, Mi-file: using inverted files for scalable approximate similarity...
- Giuseppe Amato, Pasquale Savino, Approximate similarity search in metric spaces using inverted files, in: Proceedings...
- Building a web-scale image similarity search system. Multimedia Tools and Applications
- Indexing large metric spaces for similarity search queries. ACM Trans. Database Syst.
- Effective proximity retrieval by ordering permutations. IEEE Trans. Pattern Anal. Mach. Intell.