Elsevier

Information Systems

Volume 52, August–September 2015, Pages 176-188

A comparison of pivot selection techniques for permutation-based indexing

https://doi.org/10.1016/j.is.2015.01.010

Abstract

Recently, permutation-based indexes have attracted interest in the area of similarity search. The basic idea of permutation-based indexes is that data objects are represented as appropriately generated permutations of a set of pivots (or reference objects). Similarity queries are executed by searching for data objects whose permutation representation is similar to that of the query, following the assumption that similar objects are represented by similar permutations of the pivots. In the context of permutation-based indexing, most authors propose to select pivots randomly from the data set, given that traditional pivot selection techniques do not reveal better performance. However, to the best of our knowledge, no rigorous comparison has been performed yet. In this paper we compare five pivot selection techniques on three permutation-based similarity access methods. Among those, we propose a novel technique specifically designed for permutations. Two significant observations emerge from our tests. First, random selection is always outperformed by at least one of the tested techniques. Second, there is no technique that is universally the best for all permutation-based access methods; rather, different techniques are optimal for different methods. This indicates that the pivot selection technique should be considered an integral and relevant part of any permutation-based access method.

Introduction

Given a set of objects $C$ from a domain $D$, a distance function $d: D \times D \to \mathbb{R}$, and a query object $q \in D$, a similarity search problem can be generally defined as the problem of finding a subset $S \subseteq C$ of the objects that are closest to $q$ with respect to $d$. Specific formulations of the problem can, for example, require finding the $k$ closest objects ($k$-nearest neighbors search, $k$-NN), i.e., $|S| = k$ and $\forall x \in S, \forall y \in (C \setminus S): d(x,q) \le d(y,q)$, or all the objects that are closer than a given threshold distance $t$, i.e., $S = \{x \mid x \in C \wedge d(x,q) \le t\}$. The $k$-NN formulation is the most common one.
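
The two formulations above can be illustrated with an exhaustive scan. The following is a minimal sketch, not an indexing method: the function names `knn_search` and `range_search`, the toy one-dimensional data, and the distance `abs_diff` are assumptions introduced only for illustration.

```python
import heapq

def knn_search(collection, query, dist, k):
    """Return the k objects of `collection` closest to `query` under `dist`."""
    return heapq.nsmallest(k, collection, key=lambda x: dist(x, query))

def range_search(collection, query, dist, t):
    """Return all objects x with dist(x, query) <= t."""
    return [x for x in collection if dist(x, query) <= t]

def abs_diff(a, b):
    """Toy distance on real numbers (illustrative assumption)."""
    return abs(a - b)

points = [1.0, 4.0, 6.0, 9.0]
knn_result = knn_search(points, 5.0, abs_diff, k=2)
range_result = range_search(points, 5.0, abs_diff, t=1.0)
```

Both functions compute the distance from the query to every object, which is exactly the cost that index structures try to avoid.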

Similarity search is a difficult problem and various indexing schemes have been defined to process similarity queries efficiently. Good surveys of the various approaches proposed in the literature can be found in [39], [34]. However, in most applications, for instance multimedia retrieval, an exact solution to the similarity search problem is not strictly required. In these cases, performing an approximate similarity search [40], [30] is sufficient. Accepting even a small degree of approximation in the results allows them to be obtained much more efficiently.

Permutation-based indexes have been proposed as a new approach to efficient and effective approximate similarity search [2], [12], [16], [28]. In permutation-based indexes, data objects and queries are represented as appropriate permutations of a set of $n$ pivots $P = \{p_1, \dots, p_n\} \subset D$. Formally, every object $o \in D$ is associated with a permutation $\Pi_o$ that lists the identifiers of the pivots by their closeness to $o$, i.e., $\forall j \in \{1, 2, \dots, n-1\}: d(o, p_{\Pi_o(j)}) \le d(o, p_{\Pi_o(j+1)})$, where $p_{\Pi_o(j)}$ indicates the pivot at position $j$ in the permutation associated with object $o$. For convenience, we denote the position of a pivot $p_i$ in the permutation of an object $o \in D$ as $\Pi_o^{-1}(i)$, so that $\Pi_o(\Pi_o^{-1}(i)) = i$.
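
As an illustration of the definition above, the permutation of an object and its inverse can be computed as follows. This is a minimal sketch under stated assumptions: identifiers and positions are 0-based rather than 1-based, ties between equidistant pivots are broken by pivot identifier, and the names `permutation`, `inverse_permutation` and `abs_diff` are introduced here only for illustration.

```python
def permutation(obj, pivots, dist):
    """Pivot identifiers (0-based) ordered by increasing distance from obj."""
    # sorted() is stable, so equidistant pivots keep identifier order.
    return sorted(range(len(pivots)), key=lambda i: dist(obj, pivots[i]))

def inverse_permutation(perm):
    """inv[i] is the position of pivot i in perm, so perm[inv[i]] == i."""
    inv = [0] * len(perm)
    for position, pivot_id in enumerate(perm):
        inv[pivot_id] = position
    return inv

def abs_diff(a, b):
    """Toy distance on real numbers (illustrative assumption)."""
    return abs(a - b)

pivots = [0.0, 10.0, 4.0]
perm = permutation(3.0, pivots, abs_diff)   # distances to pivots: 3, 7, 1
inv = inverse_permutation(perm)
```

Here the closest pivot to the object 3.0 is the one with identifier 2, followed by identifiers 0 and 1.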

The similarity between objects is approximated by comparing their representations in terms of permutations. The basic intuition is that if the permutations relative to two objects are similar, i.e., the two objects see the pivots in a similar order of distance, then the two objects are likely to be similar also with respect to the original distance function d.
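
The intuition above can be made concrete with a distance between permutations. The text here does not say which measure is used; the sketch below assumes the Spearman footrule, a common choice in the permutation-indexing literature, which sums the differences between the positions of each pivot in the two permutations. The names `pivot_positions`, `spearman_footrule` and `abs_diff` and the toy data are illustrative assumptions.

```python
def pivot_positions(obj, pivots, dist):
    """Position (0-based) of every pivot in the permutation of obj."""
    order = sorted(range(len(pivots)), key=lambda i: dist(obj, pivots[i]))
    positions = [0] * len(pivots)
    for position, pivot_id in enumerate(order):
        positions[pivot_id] = position
    return positions

def spearman_footrule(pos_a, pos_b):
    """Sum over pivots of the absolute difference of their positions."""
    return sum(abs(a - b) for a, b in zip(pos_a, pos_b))

def abs_diff(a, b):
    """Toy distance on real numbers (illustrative assumption)."""
    return abs(a - b)

pivots = [0.0, 10.0, 4.0, 7.0]
query = pivot_positions(3.0, pivots, abs_diff)
near = spearman_footrule(query, pivot_positions(3.2, pivots, abs_diff))
far = spearman_footrule(query, pivot_positions(9.0, pivots, abs_diff))
```

In this toy setting, the object 3.2 sees the pivots in the same order as the query 3.0, while the object 9.0 sees them in a very different order, so its footrule distance is larger.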

Once the set of pivots P is defined, it must be kept fixed for all the indexed objects and queries, because the permutations deriving from different sets of pivots are not comparable. The selection of a “good” set of pivots is thus an important step in the indexing process, where the “goodness” of the set is measured by the effectiveness and efficiency of the resulting index structure at search time.

Permutation-based methods share some ideas with the Shared Nearest Neighbors (SNN) methods [22], [32], [14]. These methods introduce the concept of secondary similarity measures, which evaluate the similarity between two objects by considering the amount of overlap of their neighborhoods. The neighborhood of an object is determined using the original distance and all objects of the dataset. Secondary similarity measures have been shown to reduce the impact of the curse of dimensionality in cases in which the discriminative power of the primary similarity measure is reduced by the high dimensionality of the similarity space. The difference between the permutation-based methods and the SNN methods is that permutation-based methods encode original objects with neighbor objects taken from a very small subset of the dataset, rather than from the entire dataset. Moreover, the primary purpose of this encoding is to build efficient and scalable approximate similarity search index structures, rather than to compute a better distance than the original one, as SNN methods do. In addition, no pivot selection technique is needed by SNN methods, given that they use the entire dataset to determine the neighborhood of objects.
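
To make the notion of a secondary similarity measure concrete, the following sketch counts the objects shared by the k-nearest-neighbor sets of two objects, computed over the entire dataset. It is an illustrative simplification, not the exact measure of [22], [32], [14], which may normalize or weight the overlap; the names `neighborhood`, `snn_overlap` and `abs_diff` and the toy data are assumptions.

```python
def neighborhood(obj, dataset, dist, k):
    """Indices of the k dataset objects closest to obj (primary distance)."""
    order = sorted(range(len(dataset)), key=lambda i: dist(obj, dataset[i]))
    return set(order[:k])

def snn_overlap(x, y, dataset, dist, k):
    """Secondary similarity: size of the overlap of the two neighborhoods."""
    return len(neighborhood(x, dataset, dist, k) &
               neighborhood(y, dataset, dist, k))

def abs_diff(a, b):
    """Toy primary distance on real numbers (illustrative assumption)."""
    return abs(a - b)

dataset = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
close_pair = snn_overlap(1.0, 2.0, dataset, abs_diff, k=3)
distant_pair = snn_overlap(1.0, 11.0, dataset, abs_diff, k=3)
```

Note that each neighborhood is computed against all objects of the dataset, whereas a permutation-based method would only compare each object with a small set of pivots.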

In the field of permutation-based access methods the most commonly adopted technique for the definition of P is to randomly select the n objects from C [2], [12], [16]. Even though there is a relatively rich literature on pivot selection techniques for the general class of pivot-based access methods [39] (see Sections 2 and 3), to the best of our knowledge, no rigorous comparison of the effectiveness of the various selection techniques, when used in combination with permutation-based access methods, has been performed yet. In this paper we compare five techniques for the definition of sets of pivots to be used by permutation-based access methods. One of the techniques that we compare is a novel proposal that we have designed to be used with permutation-based indexes.
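
The random baseline mentioned above amounts to drawing n distinct objects from C. A minimal sketch follows; the function name `select_random_pivots`, the `seed` parameter and the toy dataset are assumptions made for illustration.

```python
import random

def select_random_pivots(collection, n, seed=None):
    """Draw n distinct objects from the collection, uniformly at random."""
    rng = random.Random(seed)  # a local generator keeps runs reproducible
    return rng.sample(list(collection), n)

dataset = list(range(1000))
pivots = select_random_pivots(dataset, 10, seed=42)
```

Fixing the seed matters in practice because, once chosen, the same pivot set has to be used for all indexed objects and all queries.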

In summary, the contribution of this paper is twofold.

  1. We test various pivot selection techniques, including random selection, on a number of permutation-based indexes. An interesting result is that different selection methods were optimal for different indexing schemes. In fact, the way in which permutations are used differs fundamentally among indexing schemes, and this is reflected in the pivot selection techniques.

  2. We propose and compare a new pivot selection criterion, expressly designed to be used with permutation-based indexes. This method is clearly superior when used with the MI-File index [2].

The paper is structured as follows. In Section 2 we discuss related work. Section 3 presents the techniques being compared. The tested similarity search access methods are presented in Section 4. Section 5 describes the experiments and comments on their results. Conclusions and future work are given in Section 6.

Section snippets

Related work

The study of pivot selection techniques for access methods usually classified as pivot-based [39] has been an active research topic, in the field of similarity search in metric spaces, since the nineties. Most access methods make use of pivots to reduce the number of data objects accessed during similarity query execution. The choice of the pivots plays a relevant role in allowing the access methods to achieve their best performance. In an early work by Shapiro [37], it was noticed that good

Pivot selection techniques

Permutation-based access methods use pivots to build permutations that represent data objects. However, different permutation-based indexes make different use of the permutations. We can broadly identify two different roles played by the permutations in these indexing schemes:

  1. Rank data objects according to the distance (or dissimilarity) between permutations, rather than the original distance.

  2. Provide focused access in the database to identify and retrieve candidate objects, in which the

Permutation-based similarity access methods

We have compared the pivot selection techniques on three permutation-based index structures that reasonably cover the various approaches adopted in the literature by access methods based on permutations.

Datasets and groundtruth

Experiments were conducted using the CoPhIR dataset [6], which is currently the largest multimedia metadata collection available for research purposes. It consists of a crawl of 106 million images from the Flickr photo sharing website. We have run experiments by using as the distance function d a linear combination of the five distance functions for the five MPEG-7 descriptors that have been extracted from each image. As weights for the linear combination we have adopted those proposed in [5],

Conclusion

In this paper we compared five pivot selection techniques on three permutation-based access methods. For all the tested access methods we found at least one technique that significantly outperforms random selection. Another interesting point is that there is no technique that is universally the best for all the access methods.

In Section 3, we identified two roles that the permutations can play in the access methods. First, they can be used for approximating the original distance between two

Acknowledgments

This work was partially supported by EAGLE (Europeana network of Ancient Greek and Latin Epigraphy, co-funded by the European Commission, CIP-ICT-PSP.2012.2.1 – Europeana and creativity, Project Reference 325122) and Secure! (Piattaforma intelligente basata su tecnologie crowdsourcing e crowdsensing per la sicurezza e la gestione delle crisi ed emergenze, Regione Toscana POR CReO 2007–2013, Linea di Intervento 1.5.a, 1.6, Bando Unico R&S Anno 2012).

References (40)

  • Michal Batko et al., Building a web-scale image similarity search system, Multimedia Tools and Applications (2010)
  • Michal Batko, Petra Kohoutkova, Pavel Zezula, Combining metric features in large collections, in: Proceedings of the...
  • Paolo Bolettieri, Andrea Esuli, Fabrizio Falchi, Claudio Lucchese, Raffaele Perego, Tommaso Piccioli, Fausto Rabitti,...
  • Tolga Bozkaya et al., Indexing large metric spaces for similarity search queries, ACM Trans. Database Syst. (1999)
  • Sergey Brin, Near neighbor search in large metric spaces, in: VLDB'95, Proceedings of the 21th International Conference...
  • B. Bustos, G. Navarro, E. Chavez, Pivot selection techniques for proximity searching in metric spaces, in: Proceedings....
  • B. Bustos, O. Pedreira, N. Brisaboa, A dynamic pivot selection technique for similarity search, in: IEEE 24th...
  • Edgar Chávez et al., Effective proximity retrieval by ordering permutations, IEEE Trans. Pattern Anal. Mach. Intell. (2008)
  • Agni Delvinioti, Hervé Jégou, Laurent Amsaleg, Michael Houle, Image retrieval with reciprocal and shared nearest...
  • Andrea Esuli, Mipai: using the pp-index to build an efficient and scalable similarity search system, in: SISAP, 2009,...

This is a revised and extended version of a paper that appeared as [1].
