The NOBH-tree: Improving in-memory metric access methods by using metric hyperplanes with non-overlapping nodes

https://doi.org/10.1016/j.datak.2014.09.001

Abstract

In order to speed up similarity query evaluation, index structures divide the target dataset into subsets, aiming at finding the answer without examining the entire dataset. As the complexity of the data types handled by modern applications keeps growing, searching by similarity becomes increasingly important, which makes the Metric Space Theory the theoretical basis for the structures employed to index complex data. Also, as main memory capacity grows and its price decreases, increasingly larger databases can be completely indexed in main memory. Thus, more and more applications require the database management system to quickly build indexes that can take advantage of memory-based indexing. In this paper, we propose a new family of metric access methods, called NOBH-trees, that allows a non-overlapping division of the data space, combining Voronoi-shaped with ball-shaped regions to partition the metric space. We show how to query the subspaces divided by the hyperplanes and how the distance from any element to a hyperplane can be evaluated. The results obtained from the experiments show that the new MAM achieves better performance than the existing ones during both the construction and querying phases.

Introduction

Nowadays, new applications require handling increasingly complex data, such as images, audio, time series and genetic sequences, among others. Typically, those data domains do not possess the ordering relationship required by the indexing structures of existing Database Management Systems (DBMS), which were developed to handle data in numeric and short character string domains. In fact, the relational operators (<, ≤, > or ≥) in general do not make sense for complex data, and even the identity operators (= or ≠) are of little help.

Interesting queries for applications that handle complex data require finding elements that are similar, but not identical, to a given one. Thus, the similarity among complex elements is the most important property [17]. An example is retrieving the images from an image dataset that are similar to a given one. Similarity-based retrieval can benefit from the abstraction of distances from the mathematical theory of metric spaces [26], leading to the development of metric access methods (MAMs). MAMs rely only on the distances among pairs of elements in a dataset and are therefore also called distance-based index structures.

An indexing method usually employs a data structure to perform the task of searching within the data. The data structure can be a hierarchy, a graph, a hash method, a vector or a multivariate array. Indexing methods are composed of algorithms that manipulate the data structures in order to speed up query answering. Access methods combine a data structure with the restrictions of the medium where the data is stored to provide efficient data retrieval. They take advantage of indexing methods, considering the storage medium, buffer availability, and the type and sequence of accesses, i.e., all requirements to access data inside a database management system environment. MAMs employ an indexing method to access data in a metric space, where the properties of a metric hold for the data domain. In this paper we use the term metric access method for an indexing method able to access data in a metric space.

Metric spaces are defined by a data domain and a distance function. The distance function measures the dissimilarity between two elements from a given dataset, in such a way that smaller distances correspond to more similar elements. An insightful discussion on similarity measures applied over images as an example of complex objects can be found in [19]. Evaluating (dis-)similarity using a distance function allows representing data in metric spaces. Formally, a metric space is a pair ⟨𝕊, d⟩, where 𝕊 is the data domain and d: 𝕊 × 𝕊 → ℝ⁺ is the distance function, often called a metric, which satisfies the following properties for any s1, s2, s3 ∈ 𝕊 (a small code sketch after the list below illustrates these axioms):

  • Identity: d(s1, s2) = 0 ⇔ s1 = s2;

  • Symmetry: d(s1, s2) = d(s2, s1);

  • Non-negativity: 0 ≤ d(s1, s2) < ∞, and

  • Triangular inequality: d(s1, s2) ≤ d(s1, s3) + d(s3, s2).
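To make these axioms concrete, the following minimal sketch (an illustration added here, not part of the original formulation) checks them over a finite sample of elements for a user-supplied distance function; the Euclidean distance is used only as an example metric.

```python
from itertools import product
from math import dist  # Euclidean distance, used here only as an example metric


def satisfies_metric_axioms(sample, d, tol=1e-9):
    """Empirically check the four metric axioms on a finite sample of elements."""
    for s1, s2, s3 in product(sample, repeat=3):
        if (d(s1, s2) == 0) != (s1 == s2):            # identity
            return False
        if abs(d(s1, s2) - d(s2, s1)) > tol:          # symmetry
            return False
        if d(s1, s2) < 0:                             # non-negativity
            return False
        if d(s1, s2) > d(s1, s3) + d(s3, s2) + tol:   # triangular inequality
            return False
    return True


points = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
print(satisfies_metric_axioms(points, dist))  # True for the Euclidean metric
```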

A similarity query over a dataset S ⊆ 𝕊 returns the elements si ∈ S that meet a given similarity criterion, expressed through a reference element sq ∈ 𝕊. For example, in an image database, one may ask for the images that are similar to a given one regarding image-related features such as color, texture or shape. There are two main types of similarity queries: the range query and the k-nearest neighbor query.

  • Range query — Rq: given a maximum query distance rq, the range query Rq(sq, rq) retrieves every element si ∈ S such that d(si, sq) ≤ rq. An example is: “Select the images that are similar to the image P by up to five similarity units”, represented as Rq(P, 5);

  • k-Nearest neighbor query — kNNq: given an integer k ≥ 1, the k-nearest neighbor query kNNq(sq, k) retrieves the k elements in S nearest to the query center sq. An example is: “Select the 3 images most similar to the image P”, represented as kNNq(P, 3).

Equivalent queries involving similarity join exist, for example to obtain pairs of elements from two subsets that are closer than a given radius (range join) or the k pairs of closest neighbors (k-nearest join).
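As a baseline for the two query types above (and the comparison point for any MAM), the sketch below evaluates Rq and kNNq by sequential scan; the dataset, the query element and the metric d are illustrative assumptions.

```python
import heapq
from math import dist  # example metric


def range_query(dataset, d, s_q, r_q):
    """Rq(s_q, r_q): all elements within distance r_q of the query center."""
    return [s_i for s_i in dataset if d(s_i, s_q) <= r_q]


def knn_query(dataset, d, s_q, k):
    """kNNq(s_q, k): the k elements of the dataset nearest to the query center."""
    return heapq.nsmallest(k, dataset, key=lambda s_i: d(s_i, s_q))


# Example usage with the Euclidean metric over 2-D points.
data = [(0, 0), (1, 1), (2, 2), (5, 5)]
print(range_query(data, dist, (0, 0), 2.0))   # [(0, 0), (1, 1)]
print(knn_query(data, dist, (0, 0), 3))       # [(0, 0), (1, 1), (2, 2)]
```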

A MAM can be classified as memory-based or disk-based. Besides the storage medium, the main conceptual difference among hierarchical MAMs lies in the data space partitioning and in the tree node construction policies. Disk-based trees keep several elements per node and have enforcement rules to keep the tree height-balanced. It is also usually acceptable to have a relatively low occupancy rate (the amount of useful memory space effectively used to store data compared to the total amount of memory employed to store the whole data structure), aiming at reducing the number of disk accesses required. On the other hand, memory-based trees do not impose limits on the number of elements per node, leading to higher occupancy rates and allowing more freedom regarding memory space partitioning.

Disk-based access methods are designed on the premise that the index is built once and accessed several times, making it worthwhile to spend time on their construction if several queries can then be answered faster. Therefore, there is a trade-off between the time spent to build an index once and the time saved when executing several queries. When only one or a few queries will use an index, the overhead imposed by the disk accesses required to build it makes creating a disk-based index structure not worthwhile. On the other hand, main-memory-based structures remain worth using even to answer just one or a few queries, especially if they involve join operations. Similarity queries are expensive, so they benefit from the availability of indexes even when few (or often just one) of them must be executed.

With main memory increasing in capacity and declining in price, larger databases can be completely indexed in main memory. Thus, more and more applications require the DBMS to quickly build indexes that take advantage of main memory, especially for data domains where the indexes need to be frequently rebuilt, usually to perform similarity queries over the intermediate data resulting from a sub-query statement. Therefore, when persisting the index for a long time is not required, a main-memory-based access method is a valuable alternative.

Height balancing is a concept that has driven the development of index structures since their early days. Indeed, it is of paramount importance when a single comparison can partition the search space into a set of non-intersecting subspaces, as occurs when the ordering relationship holds. For those domains, a depth-first search is able to locate a data element in a single descent. Thus, keeping the tree as shallow as possible is the best way to improve query answering, and height balancing prevents a tree from degenerating into a long linked list. However, both metric and multidimensional access methods often divide the data into overlapping subspaces and, in such cases, the data structures must take care of both height balancing and node overlapping, two requirements that are, in general, conflicting. For memory-based trees, reading a descendant or a sibling node has the same cost. Therefore, it is not worthwhile to enforce height balancing at the expense of increased node overlapping. In this paper we focus on a main-memory-based MAM that ensures a good trade-off between height balancing and node overlapping.

Distance calculations employed to compare complex data are in general expensive (such as those for high-dimensional data), so one of the main concerns when developing a MAM is how to minimize the number of distance calculations during both the index construction and query execution phases. Thus, in this paper we use the number of distance calculations as one of the important measurements when evaluating the methods.
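Collecting this measurement requires only instrumenting the metric itself; a minimal sketch of such a counting wrapper, assumed here for illustration, follows.

```python
from math import dist  # example metric


class CountingMetric:
    """Wraps a distance function and counts how many times it is evaluated."""

    def __init__(self, d):
        self._d = d
        self.calls = 0

    def __call__(self, s1, s2):
        self.calls += 1
        return self._d(s1, s2)


# Example: count the distance computations spent by a sequential-scan range query.
d = CountingMetric(dist)
data = [(0, 0), (1, 1), (2, 2), (5, 5)]
_ = [s for s in data if d(s, (0, 0)) <= 2.0]
print(d.calls)  # 4, one evaluation per element scanned
```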

Trigonometric pruning exploits trigonometric properties of the data space to reduce the search space. It was employed in the work of Emrich [16] to speed up AkNN queries (all-k-nearest-neighbor queries) based on spherical page regions. It was also employed by Jacox [21] to improve similarity join algorithms in high-dimensional sets with a method called QuickJoin. The performance of AESA and derived methods was improved by Ban [2] using geometric properties that define lower bounds for approximating distances in metric spaces. Most of the trigonometric properties are derived from geometric properties of the embedding space and are applicable to positive semi-definite metric space models. Therefore, the chosen metric spaces must be positive definite, in accordance with kernel methods [41]. A very detailed discussion on positive definite functions can be found in the work of Schoenberg [40].

The problem of checking whether an n-point metric space allows a D-embedding, i.e., without any restriction on the dimension, can be solved by a polynomial-time algorithm [31]. A positive-definite metric space has an associated positive definite function, which is related to the existence of a positive semi-definite matrix in a suitable convex set of matrices. This problem is also related to D-embedding into Euclidean spaces, which is expressed as follows.

Let ⟨S, d⟩ be an n-point metric space, let f: S → ℝⁿ be an embedding, and let X be the n × n matrix whose columns are indexed by the elements of S, such that the ith column is the vector f(i) ∈ ℝⁿ. The matrix Q = XᵀX has entries qij, where qij is the scalar product ⟨f(i), f(j)⟩. Thus, the matrix Q is positive semi-definite, since for any x ∈ ℝⁿ we have xᵀQx = (xᵀXᵀ)(Xx) = ‖Xx‖² ≥ 0.

Accordingly, the metric space ⟨S, d⟩ can be D-embedded into L2 if and only if there is a symmetric real positive semi-definite matrix Q whose entries satisfy dij² ≤ qii + qjj − 2qij ≤ D²·dij² for any i, j ∈ S.
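The construction above can be checked numerically: for elements already embedded in Euclidean space, Q = XᵀX is positive semi-definite and qii + qjj − 2qij reproduces the squared Euclidean distances (the case D = 1). The sketch below is an illustrative verification using NumPy, not part of the original paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 5, 3
X = rng.standard_normal((dim, n))    # columns are the embedded elements f(i)

Q = X.T @ X                          # Gram matrix: q_ij = <f(i), f(j)>

# Positive semi-definiteness: all eigenvalues of the symmetric matrix Q are >= 0.
print(np.all(np.linalg.eigvalsh(Q) >= -1e-9))

# q_ii + q_jj - 2 q_ij reproduces the squared Euclidean distances (D = 1).
d2 = Q.diagonal()[:, None] + Q.diagonal()[None, :] - 2 * Q
d2_direct = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
print(np.allclose(d2, d2_direct))
```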

Metrics L1 and L2 are positive definite for any dimension of the points [32]. However, the squared Euclidean metric does not, in general, satisfy the triangle inequality [40]. Thus, the problem of finding a positive semi-definite matrix reduces to solving a system of linear inequalities, which can be done in polynomial time in the size of Q. Ultrametric spaces were also proved to be positive definite [48]. An extension of the earth mover's distance for image matching was proposed and proved to be a positive definite metric in [28].

The Mahalanobis distance is often used in neural information processing systems, being applied as a metric for classification. One of the associated burdens is to ensure that the Mahalanobis matrix employed remains positive semi-definite [42]. Although some variants of the Hausdorff distance can be metrics, in some cases they are not positive definite [6]. The quadratic form distance is a metric distance function on the class of feature histograms if its inherent similarity function is positive definite [4]. In general, it is possible to embed general non-Euclidean distances into a Euclidean space by transforming the problem using positive eigenvalues [38].

In this paper we use trigonometric pruning as a technique to create MAMs, generating a whole new family of metric trees, called NOBH-trees (Non-Overlapping Balls and Hyperplanes trees). We show how to use metric hyperplanes and balls, centered at elements defined as pivots, to partition the metric regions in such a way that the data can be stored in trees. We also show how to evaluate the distance from any given element to a metric hyperplane. The NOBH-tree can be built in several variants, depending on how the metric space is partitioned.

This paper is structured as follows. The next section surveys the main existing index structures and shows how they partition the data space. Section 3 presents the NOBH-tree family of MAMs and details the algorithms that construct two members of the family. Section 4 presents retrieval algorithms to perform both range and k-NN queries over those MAMs. Section 5 presents the experiments and the results achieved when using the new structure to index several representative, real-world datasets. Finally, Section 6 concludes the work.


Related work

Designing efficient access methods has long been a main goal of many researchers. The need to retrieve multi-dimensional data has triggered the development of several spatial access methods. However, most of them are efficient only for low-dimensional domains, such as those targeting geographical applications.

Spatial access methods index data by considering each object as a point in a multidimensional space. However, they often suffer from the “dimensionality curse”. Several methods were

The NOBH-tree

A metric access method can only take advantage of the properties shared by the metric associated with the data domain; no property specific to a given dataset can be employed. Recall that only the objects and the distances among them are available. The triangular inequality property of metric spaces is probably the most frequently used one to improve performance when accessing index disk blocks, as it is the main property that allows pruning subtrees formed with ball-shaped regions.
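To illustrate how the triangular inequality prunes ball-shaped regions, the sketch below shows the standard test used by ball-partitioning MAMs (an illustration under assumed names, not the NOBH-tree code): a node covering a ball of center c and radius r cannot contain answers to Rq(sq, rq) when d(sq, c) > r + rq.

```python
from math import dist  # example metric


def ball_can_be_pruned(d, s_q, r_q, center, covering_radius):
    """Triangle-inequality test: True if the ball region cannot intersect the query ball.

    For any element s inside the region, d(s_q, s) >= d(s_q, center) - covering_radius,
    so when that lower bound exceeds r_q no element of the region can answer Rq(s_q, r_q).
    """
    return d(s_q, center) > r_q + covering_radius


# Example with the Euclidean metric: the region centered at (10, 10) with radius 1
# is safely discarded for a range query of radius 2 around the origin.
print(ball_can_be_pruned(dist, (0, 0), 2.0, (10, 10), 1.0))  # True
```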

Querying the NOBH-tree

Answering queries on any NOBH-tree that uses the hyperplane boundary requires knowing the distance ḋ from an element si to a hyperplane η, in order to determine whether a given region can be pruned from further analysis. Assuming H is the set of elements sh ∈ S that lie on the hyperplane η, we define the distance from si to η as ḋ(si, H) = inf{ d(sh, si) : sh ∈ H }, where inf denotes the infimum of the set.
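The infimum above is rarely computable directly. A common surrogate, sketched below as an assumption rather than the paper's exact evaluation, is the generalized-hyperplane lower bound obtained from the triangle inequality: any element on the side of pivot p2 is at distance at least (d(sq, p2) − d(sq, p1))/2 from the query element sq, so that side can be pruned when this bound exceeds the query radius.

```python
from math import dist  # example metric


def hyperplane_lower_bound(d, s_q, p1, p2):
    """Lower bound on the distance from s_q to any element on p2's side of the
    generalized hyperplane defined by pivots p1 and p2 (elements at least as close
    to p2 as to p1). Follows from two applications of the triangle inequality."""
    return max(0.0, (d(s_q, p2) - d(s_q, p1)) / 2.0)


def partition_can_be_pruned(d, s_q, r_q, p1, p2):
    """True if no element on p2's side can answer the range query Rq(s_q, r_q)."""
    return hyperplane_lower_bound(d, s_q, p1, p2) > r_q


# Example with the Euclidean metric: a query ball of radius 1 around (0, 0)
# cannot cross the hyperplane separating pivots (0, 0) and (10, 0).
print(partition_can_be_pruned(dist, (0, 0), 1.0, (0, 0), (10, 0)))  # True
```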

Fig. 6 shows examples of a range query posed on (a) the NOBH3-tree and on (b) the NOBHab3-tree. The

Experiments

We performed several experiments on the NOBH-tree members. In this section we show the most representative ones, comparing the NOBH3-tree and the NOBHab3-tree with the Slim-tree, the VP-tree, the MM-tree, the Onion-tree and the M-Index. We also compared them with the sequential scan strategy, to analyze and reveal the situations where it is worth using indexes.

Although the Slim-tree was designed as a disk-based MAM, we developed a version that runs entirely in the main memory, to allow a fair time

Conclusion

Both metric and multidimensional access methods often divide the data space into overlapping subspaces and, in such cases, the data structures must take care of both height balancing and node overlapping. Regarding memory-based trees, reading a descendant or a sibling node has the same cost. Therefore, it is not worthwhile to enforce height balancing at the expense of increased node overlapping.

In this paper we propose the application of the Metric Space Theory to build a new family of memory-based


References (53)

  • M. Batko et al., MESSIF: metric similarity search implementation framework (2007)

  • C. Beecks, Distance-based similarity models for content-based multimedia retrieval

  • S. Berchtold et al., The X-tree: an index structure for high-dimensional data

  • T. Bozkaya et al., Distance-based indexing for high-dimensional metric spaces

  • S. Brin, Near neighbor search in large metric spaces

  • W.A. Burkhard et al., Some approaches to best-match file searching, Commun. ACM (1973)

  • B. Bustos et al., Adapting metric indexes for searching in multi-metric spaces, Multimedia Tools Appl. (2012)

  • K. Chakrabarti et al., The hybrid tree: an index structure for high dimensional feature spaces

  • E. Chávez et al., Searching in metric spaces, ACM Comput. Surv. (2001)

  • P. Ciaccia et al., Indexing metric spaces with M-tree

  • T. Emrich et al., Optimizing all-nearest-neighbor queries with trigonometric pruning

  • C. Faloutsos, Indexing of multimedia data

  • V. Gaede et al., Multidimensional access methods, ACM Comput. Surv. (1998)

  • A. Goshtasby, Similarity and dissimilarity measures

  • G.R. Hjaltason et al., Ranking in spatial databases

  • E.H. Jacox et al., Metric space similarity joins, ACM Trans. Database Syst. (2008)

    Ives R. V. Pola received the B.Sc. degree in Computer Science from the Universidade Estadual Paulista Júlio de Mesquita Filho – UNESP, Brazil, in 2002. He received the M.Sc. and Ph.D. degrees in computer science from the University of São Paulo, Brazil in 2005 and 2010, respectively. He is currently a Post-Doctoral researcher at the same institute. His research interests include indexing methods for multidimensional data, similarity operators and metric joins.

    Agma J. M. Traina received the B.Sc., the M.Sc. and Ph.D. degrees in Computer Science from the University of São Paulo, Brazil, in 1983, 1987 and 1991, respectively. She is currently a full professor with the Computer Science Department of the University of São Paulo at São Carlos, Brazil. Her research interests include image databases, image mining, indexing methods for multidimensional data, information visualization and image processing for medical applications. She is a member of the IEEE Computer Society, ACM, SIGKDD, and SBC.


    This work is supported, in part, by CAPES (grant PNPD20131729), STIC-AmSud, the RESCUER project, funded by the European Commission (grant 614154) and by the Brazilian National Council for Scientific and Technological Development CNPq/MCTI (grant 490084/2013-3), CNPq (grants 308898/2013-3, 454800/2014-2, 308933/2013-3, 458064/2014-9) and the State of São Paulo Research Foundation.
