Elsevier

Information Systems

Volume 72, December 2017, Pages 117-135
CLAP, ACIR and SCOOP: Novel techniques for improving the performance of dynamic Metric Access Methods

https://doi.org/10.1016/j.is.2017.10.003

Highlights

  • Techniques for improving Metric Access Methods in similarity queries are proposed.

  • CLAP technique reduces the number of distance calculations.

  • ACIR and SCOOP techniques reduce the number of disk accesses.

  • Gains of up to 63% in point queries are obtained.

  • Gains of up to 53% in queries retrieving multiple elements are obtained.

Abstract

Constant technological advances in electronic devices have driven the growth of complex data such as large texts, time series, georeferenced imagery, genetic sequences, photos and videos. Unlike traditional scalar data types such as numbers and strings, complex data lack the order relation property, which allows identifying whether one element precedes another according to some criterion. Therefore, such data are usually compared by their degree of similarity. Metric Access Methods (MAMs) are recognized as well suited to performing similarity queries over this kind of data more efficiently than other access methods. MAMs can be considered dynamic or static depending on the pivot type used to construct them. Pivots are often employed to narrow the search for data. Global pivots can be used to examine elements across the whole dataset, so they have a high impact on pruning: a single global pivot can discard a large number of irrelevant elements. Nevertheless, MAMs based on global pivots may have their dynamicity compromised, because any pivot-related update must be propagated through the entire structure. Local pivots, on the other hand, allow maintenance to occur locally at the price of a lower pruning ability. In this paper, we propose novel techniques for improving the performance of dynamic MAMs without harming their dynamicity, since several applications handle online complex data and consequently demand efficient dynamic indexes.
Specifically, our main contributions are three techniques: (i) CLAP, which consists of employing local additional pivots to reduce distance calculations; (ii) ACIR, which is combined with CLAP and anticipates information from child nodes to reduce unnecessary disk accesses; and (iii) SCOOP, which is combined with CLAP as an extended version of ACIR, anticipating a larger amount of information from child nodes. The techniques have been applied to a dynamic MAM and evaluated over real datasets ranging from moderate to high dimensionality and cardinality. The experimental results show that our techniques were able to reduce query execution time by up to 63% for point queries and by up to 53% for queries retrieving multiple elements.

Introduction

In recent years, the technological advances in electronic devices have accelerated the generation of complex data. In this work, we use the term complex data to refer to data that cannot be represented by traditional types, such as numbers, characters, dates and short texts. Examples of complex data are large texts, time series, georeferenced imagery, genetic sequences, photos and videos.

Scalar data domains possess the order relation property, which allows identifying, for each pair of elements in the domain, whether one precedes the other according to some criterion. Based on this property, most of the index structures implemented in current Relational Database Management Systems (RDBMSs) can perform queries efficiently. However, the order relation does not apply to most complex domains [1]. Since traditional index structures rely on this property, they are not suitable for complex data. Hence, Metric Access Methods (MAMs) were developed to index complex data and to allow efficient similarity searches.

Several MAMs have been proposed, categorized in different ways depending on which factors are considered to structure the indexed data. These factors comprise: response type, structure dynamicity, space partitioning and pivot type. Regarding response type, MAMs can be either exact or approximate. Approximate MAMs provide less accurate responses in favor of efficiency. As to structure dynamicity, MAMs can be either dynamic or static. Dynamic MAMs enable adding and removing elements at any time with no need for reconstruction. Static MAMs, on the other hand, require the whole dataset to exist before indexing and usually need to be reconstructed in the face of updates. Considering space partitioning, the basic types include: ball partitioning [2], generalized hyperplane partitioning [2] and excluded middle partitioning [3]. Lastly, with respect to pivot type, MAMs can be based either on global or local pivots.

In this paper, pivots are elements that act as representatives of certain regions of the dataset. Their purpose is to prune irrelevant elements during query execution. A pivot is global when every element in the dataset can be referenced through it, and local when only a portion of the dataset can be referenced through it. Because global pivots refer to every element in the dataset, they have a high impact on pruning: a single global pivot can be used to discard large amounts of irrelevant elements. However, MAMs based on global pivots may have their dynamicity compromised, because any pivot-related update must be propagated through the entire structure. Local pivots, on the other hand, keep maintenance local at the price of a lower pruning ability. Therefore, pivot type and structure dynamicity are directly related to each other.
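The pruning role of a pivot can be illustrated with a minimal sketch. It assumes Euclidean distance, a flat in-memory dataset and a single pivot whose distances to all elements were precomputed; the function and variable names are hypothetical and do not come from the paper. The filter relies only on the triangle inequality, d(q, s) ≥ |d(q, p) − d(s, p)|, so an element is discarded without computing d(q, s) whenever this lower bound already exceeds the query radius.

```python
from math import dist  # Euclidean distance (Python 3.8+); an assumed metric


def range_query_with_pivot(data, pivot_dists, pivot, query, radius):
    """Range query filtered by one pivot.

    Returns the answers and the number of query-to-element distance
    calculations that could not be avoided.
    """
    d_qp = dist(query, pivot)  # single distance from the query to the pivot
    answers, computed = [], 0
    for s, d_sp in zip(data, pivot_dists):
        # Triangle inequality: d(q, s) >= |d(q, p) - d(s, p)|.
        if abs(d_qp - d_sp) > radius:
            continue  # s cannot be in the answer; no distance calculation
        computed += 1
        if dist(query, s) <= radius:
            answers.append(s)
    return answers, computed


data = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (9.0, 9.0)]
pivot = (0.0, 0.0)
pivot_dists = [dist(s, pivot) for s in data]  # precomputed at indexing time

answers, computed = range_query_with_pivot(
    data, pivot_dists, pivot, query=(1.0, 1.0), radius=1.0
)
print(answers, computed)
```

In this toy dataset three of the four elements are discarded by the pivot alone. If the pivot were replaced, every entry of `pivot_dists` would have to be recomputed, which is the maintenance cost of global pivots mentioned in the text.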

In this paper, we address the challenge of improving the pruning ability of dynamic MAMs without harming their dynamicity. This is relevant for applications that manage online complex data and, consequently, demand dynamic and efficient index structures. We present novel techniques, applicable to hierarchical MAMs based on local pivots, which aim at reducing the number of distance calculations and disk accesses in similarity queries — two factors that determine the performance of MAMs. Specifically, our main contributions are as follows:

  • The CLAP (Cutting Local Additional Pivots) technique, which employs local additional pivots to reduce the number of distance calculations;

  • The ACIR (Anticipation of Child Information regarding Representatives) technique, which employs CLAP and anticipates information from child nodes related to node representatives to reduce the number of unnecessary disk accesses;

  • The SCOOP (Searching with Cutting local pivots and informatiOn anticipatiOn of Pivots) technique, which employs CLAP and anticipates information from child nodes regarding both node representatives and additional pivots.

The CLAP technique employs local additional pivots to reduce the uncertainty region in the search space. This is the region of the search space that may contain elements that are not in the answer, but nonetheless cannot be pruned until they are individually analyzed — which implies distance calculations. The ACIR and SCOOP techniques, in turn, anticipate information from child nodes into their parents to enable pruning the irrelevant elements before visiting the disk pages that actually store them. Unlike other approaches that employ multiple pivots to define regions in the search space, our approaches allow reducing distance calculations and disk accesses without impairing the index dynamicity.

We applied the CLAP, ACIR and SCOOP techniques to the MAM Slim-tree [4], [5] and extensively evaluated them in a set of experiments over real datasets. The datasets vary both in the number of elements and in the number of attributes. The experiments also employed distance functions of different computational costs, from linear to quadratic. The results confirm the efficiency of the techniques, as they provided significant gains in execution time, number of distance calculations and number of disk accesses in similarity queries over all datasets evaluated. Partial results of this work regarding the CLAP and ACIR techniques have been published [6]. In this paper, we extend the anticipation-of-information mechanism, resulting in the SCOOP technique; provide in-depth details regarding the three proposed techniques, including SCOOP’s construction and query algorithms; and implement the related Nearest-Neighbor graph (NN-graph) approach, employed by the dynamic MAM M*-tree [7], over the same code base to enable a fair comparison. Our techniques achieved notable gains over NN-graph, proving effective at improving the performance of dynamic hierarchical MAMs.

The rest of this paper is organized as follows. Section 2 covers concepts regarding similarity queries over complex data and Section 3 presents the related work. Section 4.1 describes the CLAP technique. Sections 4.2 and 4.3 describe the anticipation techniques ACIR and SCOOP. Section 5 presents the application of the proposed techniques to the MAM Slim-tree, describing the new node structures and the algorithms for insertion and for similarity queries. Section 6 describes the experiments and discusses the results. Finally, Section 7 concludes the work.

Section snippets

Similarity queries over complex data

Most complex domains do not possess the order relation. Therefore, relational operators (e.g. <, ≤, > and ≥) cannot be employed on the elements of such domains to identify any precedence among them. It is also uncommon to employ the operators = and ≠ on complex data, since it is quite unlikely that two complex elements are identical. Take two images, for example: if a single pixel differs, they are not equal. Hence, in complex domains, queries that consider the similarity

Related work

The approaches proposed in this paper explore the use of multiple pivots in order to improve the performance of dynamic MAMs. Specifically, our approaches employ novel techniques based on multiple pivots aiming at reducing both distance calculations and disk accesses in similarity queries, with the objective of reducing their overall execution time. Accordingly, this section presents related work whose approaches aim at improving the performance of MAMs by employing multiple pivots,

The CLAP technique

This section describes the CLAP technique, which stands for Cutting Local Additional Pivots. We developed the CLAP technique to allow reducing the number of distance calculations performed in similarity queries by dynamic hierarchical MAMs. The number of distance calculations is reduced because CLAP restricts (or cuts) the uncertainty region of each node visited in a similarity query by employing additional pivots, stored in each node of the structure.
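The effect of additional pivots on the uncertainty region can be illustrated with a small sketch. It is not the authors' algorithm: it only assumes that a node keeps, for each element, the precomputed distances to the node representative and to one extra local pivot, and that the metric is Euclidean. All names are hypothetical. Each pivot yields a triangle-inequality lower bound on d(q, s), and the largest bound is used for pruning.

```python
from math import dist  # Euclidean distance (Python 3.8+); an assumed metric


def scan_node(elements, pivots, table, query, radius):
    """Range-scan one node; table[i][j] holds d(elements[i], pivots[j]).

    Returns the answers and the total number of distance calculations,
    including those from the query to the pivots.
    """
    query_dists = [dist(query, p) for p in pivots]
    answers, computed = [], len(pivots)
    for s, elem_dists in zip(elements, table):
        # The tightest lower bound on d(q, s) over all pivots of the node.
        bound = max(abs(dq - ds) for dq, ds in zip(query_dists, elem_dists))
        if bound > radius:
            continue  # pruned without computing d(q, s)
        computed += 1
        if dist(query, s) <= radius:
            answers.append(s)
    return answers, computed


elements = [(0.0, 2.0), (0.0, -2.0), (2.0, 0.2), (-2.0, 0.0), (1.5, 1.5)]
pivots = [(0.0, 0.0), (4.0, 0.0)]  # representative plus one additional pivot
table = [[dist(s, p) for p in pivots] for s in elements]
query, radius = (2.0, 0.0), 0.5

# Representative only: every element lies in the uncertainty region.
answers_rep, cost_rep = scan_node(
    elements, pivots[:1], [row[:1] for row in table], query, radius
)
# Representative plus the additional pivot: the region is cut down.
answers_clap, cost_clap = scan_node(elements, pivots, table, query, radius)
print(cost_rep, cost_clap)
```

Both scans return the same answer, but the second one halves the number of distance calculations in this toy node because the extra pivot separates elements that are equidistant from the representative. The distances in `table` belong to the node alone, so changing a pivot only affects that node.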

Distinctly from other pivot-based

Node structures

The Slim-tree has two node types: index node and leaf node. A leaf node has the following structure (the characters [ and ] delimit an array):

$\mathit{leafnode}[OID_i,\ s_i,\ \delta(s_i, s_{rep})]$

where $OID_i$ is the identifier of the $i$th element in the node; $s_i$ is the element itself, stored as a feature vector; and $\delta(s_i, s_{rep})$ is the distance between $s_i$ and its representative. An index node has the following structure:

$\mathit{indexnode}[s_i,\ \delta(s_i, s_{rep}),\ r_i,\ Ptr(T_{s_i}),\ \#Ent(T_{s_i})]$

where $s_i$ is the feature vector of the representative in the
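The two entry layouts can be mirrored in a short sketch. The field names are hypothetical. Because the snippet is truncated, the readings of r_i as a covering radius, Ptr(T_si) as a reference to the subtree page and #Ent(T_si) as an entry count are assumptions based on the usual description of the Slim-tree. The helper shows the standard way the stored distance δ(s_i, s_rep) lets a query skip a subtree without computing the distance from the query to s_i.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LeafEntry:
    oid: int                   # OID_i: identifier of the element
    features: Sequence[float]  # s_i: the element, as a feature vector
    dist_to_rep: float         # δ(s_i, s_rep): distance to the representative


@dataclass
class IndexEntry:
    features: Sequence[float]  # s_i: representative of the subtree
    dist_to_rep: float         # δ(s_i, s_rep)
    radius: float              # r_i (assumed: covering radius of the subtree)
    child_page: int            # Ptr(T_si) (assumed: id of the subtree's page)
    n_entries: int             # #Ent(T_si) (assumed: entries in the subtree)


def subtree_can_be_skipped(entry, dist_query_to_rep, query_radius):
    """True when the triangle inequality proves the subtree is irrelevant.

    |d(q, s_rep) - δ(s_i, s_rep)| is a lower bound on d(q, s_i); if it exceeds
    query_radius + r_i, the query ball and the subtree ball cannot intersect.
    """
    bound = abs(dist_query_to_rep - entry.dist_to_rep)
    return bound > query_radius + entry.radius


entry = IndexEntry(features=(3.0, 4.0), dist_to_rep=5.0, radius=1.0,
                   child_page=7, n_entries=12)
skipped_far = subtree_can_be_skipped(entry, dist_query_to_rep=10.0, query_radius=2.0)
skipped_near = subtree_can_be_skipped(entry, dist_query_to_rep=6.0, query_radius=2.0)
print(skipped_far, skipped_near)
```

When the test fails, as in the second call, the subtree is not necessarily relevant: the distance from the query to s_i still has to be computed before deciding whether to read the child page.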

Experimental results

The techniques presented in this paper have been evaluated in extensive experiments. We used datasets that range from moderate to high dimensionality and cardinality and we employed metrics with different computational costs, in order to evaluate our contributions in various scenarios. To carry out the experiments, we used a computer equipped with an Intel® Core i5-2400 processor, 4 GB of DDR3 1333 MHz RAM, a SATA III 7200 RPM hard disk and the Ubuntu Linux 14.04.1 LTS 64-bit operating system

Conclusion

In this paper, we have presented novel techniques for improving the performance of dynamic MAMs. Designed for hierarchical MAMs based on local pivots, our techniques aim at anticipating data required to perform node pruning through the triangle inequality property of metric spaces, thus reducing both the number of distance calculations and the number of disk accesses in similarity queries. The CLAP technique employs local additional pivots to reduce the number of distance calculations,

Acknowledgements

This research was funded, partly, by the Brazilian Coordination for the Improvement of Higher Education Personnel (CAPES), by the Brazilian National Council for Scientific and Technological Development (CNPq), and by the State of São Paulo Research Foundation (FAPESP).

References (48)

  • P. Zezula et al. Similarity Search: The Metric Space Approach (2006)
  • M.C.N. Barioni et al. Querying multimedia data by similarity in relational DBMS
  • D.R. Wilson et al. Improved heterogeneous distance functions. J. Artif. Intell. Res. (1997)
  • J. Hafner et al. Efficient color histogram indexing for quadratic form distance functions. IEEE Trans. Pattern Anal. Mach. Intell. (1995)
  • M.M. Deza et al. Encyclopedia of Distances (2009)
  • F. Long et al. Fundamentals of content-based image retrieval
  • C. Beecks et al. On stability of signature-based similarity measures for content-based image retrieval. Multimed. Tools Appl. (2014)
  • Y. Rubner et al. The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vis. (2000)
  • B.G. Park et al. Color-based image retrieval using perceptually modified Hausdorff distance. J. Image Video Process. (2008)
  • Y.A. Aslandogan et al. Techniques and systems for image and video retrieval. IEEE Trans. Knowl. Data Eng. (1999)
  • M.S. Lew et al. Content-based multimedia information retrieval: State of the art and challenges. ACM Trans. Multimed. Comput. Commun. Appl. (2006)
  • W.A. Burkhard et al. Some approaches to best-match file searching. Commun. ACM (1973)
  • P.N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. Proceedings of the 4th Annual ACM-SIAM Symposium on Discrete Algorithms (1993)
  • T. Bozkaya et al. Distance-based indexing for high-dimensional metric spaces. Proceedings of the ACM SIGMOD International Conference on Management of Data (1997)