CLAP, ACIR and SCOOP: Novel techniques for improving the performance of dynamic Metric Access Methods
Introduction
In recent years, the technological advances in electronic devices have accelerated the generation of complex data. In this work, we use the term complex data to refer to data that cannot be represented by traditional types, such as numbers, characters, dates and short texts. Examples of complex data are large texts, time series, georeferenced imagery, genetic sequences, photos and videos.
Scalar data domains possess the order relation property, which allows identifying, for each pair of elements in the domain, whether one precedes the other according to some criterion. Based on this property, most of the index structures implemented in current Relational Database Management Systems (RDBMSs) are able to perform queries efficiently. However, the order relation does not apply to most complex domains [1]. Since traditional index structures are based on this property, they are not suitable for complex data. Hence, Metric Access Methods (MAMs) were developed to index complex data and to allow efficient similarity searches.
Several MAMs have been proposed, and they can be categorized in different ways depending on which factors are considered to structure the indexed data. These factors are: response type, structure dynamicity, space partitioning and pivot type. Regarding response type, MAMs can be either exact or approximate. Approximate MAMs trade some accuracy for efficiency. As to structure dynamicity, MAMs can be either dynamic or static. Dynamic MAMs enable adding and removing elements at any time with no need for reconstruction. Static MAMs, on the other hand, require the whole dataset to exist before indexing and usually need to be reconstructed in the face of updates. Considering space partitioning, the basic types include: ball partitioning [2], generalized hyperplane partitioning [2] and excluded middle partitioning [3]. Lastly, with respect to pivot type, MAMs can be based either on global or local pivots.
In this paper, pivots are elements that act as representatives of certain regions of the dataset. Their purpose is to prune irrelevant elements during query execution. A pivot is global when every element in the dataset can be referenced through it, whereas a pivot is local when only a portion of the dataset can be referenced through it. Because global pivots refer to every element in the dataset, they have a high impact on the process of pruning irrelevant elements, since a single global pivot can be used to discard large amounts of irrelevant elements. However, MAMs based on global pivots may have their dynamicity compromised by the fact that eventual pivot-related updates must be propagated through the entire structure. Local pivots, on the other hand, restrict the maintenance locally at the price of a lower pruning ability. Therefore, pivot type and structure dynamicity are directly related to each other.
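The pruning role of a pivot can be illustrated with a minimal Python sketch. It is a generic triangle-inequality filter for a range query, not code from the paper: the function name `range_query_with_pivot`, its parameters, and the use of Euclidean distance are assumptions made for illustration only. The distances from each element to the pivot are assumed to have been stored when the index was built.

```python
from math import dist  # Euclidean distance (Python 3.8+); stands in for any metric


def range_query_with_pivot(elements, pivot_dists, pivot, query, radius):
    """Range query over `elements` using one pivot (hypothetical helper).

    pivot_dists[i] is the precomputed distance from elements[i] to `pivot`.
    Returns (matches, number of distance calculations made at query time).
    """
    d_qp = dist(query, pivot)  # one calculation against the pivot
    calcs = 1
    matches = []
    for s, d_sp in zip(elements, pivot_dists):
        # Triangle inequality: |d(q,p) - d(s,p)| <= d(q,s).
        # If this lower bound already exceeds the radius, s cannot qualify.
        if abs(d_qp - d_sp) > radius:
            continue  # pruned without computing d(q, s)
        calcs += 1
        if dist(query, s) <= radius:
            matches.append(s)
    return matches, calcs


elements = [(0, 0), (1, 0), (5, 0), (10, 0)]
pivot = (0, 0)
pivot_dists = [dist(s, pivot) for s in elements]  # done once, at index time
matches, calcs = range_query_with_pivot(elements, pivot_dists, pivot, (1, 0), 1.5)
```

In this toy dataset the two distant elements are discarded using only the stored pivot distances, so three distance calculations are made at query time instead of four.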
In this paper, we address the challenge of improving the pruning ability of dynamic MAMs without harming their dynamicity. This is relevant for applications that manage online complex data and, consequently, demand dynamic and efficient index structures. We present novel techniques, applicable to hierarchical MAMs based on local pivots, which aim at reducing the number of distance calculations and disk accesses in similarity queries — two factors that determine the performance of MAMs. Specifically, our main contributions are as follows:
- The CLAP (Cutting Local Additional Pivots) technique, which employs local additional pivots to reduce the number of distance calculations;
- The ACIR (Anticipation of Child Information regarding Representatives) technique, which employs CLAP and anticipates information from child nodes related to node representatives to reduce the number of unnecessary disk accesses;
- The SCOOP (Searching with Cutting local pivots and informatiOn anticipatiOn of Pivots) technique, which employs CLAP and anticipates information from child nodes regarding both node representatives and additional pivots.
The CLAP technique employs local additional pivots to reduce the uncertainty region in the search space. This is the region of the search space that may contain elements that are not in the answer, but nonetheless cannot be pruned until they are individually analyzed — which implies distance calculations. The ACIR and SCOOP techniques, in turn, anticipate information from child nodes into their parents to enable pruning the irrelevant elements before visiting the disk pages that actually store them. Unlike other approaches that employ multiple pivots to define regions in the search space, our approaches allow reducing distance calculations and disk accesses without impairing the index dynamicity.
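The effect of additional pivots on the uncertainty region can be sketched as follows. This is an illustrative filter under stated assumptions, not the authors' algorithm or node layout: the function `survivors`, the table of stored distances, and the choice of pivots are all hypothetical. An element stays in the uncertainty region only if the triangle-inequality bound fails to exclude it for every pivot, so each extra pivot can only shrink that region.

```python
from math import dist  # Euclidean distance (Python 3.8+); stands in for any metric


def survivors(pivot_table, query_pivot_dists, radius):
    """Indices of elements that no pivot can exclude (hypothetical helper).

    pivot_table[i][j] is the stored distance from element i to pivot j;
    query_pivot_dists[j] is the distance from the query to pivot j.
    """
    kept = []
    for i, row in enumerate(pivot_table):
        # The element remains a candidate only if every pivot's lower
        # bound |d(q,p_j) - d(s_i,p_j)| is within the query radius.
        if all(abs(dq - ds) <= radius for dq, ds in zip(query_pivot_dists, row)):
            kept.append(i)
    return kept


# Four elements, all at the same distance from the representative.
elements = [(0, 3), (3, 0), (0, -3), (-3, 0)]
representative = (0, 0)
extra_pivot = (3, 0)  # assumed additional local pivot
query, radius = (3, 0.5), 1.0

rep_only = survivors(
    [[dist(s, representative)] for s in elements],
    [dist(query, representative)],
    radius,
)
with_extra = survivors(
    [[dist(s, representative), dist(s, extra_pivot)] for s in elements],
    [dist(query, representative), dist(query, extra_pivot)],
    radius,
)
```

With the representative alone, all four elements remain candidates and would each require a distance calculation. Adding the second pivot leaves a single candidate, which is the only element actually within the query radius.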
We applied the CLAP, ACIR and SCOOP techniques to the MAM Slim-tree [4], [5] and extensively evaluated them in a set of experiments over real datasets. The datasets vary both in the number of elements and in the number of attributes. The experiments also employed distance functions of different computational costs, from linear to quadratic. The results confirm the efficiency of the techniques, as they provided significant gains in execution time, number of distance calculations and number of disk accesses in similarity queries over all datasets evaluated. Partial results of this work regarding the CLAP and ACIR techniques have been published [6]. In this paper, we extend the anticipation-of-information mechanism, resulting in the SCOOP technique; provide in-depth details regarding the three proposed techniques, including SCOOP’s construction and query algorithms; and implement the related approach Nearest-Neighbor graph (NN-graph), employed by the dynamic MAM M*-tree [7], over the same code base to enable a fair comparison. Our techniques achieved notable gains over NN-graph, proving effective at improving the performance of dynamic hierarchical MAMs.
The rest of this paper is organized as follows. Section 2 covers concepts regarding similarity queries over complex data and Section 3 presents the related work. Section 4.1 describes the CLAP technique. Sections 4.2 and 4.3 describe the anticipation techniques ACIR and SCOOP. Section 5 presents the application of the proposed techniques to the MAM Slim-tree, describing the new node structures and the algorithms for insertion and for similarity queries. Section 6 describes the experiments and discusses the results. Finally, Section 7 concludes the work.
Section snippets
Similarity queries over complex data
Most complex domains do not possess the order relation. Therefore, relational operators (e.g. <, ≤, > and ≥) cannot be employed on the elements of such domains to identify any precedence among them. It is also uncommon to employ the operators = and ≠ on complex data, since it is quite unlikely that two complex elements are identical. Take two images, for example: if a single pixel is different, then they are not equal. Hence, in complex domains, queries that consider the similarity
Related work
The approaches proposed in this paper explore the use of multiple pivots in order to improve the performance of dynamic MAMs. Specifically, our approaches employ novel techniques based on multiple pivots aiming at reducing both distance calculations and disk accesses in similarity queries, with the objective of reducing their overall execution time. Accordingly, this section presents related work whose approaches aim at improving the performance of MAMs by employing multiple pivots,
The CLAP technique
This section describes the CLAP technique, which stands for Cutting Local Additional Pivots. We developed the CLAP technique to allow reducing the number of distance calculations performed in similarity queries by dynamic hierarchical MAMs. The number of distance calculations is reduced because CLAP restricts (or cuts) the uncertainty region of each node visited in a similarity query by employing additional pivots, stored in each node of the structure.
Distinctly from other pivot-based
Node structures
The Slim-tree has two node types: index node and leaf node. A leaf node has the following structure (the characters [ and ] delimit an array): where OID_i is the identifier of the i-th element in the node; s_i is the element itself, stored as a feature vector; δ(s_i, s_rep) is the distance between s_i and its representative. An index node has the following structure: where s_i is the feature vector of the representative in the
Experimental results
The techniques presented in this paper have been evaluated in extensive experiments. We used datasets that range from moderate to high dimensionality and cardinality, and we employed metrics with different computational costs, in order to evaluate our contributions in various scenarios. To carry out the experiments, we used a computer equipped with an Intel® Core™ i5-2400 processor, 4 GB of DDR3 1333 MHz RAM, a SATA III 7200 RPM hard disk and the Ubuntu Linux 14.04.1 LTS 64-bit operating system
Conclusion
In this paper, we have presented novel techniques for improving the performance of dynamic MAMs. Designed for hierarchical MAMs based on local pivots, our techniques aim at anticipating data required to perform node pruning through the triangle inequality property of metric spaces, thus reducing both the number of distance calculations and the number of disk accesses in similarity queries. The CLAP technique employs local additional pivots to reduce the number of distance calculations,
Acknowledgements
This research was funded, partly, by the Brazilian Coordination for the Improvement of Higher Education Personnel (CAPES), by the Brazilian National Council for Scientific and Technological Development (CNPq), and by the State of São Paulo Research Foundation (FAPESP).
References (48)
- Excluded middle vantage point forests for nearest neighbor search. Proceedings of the 6th DIMACS Implementation Challenge: Near Neighbor Searches (1999)
- et al. The onion-tree: quick indexing of complex data in the main memory
- et al. Textural features for image classification. IEEE Trans. Syst. Man Cybern. (1973)
- et al. CophIR image collection under the microscope. 2nd International Workshop on Similarity Search and Applications (2009)
- Searching Multimedia Databases by Content (1996)
- Satisfying general proximity/similarity queries with metric trees. Inf. Process. Lett. (1991)
- et al. Slim-trees: high performance metric trees minimizing overlap between nodes
- et al. Fast indexing and visualization of metric data sets using slim-trees. IEEE Trans. Knowl. Data Eng. (2002)
- et al. Improving the pruning ability of dynamic metric access methods with local additional pivots and anticipation of information
- et al. Improving the performance of m-tree family by nearest-neighbor graphs