A near-optimal similarity join algorithm and performance evaluation

doi:10.1016/j.ins.2003.11.005

Information Sciences

Volume 167, Issues 1–4, 2 December 2004, Pages 87-108

https://doi.org/10.1016/j.ins.2003.11.005

Abstract

Similarity join, a basic operation for multimedia databases, reports all pairs of points whose distance is bounded by a given parameter ε. In this paper, properties of index-based join algorithms are studied and a highly efficient, near-optimal similarity join algorithm is proposed. Our algorithm uses the breadth-first strategy and guides the join computation and I/O access through the cache content. In contrast with many other proposed join algorithms, our algorithm is essentially independent of ordering strategies and requires only a minimal cache capacity. As a result, a more precise plan for the sequence of join computations and I/O accesses can be realized. In general, each page is accessed and processed only once. Qualitative and quantitative analyses of the performance of the algorithm are provided. Although only the similarity join based on the R-tree (a common index structure) is discussed in this paper, the idea can be generalized to other join algorithms without substantial difficulty. Experiments based on our analysis indicate that the new algorithm yields superior performance across a wide range of dimensions and database sizes.

Introduction

The application of similarity join, a basic operation for database queries, ranges widely from reporting “similar” pairs in a multimedia database to speeding up data mining processes. In fact, similarity join can be considered as certain combinations of a series of similarity queries.

In order to retrieve data objects efficiently, similarity query strategies usually transform data objects into multi-dimensional points (feature vectors), which are stored in an index structure. Accordingly, a similarity query amounts to a query of multi-dimensional feature points in the index structure. Similarity join combines multi-dimensional point sets so that the resulting set contains all “close” pairs of points. Although many index-based join algorithms have appeared [5], [6], [11], index-based approaches have recently been argued [5] to be inefficient, particularly in the high-dimensional case. For example, as pointed out in [5], a serious optimization conflict may exist between CPU time and I/O time: while fine-grained index structures seem to be efficient for the CPU, they tend to deteriorate the I/O performance.
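As a point of reference for the definition above, the following is a minimal brute-force sketch of a similarity join, not the algorithm proposed in this paper. It assumes Euclidean distance and in-memory point lists; the function name `similarity_join` and the sample data are hypothetical.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def similarity_join(r_points, s_points, eps):
    """Return every pair (r, s) whose Euclidean distance is at most eps.

    This nested loop compares all |R| * |S| pairs; index-based joins
    exist precisely to avoid most of these comparisons.
    """
    return [(r, s) for r in r_points for s in s_points if dist(r, s) <= eps]


# Hypothetical two-dimensional feature points.
R = [(0.0, 0.0), (1.0, 1.0)]
S = [(0.1, 0.0), (5.0, 5.0), (1.0, 1.2)]
pairs = similarity_join(R, S, eps=0.25)
```

Here `pairs` contains the two close pairs, one for each point of `R`, while the point `(5.0, 5.0)` has no partner within ε.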

In this paper, we first investigate the properties of the index-based similarity join. A comparison of several widely used join algorithms, namely R-tree spatial join (RSJ) [6], breadth-first R-tree join (BFRJ) [11] and multi-page index similarity join (MISJ) [5], shows that, to achieve local and global optimization simultaneously, the index-based join needs to be computed level by level (breadth-first), and that join computations and I/O access should be guided by the cache content, so that the sequence of I/O accesses and join computations can be planned a priori. (In this paper, the word “cache” always refers to the application cache.) However, contrary to many reported results, we point out that the index-based join algorithm is in itself independent of ordering strategies such as plane sweep and space-filling curves. Next, a near-optimal similarity join (NOSJ) algorithm is proposed. The algorithm guides the join computations and I/O access through the cache content, nearly minimizes both the required cache capacity and the number of I/O accesses, and efficiently exploits the locality of the cache. Theoretical analysis and experiments show that, in general, each page is accessed and processed only once. Moreover, we discuss how the performance of the NOSJ algorithm can be optimized with respect to CPU and I/O costs. Considering its superior performance, we base our experiments for NOSJ on the R-tree. Our experiments show that NOSJ yields superior performance across a wide range of dimensions and database sizes.

This paper is organized as follows. In Section 2, related work and background are discussed. Our NOSJ algorithm is described in Section 3. In Section 4, we optimize the NOSJ algorithm with respect to two aspects: CPU cost and I/O cost. After evaluating NOSJ experimentally in Section 5, we conclude the paper in Section 6.

Section snippets

The R-tree family

The R-tree is a natural extension of the B⁺-tree to high-dimensional cases [10]. It combines most of the advantages of B-trees and quadtrees. A non-leaf node of an R-tree contains entries of the form 〈Ptr,MBR〉, where Ptr is a pointer to a child node and MBR is the minimal bounding rectangle that encloses the MBRs of all entries in the child node. A leaf node of an R-tree contains entries of the form 〈Oid,MBR〉, where Oid is a pointer to a data object. In an R-tree, parent nodes are allowed

Analyzing the special features of the join algorithms

Let us first analyze and compare the properties of the RSJ and BFRJ algorithms. First, they use different traversal strategies: the former uses depth-first traversal, so the access pattern for pages beyond the current scope may not be captured, whereas the latter adopts breadth-first traversal, so that the join computation can be optimized both globally and locally. Next, they both depend upon the ordering strategy, and their experiments indicate that better ordering strategies can achieve better

Analyzing the CPU cost of NOSJ

It is known [5] that the CPU cost consists of the cost of directory page processing and that of data page processing, and thus the total CPU cost is $t_{CPU} = \sigma |R||S| t_{point} + |R||S| t_{box} C_{avg}^{2}$ (4.1). However, for NOSJ, besides the costs included in (4.1), the cost of maintaining the NPI and managing the cache should also be taken into account. According to (3.1), the cost of maintaining the NPI is $2 \sigma |S||R| C_{avg}^{2} t_{o}$, where $t_{o}$ is the time for removing a head-node from the Twax list. The cost for managing the cache

Experimental evaluations

In this section, we test the performance of our algorithm and verify its superiority. The NOSJ algorithm is implemented in C++. For comparison, we also implement the RSJ [6] and BFRJ [11] algorithms, optimized with the other optimization techniques introduced in [6], [11]. All experiments are carried out on a Founder PC with an Intel Celeron 1200 MHz CPU, 128 MB of memory, and the Windows XP operating system. The size of the virtual memory page file is 168 MB. The goals of our experiments are to

Conclusions

In this paper, we analyze the properties of index-based join algorithms and present a new similarity join algorithm called NOSJ. The NOSJ algorithm not only plans join computations and I/O access efficiently (the cache content guides the I/O access) without using any ordering strategy, but also nearly minimizes the required cache capacity and fully exploits the locality of the cache. In general, each page (both R-pages and S-pages) is accessed and processed only once.

References (14)

D.J. Abel, V. Gaede, R.A. Power, X.F. Zhou, Resequencing and Clustering to improve the Performance of Spatial Joins,...
N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger, The R*-tree: an efficient and robust access method for points and...
S. Berchtold, D. Keim, H.-P. Kriegel, The X-tree: an index structure for high-dimensional data, in: 22nd International...
C. Böhm, B. Braunmüller, M.M. Breunig, H.-P. Kriegel, High performance clustering based on the similarity join, in:...
C. Böhm, H.-P. Kriegel, A cost model and index architecture for the similarity join, in: Proceedings of 17th IEEE...
T. Brinkhoff, H.-P. Kriegel, B. Seeger, Efficient processing of spatial joins using R-trees, in: Proceedings of the...
T. Brinkhoff, H.-P. Kriegel, B. Seeger, Parallel processing of spatial joins using R-trees, in: Proceedings of the 12th...

There are more references available in the full text version of this article.


^☆: This paper is supported by Hi-Tech Research and Development Program of China (No. 2001AA135091), and National Natural Science Foundation of China (No. 60275021), and Key Technologies R&D Program of Shanghai, China (No. 025115023).
