Elsevier

Information Sciences

Volume 167, Issues 1–4, 2 December 2004, Pages 87-108

A near-optimal similarity join algorithm and performance evaluation

https://doi.org/10.1016/j.ins.2003.11.005

Abstract

Similarity join, a basic operation for multi-media databases, reports all pairs of points whose distance is bounded by a given parameter ε. In this paper, properties of index-based join algorithms are studied and a highly efficient, near-optimal similarity join algorithm is proposed. Our algorithm uses the breadth-first strategy and guides the join computation and I/O access through the cache content. In contrast with many other proposed join algorithms, our algorithm is essentially independent of ordering strategies and has a minimal cache capacity requirement. As a result, a more precise plan for the sequence of join computations and I/O accesses can be realized. Generally, each page can be accessed and processed in a single attempt. Qualitative and quantitative analyses of the performance of the algorithm are provided. Although only the similarity join based on the R-tree (a common index structure) is discussed in this paper, the idea can be generalized to other join algorithms without substantial difficulty. Experiments based on our analysis indicate that the new algorithm yields superior performance across a wide range of dimensions and database sizes.

Introduction

The application of similarity join, a basic operation for database queries, ranges widely from reporting “similar” pairs in a multimedia database to speeding up data mining processes. In fact, similarity join can be considered as certain combinations of a series of similarity queries.

In order to retrieve data objects efficiently, similarity query strategies usually transform data objects into multi-dimensional points (feature vectors), which are then stored in an index structure. Accordingly, a similarity query amounts to a query over multi-dimensional feature points in the index structure. Similarity join combines multi-dimensional point sets so that the resulting set contains all “close” pairs of points. Although many index-based join algorithms have appeared [5], [6], [11], index-based approaches have recently been considered [5] inefficient, particularly in the high-dimensional case. For example, as pointed out in [5], a serious optimization conflict may exist between CPU time and I/O time: while fine-grained index structures seem to be efficient in terms of CPU time, they tend to degrade the I/O performance.
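To make the operation concrete, the following is a minimal brute-force sketch of a similarity join over two small point sets. It assumes the Euclidean distance (the text only speaks of a "distance") and uses the illustrative names `similarity_join`, `R` and `S`; it is a naive reference implementation, not an index-based algorithm.

```python
import math
from itertools import product


def similarity_join(r_points, s_points, eps):
    """Return every pair (p, q), p from R and q from S, with distance <= eps.

    Assumption: the distance is Euclidean. This nested-loop version compares
    all |R| * |S| pairs and serves only as a baseline definition.
    """
    return [
        (p, q)
        for p, q in product(r_points, s_points)
        if math.dist(p, q) <= eps
    ]


# Hypothetical two-dimensional feature vectors.
R = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
S = [(0.0, 0.5), (4.0, 5.0), (9.0, 9.0)]

pairs = similarity_join(R, S, eps=1.0)
print(pairs)
```

An index-based join aims to report the same pairs while avoiding most of the pairwise comparisons.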

In this paper, we first investigate the properties of the index-based similarity join. A comparison of several widely used join algorithms, namely the R-tree spatial join (RSJ) [6], the breadth-first R-tree join (BFRJ) [11] and the multi-page index similarity join (MISJ) [5], shows that, to achieve local and global optimization simultaneously, the index-based join needs to be computed level by level (breadth-first), and that join computations and I/O access should be guided by the cache content, so that the sequence of I/O accesses and join computations can be planned a priori. (Throughout this paper, the word “cache” refers to the application cache.) However, in contrast to many reported results, we point out that the index-based join algorithm is in itself independent of ordering strategies such as plane-sweep or space-filling curves. Next, a near-optimal similarity join (NOSJ) algorithm is proposed. The algorithm guides the join computations and I/O access through the cache content, nearly minimizes the required cache capacity and the number of I/O accesses, and efficiently exploits the locality of the cache. Theoretical analysis and experiments show that, generally, each page can be accessed and processed in a single attempt. Moreover, we discuss how the performance of the NOSJ algorithm can be optimized with respect to the CPU and I/O costs. Considering its superior performance, we base our experiments for NOSJ on the R-tree. Our experiments show that NOSJ yields superior performance across a wide range of dimensions and database sizes.

This paper is organized as follows. In Section 2, related work and background are discussed. Our NOSJ algorithm is described in Section 3. In Section 4, we optimize the NOSJ algorithm with respect to two aspects: CPU cost and I/O cost. After evaluating NOSJ experimentally in Section 5, we conclude the paper in Section 6.

Section snippets

The R-tree family

The R-tree is a natural extension of the B+-tree to the high-dimensional case [10]. It combines most advantages of B-trees and quadtrees. A non-leaf node of an R-tree contains entries of the form 〈Ptr,MBR〉, where Ptr is a pointer to a child node and MBR is the minimal bounding rectangle that encloses the MBRs of all entries in the child node. A leaf node of an R-tree contains entries of the form 〈Oid,MBR〉, where Oid is a pointer to a data object. In an R-tree, father nodes are allowed
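The two entry forms described above can be sketched as small data structures. The class and method names (`MBR`, `NodeEntry`, `LeafEntry`, `enclosing`, `min_dist`) are illustrative assumptions and do not come from the paper. The `min_dist` helper is the standard minimum Euclidean distance between two rectangles, which index-based joins commonly use to discard entry pairs that cannot contain points within ε of each other; the snippet above does not state that the paper uses exactly this test.

```python
import math
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass(frozen=True)
class MBR:
    """Minimal bounding rectangle given by its lower and upper corners."""
    low: Tuple[float, ...]
    high: Tuple[float, ...]

    @staticmethod
    def enclosing(boxes: List["MBR"]) -> "MBR":
        """Smallest rectangle that encloses all the given rectangles."""
        dims = range(len(boxes[0].low))
        return MBR(
            tuple(min(b.low[i] for b in boxes) for i in dims),
            tuple(max(b.high[i] for b in boxes) for i in dims),
        )

    def min_dist(self, other: "MBR") -> float:
        """Minimum Euclidean distance between two rectangles (0 if they overlap)."""
        total = 0.0
        for lo_a, hi_a, lo_b, hi_b in zip(self.low, self.high, other.low, other.high):
            gap = max(lo_b - hi_a, lo_a - hi_b, 0.0)
            total += gap * gap
        return math.sqrt(total)


@dataclass
class LeafEntry:
    """Leaf entry <Oid, MBR>: identifier of a data object plus its rectangle."""
    oid: int
    mbr: MBR


@dataclass
class NodeEntry:
    """Non-leaf entry <Ptr, MBR>: child reference plus the enclosing rectangle."""
    ptr: Any  # reference to the child node (a page in a disk-based tree)
    mbr: MBR


# Two point objects stored in one leaf; a point is a degenerate rectangle.
leaf = [LeafEntry(1, MBR((1.0, 1.0), (1.0, 1.0))),
        LeafEntry(2, MBR((2.0, 3.0), (2.0, 3.0)))]
parent = NodeEntry(ptr=leaf, mbr=MBR.enclosing([e.mbr for e in leaf]))

other = MBR((5.0, 7.0), (6.0, 8.0))
print(parent.mbr, parent.mbr.min_dist(other))
```

Here the parent rectangle spans (1, 1) to (2, 3), and its gap to `other` is 3 along the first axis and 4 along the second.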

Analyzing the special features of the join algorithms

Let us first analyze and compare the properties of the RSJ and the BFRJ algorithms. First, they use different traversal strategies. The former uses depth-first traversal, and consequently the access pattern for pages beyond the current scope may not be captured; the latter adopts breadth-first traversal, so that the join computation can be optimized both globally and locally. Next, they both depend upon the ordering strategy, and their experiments indicate that better ordering strategies can achieve better
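The breadth-first idea can be illustrated with a simplified level-by-level join over two small in-memory trees. This is a sketch under stated assumptions, not the RSJ, BFRJ or NOSJ algorithm: both trees are assumed to have the same height, the distance is assumed to be Euclidean, and paging, caching and ordering are ignored entirely. The names `Node`, `point`, `group` and `breadth_first_join` are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Tree node with a bounding rectangle; a data point has no children."""
    low: tuple
    high: tuple
    children: list = field(default_factory=list)


def point(*coords):
    """A data point, stored as a degenerate rectangle."""
    return Node(coords, coords)


def group(children):
    """A node whose rectangle encloses the rectangles of all its children."""
    dims = range(len(children[0].low))
    return Node(
        tuple(min(c.low[i] for c in children) for i in dims),
        tuple(max(c.high[i] for c in children) for i in dims),
        children,
    )


def min_dist(a, b):
    """Minimum Euclidean distance between the rectangles of two nodes."""
    total = 0.0
    for la, ha, lb, hb in zip(a.low, a.high, b.low, b.high):
        gap = max(lb - ha, la - hb, 0.0)
        total += gap * gap
    return math.sqrt(total)


def breadth_first_join(root_r, root_s, eps):
    """Join two trees of equal height one level at a time.

    After each pass, `level` holds every candidate pair of that tree level,
    so the complete set of pairs to be processed next is known in advance.
    """
    level = [(root_r, root_s)] if min_dist(root_r, root_s) <= eps else []
    while level and level[0][0].children:
        level = [
            (cr, cs)
            for nr, ns in level
            for cr in nr.children
            for cs in ns.children
            if min_dist(cr, cs) <= eps
        ]
    return [(r.low, s.low) for r, s in level]


tree_r = group([group([point(0.0, 0.0), point(1.0, 1.0)]),
                group([point(5.0, 5.0), point(6.0, 6.0)])])
tree_s = group([group([point(0.0, 0.5), point(1.2, 1.0)]),
                group([point(9.0, 9.0), point(5.0, 5.5)])])

result = breadth_first_join(tree_r, tree_s, eps=1.0)
print(result)
```

Because all candidate pairs of a level are available before the next level is visited, a scheduler could in principle plan page accesses globally; a depth-first traversal sees only the pairs under the current node.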

Analyzing the CPU cost of NOSJ

It is known [5] that the CPU cost consists of the cost of directory page processing and that of data page processing, and thus the total CPU cost is

\[ t_{CPU} = \sigma |R| |S| \, t_{point} + \frac{|R| |S|}{C_{avg}^{2}} \, t_{box}. \tag{4.1} \]

However, for NOSJ, besides the costs included in (4.1), the cost for maintaining the NPI and managing the cache should also be taken into account. According to (3.1), the cost for maintaining the NPI is

\[ \frac{2 \sigma |S| |R|}{C_{avg}^{2}} \, t_{o}, \]

where \(t_{o}\) is the time for removing a head-node from the Twax list. The cost for managing the cache

Experimental evaluations

In this section, we test the performance of our algorithm and verify its superiority. The NOSJ algorithm is implemented in C++. For comparison, we also implement the RSJ [6] and BFRJ [11] algorithms, optimized with the other optimizing techniques introduced in [6], [11]. All the experiments are carried out on a Founder PC with an Intel Celeron 1200 MHz CPU, 128 MB of memory, and the Windows XP operating system. The size of the virtual memory page file is 168 MB. The goals of our experiments are to

Conclusions

In this paper, we analyze the properties of index-based join algorithms and present a new similarity join algorithm called NOSJ. The NOSJ algorithm can not only efficiently plan join computations and I/O access (the cache content guides the I/O access) without using any ordering strategies, but also nearly minimize the capacity of the demanded cache and fully utilize the locality of the cache. In general, each page (both R-page and S-page) can be accessed and processed with only one attempt.

References (14)

  • D.J. Abel, V. Gaede, R.A. Power, X.F. Zhou, Resequencing and Clustering to improve the Performance of Spatial Joins,...
  • N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger, The R∗-tree: an efficient and robust access method for points and...
  • S. Berchtold, D. Keim, H.-P. Kriegel, The X-tree: an index structure for high-dimensional data, in: 22nd International...
  • C. Böhm, B. Braunmüller, M.M. Breunig, H.-P. Kriegel, High performance clustering based on the similarity join, in:...
  • C. Böhm, H.-P. Kriegel, A cost model and index architecture for the similarity join, in: Proceedings of 17th IEEE...
  • T. Brinkhoff, H.-P. Kriegel, B. Seeger, Efficient processing of spatial joins using R-trees, in: Proceedings of the...
  • T. Brinkhoff, H.-P. Kriegel, B. Seeger, Parallel processing of spatial joins using R-trees, in: Proceedings of the 12th...

This work is supported by the Hi-Tech Research and Development Program of China (No. 2001AA135091), the National Natural Science Foundation of China (No. 60275021), and the Key Technologies R&D Program of Shanghai, China (No. 025115023).
