Article

Epsilon grid order: an algorithm for the similarity join on massive high-dimensional data

Authors:
Christian Böhm, Bernhard Braunmüller, Florian Krebs, Hans-Peter Kriegel

Institute for Computer Science, University of Munich, Oettingenstr. 67, D-80538 München, Germany
SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data, May 2001, Pages 379–388
https://doi.org/10.1145/375663.375714

Published: 01 May 2001

ABSTRACT

The similarity join is an important database primitive which has been successfully applied to speed up applications such as similarity search, data analysis and data mining. The similarity join combines two point sets of a multidimensional vector space such that the result contains all point pairs where the distance does not exceed a parameter ε. In this paper, we propose the Epsilon Grid Order, a new algorithm for determining the similarity join of very large data sets. Our solution is based on a particular sort order of the data points, which is obtained by laying an equi-distant grid with cell length ε over the data space and comparing the grid cells lexicographically. A typical problem of grid-based approaches such as MSJ or the ε-kdB-tree is that large portions of the data sets must be held simultaneously in main memory. Therefore, these approaches do not scale to large data sets. Our technique avoids this problem by an external sorting algorithm and a particular scheduling strategy during the join phase. In the experimental evaluation, a substantial improvement over competitive techniques is shown.
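The sort order described in the abstract can be illustrated with a small in-memory sketch. It is not the authors' implementation: the external sorting and the scheduling strategy of the join phase are not shown, and the function names (`grid_cell`, `epsilon_grid_sort`, `sorted_join`, `brute_force_join`) are hypothetical. The sketch assumes Euclidean distance and points given as tuples. It relies on one observation that follows from the grid definition: if two points are within ε of each other, their cell indices differ by at most 1 in every dimension, so in the lexicographic cell order all join partners of a point lie between its cell shifted by -1 and by +1 in every dimension.

```python
import bisect
import math


def grid_cell(point, eps):
    """Integer coordinates of the grid cell (cell length eps) containing point."""
    return tuple(math.floor(x / eps) for x in point)


def epsilon_grid_sort(points, eps):
    """Sort points by lexicographic comparison of their grid cells."""
    return sorted(points, key=lambda p: grid_cell(p, eps))


def brute_force_join(r, s, eps):
    """Reference result: all pairs (p, q) whose Euclidean distance is <= eps."""
    return {(p, q) for p in r for q in s if math.dist(p, q) <= eps}


def sorted_join(r, s, eps):
    """In-memory join over the grid-sorted order (illustrative assumption,
    not the paper's external-memory algorithm)."""
    s_sorted = epsilon_grid_sort(s, eps)
    s_cells = [grid_cell(q, eps) for q in s_sorted]
    result = set()
    for p in r:
        cell = grid_cell(p, eps)
        # Join partners can only lie in the interval of the sort order
        # bounded by the cell shifted by -1 and +1 in every dimension.
        lo = bisect.bisect_left(s_cells, tuple(c - 1 for c in cell))
        hi = bisect.bisect_right(s_cells, tuple(c + 1 for c in cell))
        for q in s_sorted[lo:hi]:
            # The interval is only a filter; the distance test decides.
            if math.dist(p, q) <= eps:
                result.add((p, q))
    return result


if __name__ == "__main__":
    r = [(0.0, 0.0), (2.0, 2.0), (5.0, 5.0)]
    s = [(0.5, 0.5), (2.0, 3.0), (9.0, 9.0), (-0.5, 0.0)]
    print(sorted(sorted_join(r, s, 1.0)))
```

The interval in the sort order is a coarse filter: it can contain points that are far away in the later dimensions, which is why every candidate is still checked against ε. `math.dist` requires Python 3.8 or later.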

Index Terms

Computing methodologies > Machine learning > Learning paradigms > Unsupervised learning > Cluster analysis
Information systems > Information systems applications > Data mining

Published in
SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data
May 2001, 630 pages
ISBN: 1581133324
DOI: 10.1145/375663
Editors: Timos Sellis, Sharad Mehrotra

ACM SIGMOD Record, Volume 30, Issue 2
June 2001, 625 pages
ISSN: 0163-5808
DOI: 10.1145/376284
Editors: Timos Sellis (National Technical Univ. of Athens), Sharad Mehrotra (Univ. of California at Irvine)

Copyright © 2001 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data mining
feature transformation
high-dimensional space
knowledge discovery
similarity join
similarity search
Qualifiers
- Article
Conference

Acceptance Rates
SIGMOD '01 Paper Acceptance Rate: 44 of 293 submissions, 15%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%