ABSTRACT
The similarity join is an important database primitive that has been successfully applied to speed up applications such as similarity search, data analysis, and data mining. It combines two point sets of a multidimensional vector space such that the result contains all point pairs whose distance does not exceed a parameter ε. In this paper, we propose the Epsilon Grid Order, a new algorithm for determining the similarity join of very large data sets. Our solution is based on a particular sort order of the data points, which is obtained by laying an equidistant grid with cell length ε over the data space and comparing the grid cells lexicographically. A typical problem of grid-based approaches such as MSJ or the ε-kdB-tree is that large portions of the data sets must be held simultaneously in main memory, so these approaches do not scale to large data sets. Our technique avoids this problem by means of an external sorting algorithm and a particular scheduling strategy during the join phase. The experimental evaluation shows a substantial improvement over competing techniques.
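The sort order described in the abstract can be illustrated with a minimal in-memory sketch. It is not the algorithm of the paper: the external sorting and the scheduling strategy of the join phase are omitted, and the Euclidean distance is an assumption, since the abstract does not name a metric. The function names grid_cell, ego_sort and similarity_join are hypothetical and chosen only for this illustration. The sketch relies on one observation: if two points are within ε of each other, their grid cell indices differ by at most 1 in every dimension, so all join partners of a point lie in a contiguous range of the lexicographically sorted sequence.

```python
import bisect
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Cell = Tuple[int, ...]


def grid_cell(p: Sequence[float], eps: float) -> Cell:
    """Integer coordinates of the grid cell (side length eps) that contains p."""
    return tuple(math.floor(x / eps) for x in p)


def ego_sort(points: Sequence[Point], eps: float) -> List[Point]:
    """Sort points by the lexicographic order of their grid cells."""
    return sorted(points, key=lambda p: grid_cell(p, eps))


def similarity_join(
    r: Sequence[Point], s: Sequence[Point], eps: float
) -> List[Tuple[Point, Point]]:
    """All pairs (p, q) from r x s with Euclidean distance <= eps (assumed metric).

    Only s is sorted here; everything is held in memory, unlike the
    external-memory setting the abstract targets.
    """
    s_sorted = ego_sort(s, eps)
    s_cells = [grid_cell(q, eps) for q in s_sorted]
    result = []
    for p in r:
        cell = grid_cell(p, eps)
        # Partners differ by at most one cell per dimension, so they lie
        # between these two cells in lexicographic order.
        lower = tuple(c - 1 for c in cell)
        upper = tuple(c + 1 for c in cell)
        lo = bisect.bisect_left(s_cells, lower)
        hi = bisect.bisect_right(s_cells, upper)
        # The range is a superset of the true partners; verify each candidate.
        for q in s_sorted[lo:hi]:
            if math.dist(p, q) <= eps:
                result.append((p, q))
    return result
```

The candidate range found by the two binary searches can still contain points that are far away in later dimensions, which is why each candidate is checked against the distance threshold before it is reported.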