DOI: 10.1145/375663.375714
Article

Epsilon grid order: an algorithm for the similarity join on massive high-dimensional data

Published: 01 May 2001

ABSTRACT

The similarity join is an important database primitive which has been successfully applied to speed up applications such as similarity search, data analysis and data mining. The similarity join combines two point sets of a multidimensional vector space such that the result contains all point pairs whose distance does not exceed a parameter ε. In this paper, we propose the Epsilon Grid Order, a new algorithm for determining the similarity join of very large data sets. Our solution is based on a particular sort order of the data points, which is obtained by laying an equidistant grid with cell length ε over the data space and comparing the grid cells lexicographically. A typical problem of grid-based approaches such as MSJ or the ε-kdB-tree is that large portions of the data sets must be held simultaneously in main memory. Therefore, these approaches do not scale to large data sets. Our technique avoids this problem by using an external sorting algorithm and a particular scheduling strategy during the join phase. The experimental evaluation shows a substantial improvement over competing techniques.
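The sort order described in the abstract can be illustrated with a minimal in-memory sketch. It is not the paper's algorithm: the external sort and the scheduling strategy are omitted, the Euclidean distance is an assumption (the abstract only says "distance"), and the names `cell`, `ego_sorted`, and `similarity_join` are hypothetical. The sketch maps each point to its grid cell of side length ε, sorts by lexicographic cell order, and relies on the fact that two points within distance ε must lie in cells whose coordinates differ by at most 1 in every dimension.

```python
import math
from bisect import bisect_left


def cell(point, eps):
    """Grid cell of a point for a grid with cell length eps."""
    return tuple(math.floor(x / eps) for x in point)


def ego_sorted(points, eps):
    """Sort points by lexicographic order of their grid cells."""
    return sorted(points, key=lambda p: cell(p, eps))


def similarity_join(r_points, s_points, eps):
    """Return all pairs (r, s) with Euclidean distance <= eps (assumed metric)."""
    s_sorted = ego_sorted(s_points, eps)
    s_cells = [cell(s, eps) for s in s_sorted]
    result = []
    for r in ego_sorted(r_points, eps):
        rc = cell(r, eps)
        # The first cell coordinate is non-decreasing in the sorted order,
        # so all candidates form one contiguous run: first coordinate
        # between rc[0] - 1 and rc[0] + 1.
        i = bisect_left(s_cells, (rc[0] - 1,))
        while i < len(s_sorted) and s_cells[i][0] <= rc[0] + 1:
            # Cheap cell-level filter before the exact distance test.
            if all(abs(a - b) <= 1 for a, b in zip(rc, s_cells[i])):
                if math.dist(r, s_sorted[i]) <= eps:
                    result.append((r, s_sorted[i]))
            i += 1
    return result


if __name__ == "__main__":
    R = [(0.1, 0.1), (0.9, 0.9), (5.0, 5.0)]
    S = [(0.2, 0.1), (1.0, 1.0), (3.0, 3.0)]
    for pair in similarity_join(R, S, 0.5):
        print(pair)
```

This sketch keeps both sets entirely in memory, which is exactly the limitation the abstract attributes to earlier grid-based approaches; it is meant only to make the ordering and the neighboring-cell pruning concrete.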

      • Published in

        SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data
        May 2001
        630 pages
        ISBN:1581133324
        DOI:10.1145/375663

        Copyright © 2001 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        SIGMOD '01 Paper Acceptance Rate: 44 of 293 submissions, 15%
        Overall Acceptance Rate: 785 of 4,003 submissions, 20%