
The k-Nearest Neighbour Join: Turbo Charging the KDD Process

Knowledge and Information Systems

Abstract

The similarity join has become an important database primitive for supporting similarity searches and data mining. A similarity join combines two sets of complex objects such that the result contains all pairs of similar objects. Two types of similarity join are well known: the distance range join, in which the user defines a distance threshold for the join, and the closest pair query or k-distance join, which retrieves the k most similar pairs. In this paper, we propose an important third similarity join operation called the k-nearest neighbour join, which combines each point of one point set with its k nearest neighbours in the other set. We discover that many standard algorithms of Knowledge Discovery in Databases (KDD), such as k-means and k-medoid clustering, nearest neighbour classification, data cleansing, and postprocessing of sampling-based data mining, can be implemented on top of the k-nn join operation to achieve performance improvements without affecting the quality of the result of these algorithms. We propose a new algorithm to compute the k-nearest neighbour join using the multipage index (MuX), a specialised index structure for the similarity join. To reduce both CPU and I/O costs, we develop optimal loading and processing strategies.
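To make the difference between the three join operations concrete, the following is a minimal brute-force sketch in Python. It is not the MuX-based algorithm proposed in the paper. It uses naive nested loops, assumes points are tuples of floats compared by Euclidean distance, and all function names (`distance_range_join`, `k_distance_join`, `knn_join`, `assign_to_centroids`) are hypothetical names chosen for illustration.

```python
import heapq
import math
from itertools import product


def distance_range_join(R, S, eps):
    """All pairs (r, s) whose Euclidean distance is at most eps."""
    return [(r, s) for r, s in product(R, S) if math.dist(r, s) <= eps]


def k_distance_join(R, S, k):
    """The k closest pairs over the whole cross product (closest pair query)."""
    return heapq.nsmallest(k, product(R, S), key=lambda p: math.dist(p[0], p[1]))


def knn_join(R, S, k):
    """Each point of R combined with its k nearest neighbours in S.

    Returns a dict keyed by the points of R, so duplicate points in R
    collapse into a single entry in this simplified sketch.
    """
    return {r: heapq.nsmallest(k, S, key=lambda s: math.dist(r, s)) for r in R}


def assign_to_centroids(points, centroids):
    """Assignment step of k-means expressed as a 1-nn join (illustrative only)."""
    return {p: nearest[0] for p, nearest in knn_join(points, centroids, 1).items()}


R = [(0.0, 0.0), (10.0, 10.0)]
S = [(1.0, 0.0), (0.0, 2.0), (9.0, 10.0), (20.0, 20.0)]

# Every point of R appears in the k-nn join result with exactly k partners.
neighbours = knn_join(R, S, 2)
```

Unlike the distance range join, the k-nn join needs no distance threshold, and unlike the k-distance join, its result covers every point of the first set. The `assign_to_centroids` helper hints at why clustering algorithms such as k-means can be layered on top of it: assigning each point to its closest centroid is a join with k = 1.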



Author information


Corresponding author

Correspondence to Christian Böhm.


About this article

Cite this article

Böhm, C., Krebs, F. The k-Nearest Neighbour Join: Turbo Charging the KDD Process. Knowl. Inf. Syst. 6, 728–749 (2004). https://doi.org/10.1007/s10115-003-0122-9


  • DOI: https://doi.org/10.1007/s10115-003-0122-9
