Abstract
The similarity join has become an important database primitive for supporting similarity search and data mining. A similarity join combines two sets of complex objects such that the result contains all pairs of similar objects. Two types of similarity join are well known: the distance range join, in which the user defines a distance threshold for the join, and the closest pair query or k-distance join, which retrieves the k most similar pairs. In this paper, we propose a third important similarity join operation, the k-nearest neighbour join (k-nn join), which combines each point of one point set with its k nearest neighbours in the other set. We show that many standard algorithms of Knowledge Discovery in Databases (KDD), such as k-means and k-medoid clustering, nearest neighbour classification, data cleansing, and postprocessing of sampling-based data mining, can be implemented on top of the k-nn join operation to achieve performance improvements without affecting the quality of their results. We propose a new algorithm that computes the k-nearest neighbour join using the multipage index (MuX), a specialised index structure for the similarity join. To reduce both CPU and I/O costs, we develop optimal loading and processing strategies.
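The three join operations named in the abstract differ only in which pairs they return, and a small brute-force sketch may make the distinction concrete. It is not the MuX-based algorithm proposed in the paper: it scans all pairs, assumes points are tuples of floats compared by Euclidean distance, and the function names are illustrative rather than taken from the article.

```python
import heapq
import math
from itertools import product


def distance_range_join(R, S, eps):
    """All pairs (r, s) whose distance does not exceed the threshold eps."""
    return [(r, s) for r, s in product(R, S) if math.dist(r, s) <= eps]


def k_distance_join(R, S, k):
    """The k closest pairs over the whole cross product of R and S."""
    return heapq.nsmallest(k, product(R, S), key=lambda pair: math.dist(*pair))


def knn_join(R, S, k):
    """For each point r in R, its k nearest neighbours in S."""
    return {r: heapq.nsmallest(k, S, key=lambda s: math.dist(r, s)) for r in R}


# Illustrative data only (not from the article).
R = [(0.0, 0.0), (10.0, 10.0)]
S = [(1.0, 0.0), (0.0, 2.0), (9.0, 10.0), (20.0, 20.0)]

range_pairs = distance_range_join(R, S, eps=1.5)
closest_pairs = k_distance_join(R, S, k=3)
neighbours = knn_join(R, S, k=2)
```

Note that the k-nn join returns exactly k partners for every point of R, however far away they are, whereas the other two joins may leave some points of R without any partner.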
Cite this article
Böhm, C., Krebs, F. The k-Nearest Neighbour Join: Turbo Charging the KDD Process. Know. Inf. Sys. 6, 728–749 (2004). https://doi.org/10.1007/s10115-003-0122-9