The k-Nearest Neighbour Join: Turbo Charging the KDD Process

Böhm, Christian; Krebs, Florian

doi:10.1007/s10115-003-0122-9


Published: 27 February 2004

Volume 6, pages 728–749, (2004)

Knowledge and Information Systems

Christian Böhm¹ &
Florian Krebs²


Abstract

The similarity join has become an important database primitive for supporting similarity searches and data mining. A similarity join combines two sets of complex objects such that the result contains all pairs of similar objects. Two types of similarity join are well known: the distance range join, in which the user defines a distance threshold for the join, and the closest pair query or k-distance join, which retrieves the k most similar pairs. In this paper, we propose an important third similarity join operation called the k-nearest neighbour join, which combines each point of one point set with its k nearest neighbours in the other set. We discover that many standard algorithms of Knowledge Discovery in Databases (KDD), such as k-means and k-medoid clustering, nearest neighbour classification, data cleansing, and postprocessing of sampling-based data mining, can be implemented on top of the k-nn join operation to achieve performance improvements without affecting the quality of the result of these algorithms. We propose a new algorithm to compute the k-nearest neighbour join using the multipage index (MuX), a specialised index structure for the similarity join. To reduce both CPU and I/O costs, we develop optimal loading and processing strategies.
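
To make the difference between the three join types concrete, the following is a minimal brute-force sketch in Python. It is not the MuX-based algorithm proposed in the paper; it only illustrates the result sets by comparing every pair of points. The function names, the Euclidean distance, and the sample points are assumptions chosen for illustration.

```python
import heapq
import math


def distance_range_join(R, S, eps):
    """All pairs (r, s) whose distance does not exceed the threshold eps."""
    return [(r, s) for r in R for s in S if math.dist(r, s) <= eps]


def k_distance_join(R, S, k):
    """The k closest pairs overall (closest pair query / k-distance join)."""
    pairs = ((math.dist(r, s), r, s) for r in R for s in S)
    return [(r, s) for _, r, s in heapq.nsmallest(k, pairs)]


def knn_join(R, S, k):
    """For each point r in R, its k nearest neighbours in S (k-nn join).

    Naive O(|R| * |S|) scan; an index structure would avoid most comparisons.
    """
    return {r: heapq.nsmallest(k, S, key=lambda s: math.dist(r, s)) for r in R}


# Hypothetical sample data: two small 2-d point sets.
R = [(0.0, 0.0), (10.0, 10.0)]
S = [(1.0, 0.0), (0.0, 2.0), (9.0, 10.0), (20.0, 20.0)]

range_result = distance_range_join(R, S, eps=1.5)
closest_result = k_distance_join(R, S, k=3)
knn_result = knn_join(R, S, k=2)
```

In this sketch the range join and the k-distance join may return any number of partners for a given point of R, including none, whereas the k-nn join returns exactly k partners for every point of R (as long as S holds at least k points). That per-point guarantee is what makes the operation a natural building block for tasks such as assigning each point to its nearest cluster centre (a k-nn join with k = 1) or nearest neighbour classification.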

Author information

Authors and Affiliations

University of Munich, Oettingenstr. 67, 80538, München, Germany
Christian Böhm
University of Munich, München, Germany
Florian Krebs

Corresponding author

Correspondence to Christian Böhm.

About this article

Cite this article

Böhm, C., Krebs, F. The k-Nearest Neighbour Join: Turbo Charging the KDD Process. Know. Inf. Sys. 6, 728–749 (2004). https://doi.org/10.1007/s10115-003-0122-9

Received: 09 December 2002
Revised: 07 February 2003
Accepted: 12 May 2003
Published: 27 February 2004
Issue Date: November 2004
DOI: https://doi.org/10.1007/s10115-003-0122-9
