GORDER: An Efficient Method for KNN Join Processing
Publisher Summary
This chapter introduces Gorder, an efficient method for k-nearest neighbor (KNN) join processing. The KNN join is an important but very expensive primitive operation in high-dimensional databases. Gorder is a block nested loop join method that exploits sorting, join scheduling, and distance computation filtering and reduction to reduce both I/O and CPU costs. It sorts the input datasets into the G-order and applies the scheduled block nested loop join to the G-ordered data; distance computation reduction is then employed to further reduce CPU cost. The method is simple yet handles high-dimensional data efficiently. Experiments on both synthetic cluster and real-life datasets show that Gorder is an efficient KNN-join method that outperforms existing methods by a wide margin.
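The block nested loop structure described above can be illustrated with a minimal sketch. This is not the Gorder algorithm: the G-order sort and the join scheduling are omitted entirely, and the only CPU saving shown is early abandonment of a partial distance sum, which is an assumed stand-in for the chapter's distance computation reduction. All names (`knn_join`, `bounded_sq_dist`, `block_size`) are hypothetical.

```python
import heapq
import math
from typing import Dict, Iterator, List, Sequence, Tuple

Point = Sequence[float]


def bounded_sq_dist(a: Point, b: Point, bound: float) -> float:
    """Squared Euclidean distance, abandoned once the partial sum exceeds `bound`."""
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
        if total > bound:
            return math.inf  # cannot improve the current k nearest
    return total


def blocks(n: int, size: int) -> Iterator[range]:
    """Yield index ranges that split 0..n-1 into blocks of at most `size`."""
    for start in range(0, n, size):
        yield range(start, min(start + size, n))


def knn_join(r: List[Point], s: List[Point], k: int,
             block_size: int = 64) -> Dict[int, List[Tuple[float, int]]]:
    """For each point in r, return its k nearest points in s as (distance, index)."""
    # heaps[i] is a max-heap (via negated squared distances) of the best k so far.
    heaps: List[List[Tuple[float, int]]] = [[] for _ in r]
    for r_block in blocks(len(r), block_size):        # outer loop over blocks of R
        for s_block in blocks(len(s), block_size):    # inner loop over blocks of S
            for i in r_block:
                heap = heaps[i]
                for j in s_block:
                    # Pruning bound: current k-th smallest squared distance.
                    bound = -heap[0][0] if len(heap) == k else math.inf
                    d = bounded_sq_dist(r[i], s[j], bound)
                    if len(heap) < k:
                        heapq.heappush(heap, (-d, j))
                    elif d < bound:
                        heapq.heapreplace(heap, (-d, j))
    return {i: sorted((math.sqrt(-nd), j) for nd, j in heap)
            for i, heap in enumerate(heaps)}


if __name__ == "__main__":
    r_points = [(0.0, 0.0), (10.0, 10.0)]
    s_points = [(1.0, 0.0), (0.0, 2.0), (9.0, 10.0), (20.0, 20.0)]
    for i, neighbours in knn_join(r_points, s_points, k=2, block_size=2).items():
        print(i, neighbours)
```

In a disk-based setting each block would correspond to a page read, which is where sorting and scheduling the blocks could reduce I/O; the in-memory lists here only mirror the loop order.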
Cited by (54)
Improving Distance-Join Query processing with Voronoi-Diagram based partitioning in SpatialHadoop
2020, Future Generation Computer Systems
SpatialHadoop is an extended MapReduce framework supporting global indexing techniques that partition spatial datasets across several machines and improve spatial query processing performance compared to traditional Hadoop systems. SpatialHadoop supports several spatial operations (e.g., Nearest Neighbor search, range query, spatial intersection join, etc.) and seven spatial partitioning techniques (Grid, Quadtree, STR, STR+, k-d tree, Z-curve and Hilbert-curve). Distance-Join Queries (DJQs), like the Nearest Neighbors Join Query (NNJQ) and Closest Pairs Query (CPQ), are common operations used in numerous spatial applications. DJQs are costly operations, since they combine spatial joins with distance-based search. Data partitioning improves the management of large datasets and speeds up query performance. Therefore, performing DJQs efficiently with new partitioning methods in SpatialHadoop is a challenging task. In this paper, a new data partitioning technique based on Voronoi-Diagrams is designed and implemented in SpatialHadoop. Moreover, improved NNJQ and CPQ MapReduce algorithms, using the new partitioning mechanism, are also designed and developed for SpatialHadoop. Finally, the results of an extensive set of experiments with real-world datasets are presented, demonstrating that the new partitioning technique and the improved DJQ MapReduce algorithms are efficient, scalable and robust in SpatialHadoop.
Efficient processing of all-k-nearest-neighbor queries in the MapReduce programming framework
2019, Data and Knowledge Engineering
Numerous modern applications, from social networking to astronomy, need efficient answering of queries on spatial data. One such query is the All Nearest-Neighbor Query, or Nearest-Neighbor Join, that takes as input two datasets and, for each object of the first one, returns the nearest-neighbors from the second one. It is a combination of the nearest-neighbor and join queries and is computationally demanding. Especially when the datasets involved fall in the category of Big Data, a single machine cannot process it efficiently. Only in the last few years have papers proposing solutions for distributed computing environments appeared in the literature. In this paper, we focus on parallel and distributed algorithms using the Apache Hadoop framework. More specifically, we focus on an algorithm that was recently presented in the literature and propose improvements to tackle three major challenges that distributed processing faces: improvement of load balancing (we implement an adaptive partitioning scheme based on Quadtrees), acceleration of local processing (we prune points during calculations by utilizing plane-sweep processing), and reduction of network traffic (we restructure and reduce the output size of the most demanding phase of computation). Moreover, by using real 2D and 3D datasets, we experimentally study the effect of each improvement and their combinations on the performance of this algorithm. Experiments show that by carefully addressing the three aforementioned issues, one can achieve significantly better performance. We thereby arrive at a new scalable algorithm that adapts to the data distribution and significantly outperforms its predecessor. Moreover, we present an experimental comparison of our algorithm against other well-known MapReduce algorithms for the same query and show that these algorithms are also significantly outperformed.
Efficient multiple bichromatic mutual nearest neighbor query processing
2016, Information Systems
In this paper we propose, motivate and solve multiple bichromatic mutual nearest neighbor queries in the plane considering multiplicative weighted Euclidean distances. Given two sets of facilities of different types, a multiple bichromatic mutual k-nearest neighbor query finds pairs of points, one from each set, such that the point of the first set is a k-nearest neighbor of the point of the second set and, at the same time, the point of the second set is a k-nearest neighbor of the point of the first set. These queries find applications in collaborative marketing and prospective data analysis, where facilities of one type cooperate with facilities of the other type to obtain reciprocal benefits. We present a sequential and a parallel algorithm, to be run on the CPU and on a Graphics Processing Unit, respectively, for solving multiple bichromatic mutual nearest neighbor queries. We also present the time and space complexity analysis of both algorithms, together with their theoretical comparison. Finally, we provide and discuss experimental results obtained with the implementation of the proposed sequential and parallel algorithms.
Solving multiple kth smallest dissimilarity queries for non-metric dissimilarities with the GPU
2016, Information Sciences
The kth smallest dissimilarity of a query point with respect to a given set is the dissimilarity that ranks number k when we sort, in increasing order, the dissimilarity values of the points in the set with respect to the query point. A multiple kth smallest dissimilarity query determines the kth smallest dissimilarity for several query points simultaneously. Although the problem of solving multiple kth smallest dissimilarity queries is an important primitive operation used in many areas, such as spatial data analysis, facility location, text classification and content-based image retrieval, it has not been previously addressed explicitly in the literature. In this paper we present three parallel strategies, to be run on a Graphics Processing Unit, for computing multiple kth smallest dissimilarity queries when non-metric dissimilarities, which do not satisfy the triangle inequality, are used. The strategies are theoretically and experimentally analyzed and compared with one another and with an efficient sequential strategy to solve the problem.
Similarity joins: Their implementation and interactions with other database operators
2015, Information Systems
Similarity Joins are extensively used in multiple application domains and are recognized among the most useful data processing and analysis operations. They retrieve all data pairs whose distances are smaller than a predefined threshold ε. While several standalone implementations have been proposed, very little work has addressed the implementation of Similarity Joins as physical database operators. In this paper, we focus on the study, design, implementation, and optimization of a Similarity Join database operator for metric spaces. We present DBSimJoin, a physical database operator that integrates techniques to: enable a non-blocking behavior, prioritize the early generation of results, and fully support the database iterator interface. The proposed operator can be used with multiple distance functions and data types. We describe the changes in each query engine module to implement DBSimJoin and provide details of our implementation in PostgreSQL. We also study ways in which DBSimJoin can be combined with other similarity and non-similarity operators to answer more complex queries, and how DBSimJoin can be used in query transformation rules to improve query performance. The extensive performance evaluation shows that DBSimJoin significantly outperforms alternative approaches and scales very well when important parameters like ε, data size, and number of dimensions increase.
Efficient continuous kNN join over dynamic high-dimensional data
2023, World Wide Web