Elsevier

Information Systems

Volume 87, January 2020, 101410

Pivot-based approximate k-NN similarity joins for big high-dimensional data

https://doi.org/10.1016/j.is.2019.06.006

Highlights

  • Study of approximate k-NN similarity joins for big high-dimensional data.

  • Pivot-based k-NN join methods supporting various levels of approximation guarantee.

  • Implementation and algorithm extensions with publicly available source code.

  • Comprehensive experiments using high-dimensional data and popular Big Data systems.

Abstract

Given an appropriate similarity model, the k-nearest neighbor (k-NN) similarity join represents a useful yet costly operator for data mining, data analysis, and data exploration applications. The time needed to evaluate the operator depends on the dataset sizes, the data distribution, and the dimensionality of the data representations. For vast volumes of high-dimensional data, only distributed and approximate approaches make the join practically feasible. In this paper, we investigate and evaluate the performance of multiple MapReduce-based approximate k-NN similarity join approaches on two leading Big Data systems: Apache Hadoop and Apache Spark. Focusing on the metric space approach, which relies on reference dataset objects (pivots), this paper investigates distributed similarity join techniques with and without approximation guarantees and also proposes high-dimensional extensions to previously proposed algorithms. The paper describes the design guidelines, algorithmic details, and key theoretical underpinnings of the compared approaches and presents an empirical evaluation of the performance, approximation precision, and scalability of the implemented algorithms. Moreover, the Spark source code of all these algorithms has been made publicly available. The key findings of the experimental analysis are that randomly initialized pivot-based methods perform well with big high-dimensional data and that, in general, the selection of the best algorithm depends on the desired levels of approximation guarantee, precision, and execution time.

Introduction

The k-nearest neighbor (k-NN) similarity join is an asymmetric operation that returns, for each query object in a dataset R, the k most similar objects in a dataset S. In recent years, the study of k-NN joins has attracted considerable attention due to their applicability in various domains. In the data mining and machine learning context, k-NN joins can be employed as a preprocessing step for classification or cluster analysis. In data exploration and information retrieval, similarity joins provide a similarity graph that links each object in the database to potentially relevant entities. k-NN similarity join applications can be found, for example, in image and video retrieval [1], [2], [3], [4], spatial databases [5], pattern recognition [6], and network communication analysis and malware detection frameworks [7], [8].
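
To make the operation concrete, the following minimal Python sketch evaluates a brute-force k-NN join under the Euclidean distance. It is an illustrative baseline only; the function name and toy data are not part of the evaluated implementations, which avoid exactly this quadratic distance computation.

```python
import math

def knn_join(R, S, k):
    """For each query point in R, return the k nearest points in S
    under the Euclidean distance (a naive O(|R| * |S| log |S|) baseline)."""
    return {i: sorted(S, key=lambda s: math.dist(q, s))[:k]
            for i, q in enumerate(R)}

R = [(0.0, 0.0), (5.0, 5.0)]
S = [(1.0, 0.0), (0.0, 2.0), (4.0, 4.0), (9.0, 9.0)]
result = knn_join(R, S, k=2)
print(result[0])  # [(1.0, 0.0), (0.0, 2.0)], the two nearest neighbors of (0, 0)
```

Note the asymmetry: the join returns neighbors from S for every object of R, so swapping R and S generally yields a different result.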

Because data volumes are often too large to be processed on a single machine (especially for high-dimensional data), we focus on the distributed MapReduce environment [9] running on Hadoop and Spark. MapReduce is a widely adopted framework and is considered an efficient and scalable solution for distributed big data processing. MapReduce programs are designed to run on large clusters of commodity hardware and employ a programming paradigm similar to the divide-and-conquer approach. Datasets are loaded, split, and pre-processed in the map phase, and the main execution and evaluation of an algorithm are performed in parallel on smaller data fractions in the reduce phase.
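
The map/shuffle/reduce flow described above can be modeled in a few lines of single-process Python. This is a conceptual sketch of the paradigm only, not of Hadoop or Spark internals; all names are illustrative.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Single-process model of the MapReduce flow: map each record to
    (key, value) pairs, shuffle (group by key), then reduce each group."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: values grouped by key
    return {key: reducer(key, values)      # reduce phase, one call per group
            for key, values in groups.items()}

# Toy job: bucket the numbers 0..5 by parity and sum each bucket.
out = map_reduce(range(6),
                 mapper=lambda x: [(x % 2, x)],
                 reducer=lambda key, values: sum(values))
print(out)  # {0: 6, 1: 9}
```

In the join algorithms discussed below, the mapper assigns objects to data partitions and the reducer evaluates a local k-NN join within each partition.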

In this paper, we study approximate k-NN similarity join algorithms that can provide a significant speedup compared to the exact similarity join while still preserving high result precision. In many domains, the difference between the exact result and a slightly different set of k nearest neighbors is acceptable. This is particularly the case in scenarios where computing the exact similarity join over big high-dimensional data would require prohibitively long execution times.

Our study focuses on similarity joins for MapReduce environments based on the metric space approach [10]. This approach provides a universal framework for the efficient processing of various similarity models. For evaluations on vector data, we also revisited and extended two previously proposed k-NN similarity join approaches designed for vector spaces. In this paper, we focus on algorithms employing randomly initialized data organizations and replication strategies, as these techniques can be conveniently applied to Big Data in different domains. Although a study tackling related similarity joins had previously been published for Hadoop [11], it focused on low-dimensional data. The subsequent journal paper [12] tested data with up to 386 dimensions and highlighted the limitations of most k-NN join methods on such a high-dimensional dataset. The need for effective and efficient k-NN similarity joins for high-dimensional data led us to (1) design distributed similarity join techniques with thresholds or approximation guarantees, (2) revise available MapReduce algorithms, integrating extensions to handle high-dimensional data more efficiently, (3) consider the implementation of such algorithms on a different platform, Spark (in addition to Hadoop), and (4) experimentally evaluate and compare the performance of the different approaches.

This paper extends a short conference paper that compared our heuristic method with two previously proposed approaches on Hadoop [13] and follows the paper proposing the pivot-based heuristic k-NN join method [7]. It significantly extends the previous papers by introducing a new MapReduce-based method that supports an ϵ-guaranteed approximation, i.e., an approximate version of the k-NN join where the distance from each query point to its farthest returned neighbor is bounded in terms of a parameter ϵ and the distance to the farthest neighbor in the exact solution. Furthermore, this paper includes implementation guidelines for Spark and a thorough, and mostly new, set of experimental results.
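
Assuming the usual multiplicative form of such a bound (the farthest approximate neighbor lies within (1 + ϵ) times the exact farthest-neighbor distance), the guarantee can be stated as a simple per-query check. The helper below is an illustrative sketch under that assumption, not code from the published implementation.

```python
import math

def satisfies_eps_guarantee(query, approx_knn, exact_knn, eps):
    """Check the epsilon-guarantee for one query point: the distance to the
    farthest approximate neighbor must not exceed (1 + eps) times the
    distance to the farthest neighbor in the exact k-NN answer."""
    d_approx = max(math.dist(query, s) for s in approx_knn)
    d_exact = max(math.dist(query, s) for s in exact_knn)
    return d_approx <= (1 + eps) * d_exact

q = (0.0, 0.0)
exact = [(1.0, 0.0), (0.0, 2.0)]    # true 2-NN, farthest at distance 2.0
approx = [(1.0, 0.0), (0.0, 2.2)]   # approximate 2-NN, farthest at distance 2.2
print(satisfies_eps_guarantee(q, approx, exact, eps=0.15))  # True: 2.2 <= 2.3
print(satisfies_eps_guarantee(q, approx, exact, eps=0.05))  # False: 2.2 > 2.1
```

Setting ϵ = 0 reduces the check to the exact k-NN join condition.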

The overall contribution of our work can be summarized into four points:

  • Extensions of previously proposed k-NN similarity join algorithms on MapReduce to process big high-dimensional data more efficiently.

  • The introduction of pivot-based k-NN similarity join heuristic approaches on MapReduce that support approximation-related thresholds and guarantees. We analyze an approach that provides the ϵ-guarantee (which constrains the distance from each query point to its farthest neighbor returned in the k-NN join). We include a discussion of the theoretical foundations that support the proposed methods.

  • Spark and Hadoop implementation guidelines for the proposed MapReduce join methods. We point out the limitations of the different platforms and show why Spark provides faster execution times. We also provide the source code of the Spark implementations of all the evaluated methods, including our new implementations of baseline related approaches based on space-filling curves (Z-curve) and locality-sensitive hashing.

  • An extensive performance evaluation on large datasets of different dimensionalities (from 10 to 1000 dimensions) running on fully distributed Amazon clusters, with most experiments evaluated on the Spark platform processing up to tens of millions of objects. This analysis provides guidance for selecting an appropriate distributed k-NN join algorithm based on workload and approximation precision requirements.

The remainder of the paper is structured as follows. Section 2 presents basic formal definitions and common terms. Section 3 gives an overview of similarity join problems, two related methods, and several proposed extensions of these methods. Section 4 presents several exact and approximate pivot-based k-NN similarity join algorithms on MapReduce and provides their implementation guidelines. Section 5 presents the performance evaluation of all the implemented algorithms and discusses the results. Section 6 concludes the paper.

Section snippets

Preliminaries

The fundamental concepts and basic definitions related to approximate k-NN similarity joins are summarized in the following subsections, considering the standard notations [10], [12].

Related work on similarity joins

Many different types of similarity joins have been defined and studied over recent years. Specifically, previous work in this area studied k-distance joins [15] (returning the k closest pairs between two datasets), range query joins [16], [17] (returning all pairs with a distance equal to or smaller than a given threshold), and k-NN similarity joins (for each record of the first dataset, returning the k closest records in the second dataset) [18], [19]. Some join techniques focus just on …

Pivot-based k-NN similarity joins on MapReduce

Pivot-based methods represent a useful generic approach with convenient random initialization that nevertheless reflects the data distribution by dividing a metric space into partitions centered around global objects (pivots) selected from the dataset. The benefits of pivot-based methods for k-NN similarity joins on MapReduce have been investigated in the work of Lu et al. [19]. The authors describe how mappers cluster objects into groups and reducers perform the k-NN join on each group of …
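
The map-side grouping can be sketched as follows. This single-process Python fragment only illustrates the Voronoi-style assignment of objects to their nearest randomly chosen pivots; the function and variable names are illustrative, and the distributed algorithms of Lu et al. additionally replicate objects across groups so that every true nearest neighbor can be found.

```python
import math
import random

def pivot_partition(points, num_pivots, seed=0):
    """Map-side grouping sketch: pick pivots at random from the data and
    assign every point to its nearest pivot (a Voronoi-style partitioning)."""
    rng = random.Random(seed)
    pivots = rng.sample(points, num_pivots)
    groups = {i: [] for i in range(num_pivots)}
    for p in points:
        nearest = min(range(num_pivots), key=lambda i: math.dist(p, pivots[i]))
        groups[nearest].append(p)
    return pivots, groups

points = [(float(x), float(y)) for x in range(4) for y in range(4)]
pivots, groups = pivot_partition(points, num_pivots=2)
print(sum(len(g) for g in groups.values()))  # 16: every point lands in exactly one group
```

A reducer would then run a local k-NN join inside each group, which is why the replication step is essential for correctness: without it, a query object's true neighbors may fall into a different partition.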

Experimental evaluation

In this section, the presented MapReduce k-NN similarity join algorithms are experimentally evaluated and compared. The experiments focus on the scalability, precision, and overall execution time of all solutions for high-dimensional data. First, we describe the test datasets and the evaluation platform; then we compare selected methods on two MapReduce frameworks, where we present the benefits of Spark. For Spark, we investigate parameters for all the presented methods and, finally, we compare …

Conclusions

In this paper, we focused on approximate k-NN similarity joins in the MapReduce environment, implemented mainly in Spark. We studied the approximation quality and guarantees of pivot-based methods from theoretical and experimental perspectives and presented two different pivot-based approximate k-NN similarity join algorithms. We also compared these methods with other heuristic algorithms (based on Z-curves and LSH) reimplemented in Spark for high-dimensional data. According to our findings, data …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This project was supported by the Charles University in Prague grant GAUK 201515, the Czech Science Foundation (GAČR) project No. 17-22224S, and partially by Charles University grant SVV-260451.

References (55)

  • Muja, M., et al., Scalable nearest neighbor algorithms for high dimensional data, IEEE Trans. Pattern Anal. Mach. Intell. (2014)

  • Cech, P., et al., Feature extraction and malware detection on large HTTPS data using MapReduce

  • Lokoc, J., et al., k-NN classification of malware in HTTPS traffic using the metric space approach

  • Dean, J., et al., MapReduce: Simplified data processing on large clusters, Commun. ACM (2008)

  • Zezula, P., et al., Similarity Search: The Metric Space Approach, Advances in Database Systems (2006)

  • Song, G., et al., Solutions for processing k nearest neighbor joins for massive data on MapReduce

  • Song, G., et al., K nearest neighbour joins for big data on MapReduce: A theoretical and experimental analysis, IEEE Trans. Knowl. Data Eng. (2016)

  • Cech, P., et al., Comparing MapReduce-based k-NN similarity joins on Hadoop for high-dimensional data

  • Patella, M., et al., The many facets of approximate similarity search

  • Hjaltason, G.R., et al., Incremental distance join algorithms for spatial databases

  • Silva, Y.N., et al., Exploiting MapReduce-based similarity joins

  • Ma, Y., et al., Parallel similarity joins on massive high-dimensional data using MapReduce, Concurr. Comput.: Pract. Exper. (2016)

  • Böhm, C., et al., The k-nearest neighbour join: Turbo charging the KDD process, Knowl. Inf. Syst. (2004)

  • Lu, W., et al., Efficient processing of k nearest neighbor joins using MapReduce, Proc. VLDB Endow. (2012)

  • Vernica, R., et al., Efficient parallel set-similarity joins using MapReduce

  • Rong, C., et al., Fast and scalable distributed set similarity joins for big data analytics

  • Xiao, C., et al., Ed-Join: An efficient algorithm for similarity joins with edit distance constraints, PVLDB (2008)

This paper is an extended version of previous papers by Cech et al. (2017, 2016).
