Elsevier

Information Systems

Volume 87, January 2020, 101410

Pivot-based approximate k-NN similarity joins for big high-dimensional data

https://doi.org/10.1016/j.is.2019.06.006

Highlights

  • Study of approximate k-NN similarity joins for big high-dimensional data.

  • Pivot-based k-NN join methods supporting various levels of approximation guarantee.

  • Implementation and algorithm extensions with publicly available source code.

  • Comprehensive experiments using high-dimensional data and popular Big Data systems.

Abstract

Given an appropriate similarity model, the k-nearest neighbor (k-NN) similarity join represents a useful yet costly operator for data mining, data analysis, and data exploration applications. The time needed to evaluate the operator depends on the dataset sizes, the data distribution, and the dimensionality of the data representations. For vast volumes of high-dimensional data, only distributed and approximate approaches make the join practically feasible. In this paper, we investigate and evaluate the performance of multiple MapReduce-based approximate k-NN similarity join approaches on two leading Big Data systems: Apache Hadoop and Apache Spark. Focusing on the metric space approach, which relies on reference dataset objects (pivots), this paper investigates distributed similarity join techniques with and without approximation guarantees and also proposes high-dimensional extensions to previously proposed algorithms. The paper describes the design guidelines, algorithmic details, and key theoretical underpinnings of the compared approaches and presents an empirical evaluation of the performance, approximation precision, and scalability of the implemented algorithms. Moreover, the Spark source code of all these algorithms has been made publicly available. The key findings of the experimental analysis are that randomly initialized pivot-based methods perform well with big high-dimensional data and that, in general, the selection of the best algorithm depends on the desired levels of approximation guarantee, precision, and execution time.

Introduction

The k-nearest neighbor (k-NN) similarity join is an asymmetric operation that returns, for each query object in a dataset R, the k most similar objects in a dataset S. In recent years, the study of k-NN joins has attracted considerable attention due to their applicability in various domains. In the data mining and machine learning context, k-NN joins can be employed as a preprocessing step for classification or cluster analysis. In data exploration and information retrieval, similarity joins provide a similarity graph that links each object in the database to potentially relevant entities. k-NN similarity join applications can be found, for example, in image and video retrieval [1], [2], [3], [4], spatial databases [5], pattern recognition [6], and network communication analysis and malware detection frameworks [7], [8].
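
To make the operation concrete, the following minimal Python sketch evaluates a brute-force k-NN join under the Euclidean distance. It is an illustrative baseline only; the function name and toy data are not part of the evaluated implementations, which avoid exactly this quadratic distance computation.

```python
import math

def knn_join(R, S, k):
    """For each query point in R, return the k nearest points in S
    under the Euclidean distance (a naive O(|R| * |S| log |S|) baseline)."""
    return {i: sorted(S, key=lambda s: math.dist(q, s))[:k]
            for i, q in enumerate(R)}

R = [(0.0, 0.0), (5.0, 5.0)]
S = [(1.0, 0.0), (0.0, 2.0), (4.0, 4.0), (9.0, 9.0)]
result = knn_join(R, S, k=2)
print(result[0])  # [(1.0, 0.0), (0.0, 2.0)], the two nearest neighbors of (0, 0)
```

Note the asymmetry: the join returns neighbors from S for every object of R, so swapping R and S generally yields a different result.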

Because data volumes are often too large to be processed on a single machine (especially for high-dimensional data), we focus on the distributed MapReduce environment [9] running on Hadoop and Spark. MapReduce is a widely adopted framework and is considered an efficient and scalable solution for distributed big data processing. MapReduce programs are designed to run on large clusters of commodity hardware and employ a programming paradigm similar to the divide-and-conquer approach. Datasets are loaded, split, and pre-processed in the map phase, and the main execution and evaluation of an algorithm are performed in parallel on smaller data fractions in the reduce phase.
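
The map/shuffle/reduce flow described above can be modeled in a few lines of single-process Python. This is a conceptual sketch of the paradigm only, not of Hadoop or Spark internals; all names are illustrative.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Single-process model of the MapReduce flow: map each record to
    (key, value) pairs, shuffle (group by key), then reduce each group."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: values grouped by key
    return {key: reducer(key, values)      # reduce phase, one call per group
            for key, values in groups.items()}

# Toy job: bucket the numbers 0..5 by parity and sum each bucket.
out = map_reduce(range(6),
                 mapper=lambda x: [(x % 2, x)],
                 reducer=lambda key, values: sum(values))
print(out)  # {0: 6, 1: 9}
```

In the join algorithms discussed below, the mapper assigns objects to data partitions and the reducer evaluates a local k-NN join within each partition.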

In this paper, we study approximate k-NN similarity join algorithms that can provide a significant speedup compared to the exact similarity join while still preserving high result precision. In many domains, the difference between the exact result and a slightly different set of k nearest neighbors is acceptable. This is particularly the case in scenarios where computing the exact similarity join over big high-dimensional data would require prohibitively long execution times.

Our study focuses on similarity joins for MapReduce environments based on the metric space approach [10]. This approach provides a universal framework for the efficient processing of various similarity models. For evaluations on vector data, we also revisited and extended two previously proposed k-NN similarity join approaches designed for vector spaces. In this paper, we focus on algorithms employing randomly initialized data organizations and replication strategies, as these techniques can be conveniently applied to Big Data in different domains. Although a study tackling related similarity joins had previously been published for Hadoop [11], it focused on low-dimensional data. The subsequent journal paper [12] tested data with up to 386 dimensions and highlighted the limitations of most k-NN join methods on such a high-dimensional dataset. The need for effective and efficient k-NN similarity joins for high-dimensional data led us to (1) design distributed similarity join techniques with thresholds or approximation guarantees, (2) revise available MapReduce algorithms, integrating extensions to handle high-dimensional data more efficiently, (3) consider the implementation of such algorithms on a different platform, Spark (in addition to Hadoop), and (4) experimentally evaluate and compare the performance of the different approaches.

This paper extends a short conference paper that compared our heuristic method with two previously proposed approaches on Hadoop [13] and follows the paper proposing the pivot-based heuristic k-NN join method [7]. It significantly extends the previous papers by introducing a new MapReduce-based method that supports an ϵ-guaranteed approximation, i.e., an approximate version of the k-NN join where the distance from each query point to its farthest returned neighbor is bounded in terms of a parameter ϵ and the distance to the farthest neighbor in the exact solution. Furthermore, this paper includes implementation guidelines for Spark and a thorough, and mostly new, set of experimental results.
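
Assuming the usual multiplicative form of such a bound (the farthest approximate neighbor lies within (1 + ϵ) times the exact farthest-neighbor distance), the guarantee can be stated as a simple per-query check. The helper below is an illustrative sketch under that assumption, not code from the published implementation.

```python
import math

def satisfies_eps_guarantee(query, approx_knn, exact_knn, eps):
    """Check the epsilon-guarantee for one query point: the distance to the
    farthest approximate neighbor must not exceed (1 + eps) times the
    distance to the farthest neighbor in the exact k-NN answer."""
    d_approx = max(math.dist(query, s) for s in approx_knn)
    d_exact = max(math.dist(query, s) for s in exact_knn)
    return d_approx <= (1 + eps) * d_exact

q = (0.0, 0.0)
exact = [(1.0, 0.0), (0.0, 2.0)]    # true 2-NN, farthest at distance 2.0
approx = [(1.0, 0.0), (0.0, 2.2)]   # approximate 2-NN, farthest at distance 2.2
print(satisfies_eps_guarantee(q, approx, exact, eps=0.15))  # True: 2.2 <= 2.3
print(satisfies_eps_guarantee(q, approx, exact, eps=0.05))  # False: 2.2 > 2.1
```

Setting ϵ = 0 reduces the check to the exact k-NN join condition.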

The overall contribution of our work can be summarized into four points:

  • Extensions of previously proposed k-NN similarity join algorithms on MapReduce to process big high-dimensional data more efficiently.

  • The introduction of pivot-based k-NN similarity join heuristic approaches on MapReduce that support approximation-related thresholds and guarantees. We analyze an approach that provides the ϵ-guarantee (which constrains the distance from each query point to its farthest neighbor returned in the k-NN join). We include a discussion of the theoretical foundations that support the proposed methods.

  • Spark and Hadoop implementation guidelines for the proposed MapReduce join methods. We point out the limitations of the different platforms and show why Spark provides faster execution times. We also provide the source code of the Spark implementations of all the evaluated methods, including our new implementations of baseline related approaches based on space-filling curves (Z-curve) and locality-sensitive hashing.

  • An extensive performance evaluation on large datasets of different dimensionalities (from 10 to 1000 dimensions) running on fully distributed Amazon clusters, with most experiments evaluated on the Spark platform processing up to tens of millions of objects. This analysis provides guidance for selecting an appropriate distributed k-NN join algorithm based on workload and approximation precision requirements.

The remainder of the paper is structured as follows. Section 2 presents basic formal definitions and common terms. Section 3 gives an overview of similarity join problems, two related methods, and several proposed extensions of these methods. Section 4 presents several exact and approximate pivot-based k-NN similarity join algorithms on MapReduce and provides their implementation guidelines. Section 5 presents the performance evaluation of all the implemented algorithms and discusses the results. Section 6 concludes the paper.

Section snippets

Preliminaries

The fundamental concepts and basic definitions related to approximate k-NN similarity joins are summarized in the following subsections, considering the standard notations [10], [12].

Related work on similarity joins

Many different types of similarity joins have been defined and studied over recent years. Specifically, previous work in this area studied k-distance joins [15] (returning the k closest pairs between two datasets), range query joins [16], [17] (returning all pairs with a distance equal to or smaller than a given threshold), and k-NN similarity joins (for each record of the first dataset, returning the k closest records in the second dataset) [18], [19]. Some join techniques focus just on …

Pivot-based k-NN similarity joins on MapReduce

Pivot-based methods represent a useful generic approach with convenient random initialization that nevertheless reflects the data distribution by dividing a metric space into partitions centered around global objects (pivots) selected from the dataset. The benefits of pivot-based methods for k-NN similarity joins on MapReduce have been investigated in the work of Lu et al. [19]. The authors describe how mappers cluster objects into groups and reducers perform the k-NN join on each group of …
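
The map-side grouping can be sketched as follows. This single-process Python fragment only illustrates the Voronoi-style assignment of objects to their nearest randomly chosen pivots; the function and variable names are illustrative, and the distributed algorithms of Lu et al. additionally replicate objects across groups so that every true nearest neighbor can be found.

```python
import math
import random

def pivot_partition(points, num_pivots, seed=0):
    """Map-side grouping sketch: pick pivots at random from the data and
    assign every point to its nearest pivot (a Voronoi-style partitioning)."""
    rng = random.Random(seed)
    pivots = rng.sample(points, num_pivots)
    groups = {i: [] for i in range(num_pivots)}
    for p in points:
        nearest = min(range(num_pivots), key=lambda i: math.dist(p, pivots[i]))
        groups[nearest].append(p)
    return pivots, groups

points = [(float(x), float(y)) for x in range(4) for y in range(4)]
pivots, groups = pivot_partition(points, num_pivots=2)
print(sum(len(g) for g in groups.values()))  # 16: every point lands in exactly one group
```

A reducer would then run a local k-NN join inside each group, which is why the replication step is essential for correctness: without it, a query object's true neighbors may fall into a different partition.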

Experimental evaluation

In this section, the presented MapReduce k-NN similarity join algorithms are experimentally evaluated and compared. The experiments focus on the scalability, precision, and overall execution time of all solutions for high-dimensional data. First, we describe the test datasets and the evaluation platform; then we compare selected methods on two MapReduce frameworks, where we present the benefits of Spark. For Spark, we investigate parameters for all the presented methods and, finally, we compare …

Conclusions

In this paper, we focused on approximate k-NN similarity joins in the MapReduce environment, implemented mainly in Spark. We studied the approximation quality and guarantees of pivot-based methods from theoretical and experimental perspectives and presented two different pivot-based approximate k-NN similarity join algorithms. We also compared these methods with other heuristic algorithms (based on Z-curves and LSH) reimplemented in Spark for high-dimensional data. According to our findings, data …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This project was supported by the Charles University in Prague grant GAUK 201515, the Czech Science Foundation (GAČR) project No. 17-22224S, and partially by Charles University grant SVV-260451.

References (55)

  • Muja, M., et al., Scalable nearest neighbor algorithms for high dimensional data, IEEE Trans. Pattern Anal. Mach. Intell. (2014)

  • Cech, P., et al., Feature extraction and malware detection on large HTTPS data using MapReduce

  • Lokoc, J., et al., k-NN classification of malware in HTTPS traffic using the metric space approach

  • Dean, J., et al., MapReduce: Simplified data processing on large clusters, Commun. ACM (2008)

  • Zezula, P., et al., Similarity Search: The Metric Space Approach, Advances in Database Systems (2006)

  • Song, G., et al., Solutions for processing k nearest neighbor joins for massive data on MapReduce

  • Song, G., et al., K nearest neighbour joins for big data on MapReduce: A theoretical and experimental analysis, IEEE Trans. Knowl. Data Eng. (2016)

  • Cech, P., et al., Comparing MapReduce-based k-NN similarity joins on Hadoop for high-dimensional data

  • Patella, M., et al., The many facets of approximate similarity search

  • Hjaltason, G.R., et al., Incremental distance join algorithms for spatial databases

  • Silva, Y.N., et al., Exploiting MapReduce-based similarity joins

  • Ma, Y., et al., Parallel similarity joins on massive high-dimensional data using MapReduce, Concurr. Comput.: Pract. Exper. (2016)

  • Böhm, C., et al., The k-nearest neighbour join: Turbo charging the KDD process, Knowl. Inf. Syst. (2004)

  • Lu, W., et al., Efficient processing of k nearest neighbor joins using MapReduce, Proc. VLDB Endow. (2012)

  • Vernica, R., et al., Efficient parallel set-similarity joins using MapReduce

  • Rong, C., et al., Fast and scalable distributed set similarity joins for big data analytics

  • Xiao, C., et al., Ed-Join: An efficient algorithm for similarity joins with edit distance constraints, PVLDB (2008)

This paper is an extended version of previous papers by Cech et al. (2017, 2016).
