Solving multiple kth smallest dissimilarity queries for non-metric dissimilarities with the GPU

doi:10.1016/j.ins.2016.03.054

Information Sciences

Volumes 361–362, 20 September 2016, Pages 66-83

Abstract

The kth smallest dissimilarity of a query point with respect to a given set is the dissimilarity that ranks number k when the dissimilarity values between the points of the set and the query point are sorted in increasing order. A multiple kth smallest dissimilarity query determines the kth smallest dissimilarity for several query points simultaneously. Although solving multiple kth smallest dissimilarity queries is an important primitive operation used in many areas, such as spatial data analysis, facility location, text classification and content-based image retrieval, it has not previously been addressed explicitly in the literature. In this paper we present three parallel strategies, designed to run on a Graphics Processing Unit, for computing multiple kth smallest dissimilarity queries when non-metric dissimilarities, which do not satisfy the triangle inequality, are used. The strategies are analyzed theoretically and experimentally, and are compared with one another and with an efficient sequential strategy for solving the problem.

Introduction

A dissimilarity (similarity) is a function that assigns to every pair of objects a quantity that measures how different (alike) the objects are. Dissimilarity and similarity are dual concepts: a small dissimilarity and a large similarity both imply a close resemblance between objects [9], [27], [31], [37], [39], [40]. To keep the paper readable, we concentrate on dissimilarities. Dissimilarities that satisfy the properties of a metric (non-negativity, reflexivity, symmetry and the triangle inequality) are also called distances and are widely used. However, metric dissimilarities may not be suitable in some applications, and many non-metric dissimilarity functions have recently appeared in specific domains. Natural notions of dissimilarity that are not constrained to fulfill specific properties are used because they can model more complex dissimilarities better. In [37], Skopal and Bustos survey several types of non-metric dissimilarities, and [9] presents a comparative study of the effect of applying different dissimilarity functions. Among the most widely used non-metric dissimilarities are the fractional L_p dissimilarity, the multiplicative weighted Euclidean distance, the cosine dissimilarity, the Kullback–Leibler divergence, the Jeffrey divergence, the χ² function and the Itakura–Saito divergence.
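To make the failure of the triangle inequality concrete, the following minimal Python sketch checks it for two of the dissimilarities listed above. The formulas used (cosine dissimilarity as one minus the cosine of the angle, and the fractional L_p dissimilarity with p = 0.5) are the usual textbook definitions and are assumed here, since the formal definitions of the paper are not reproduced in this text; the function names are hypothetical.

```python
import math


def cosine_dissimilarity(x, y):
    """Assumed definition: 1 minus the cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def fractional_lp(x, y, p=0.5):
    """Assumed definition: (sum |x_i - y_i|^p)^(1/p) with 0 < p < 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def violates_triangle_inequality(d, a, b, c):
    """True when going from a to c directly costs more than going through b."""
    return d(a, c) > d(a, b) + d(b, c)


# Cosine dissimilarity: d(a, c) = 1, but d(a, b) + d(b, c) is about 0.586.
a, b, c = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)
print(violates_triangle_inequality(cosine_dissimilarity, a, b, c))  # True

# Fractional L_0.5: d(o, u) = 4, but d(o, e) + d(e, u) = 2.
o, e, u = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)
print(violates_triangle_inequality(fractional_lp, o, e, u))  # True
```

In both cases the direct dissimilarity exceeds the sum through an intermediate point, which is exactly the situation that pruning rules based on the triangle inequality cannot handle.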

Given a set S, a point q and a dissimilarity function, the kth smallest dissimilarity of q with respect to S is the dissimilarity that ranks number k when the dissimilarities between q and the points of S are sorted in increasing order. Many applications need to process a large number of kth smallest dissimilarity queries, one for each point of another set Q, against a set of points S. A multiple kth smallest dissimilarity query determines the kth smallest dissimilarity for several query points simultaneously. Solving multiple kth smallest dissimilarity queries, with metric or non-metric dissimilarity measures, is a fundamental task in fields such as spatial data analysis, facility location, text mining and content-based image retrieval, among others. In particular, determining multiple kth smallest dissimilarities is necessary for:

• Solving k-influential region problems [14], k-influence region problems [16] and common k-influence region problems [15], [18] in the facility location field.
• Creating sorted k-similarity-dissimilarity plots [11], [34] in clustering and classification algorithms for spatial data analysis, text mining and content-based image retrieval.
• Answering multiple bichromatic mutual nearest neighbor queries [17] in spatial data analysis and in clustering and classification operations for spatial and text data.

Although finding multiple kth smallest dissimilarities is unavoidable in all these works, none of them provides an in-depth analysis of how the problem can be solved efficiently.

Solving multiple kth smallest dissimilarity queries for non-metric dissimilarities is challenging, mainly because non-metric dissimilarities lack the properties that make the problem easier to solve. When metric dissimilarities are considered, several index structures based on a regular space partition can be used to filter out irrelevant points during the process, which avoids a sequential scan. However, the properties that make these structures useful rely heavily on the triangle inequality, which does not always hold for non-metric dissimilarity functions. Consequently, indexing approaches cannot be applied to solve multiple kth smallest dissimilarity queries for dissimilarities that do not satisfy the triangle inequality. Moreover, in high-dimensional spaces, even when the dissimilarity is a metric, indexing approaches have a threshold beyond which the empty space and measure concentration phenomena appear. This is commonly known as the curse of dimensionality. Experiments show that, as the dimensionality increases, these approaches degrade to a sequential scan [39]. The simplest implementation for solving a multiple kth smallest dissimilarity query, the brute force approach, is a sequential scan over the entire set for each query point: the query point is compared with every point in the set, and the resulting dissimilarities are sorted to evaluate the query. Brute force scans can be parallelized by computing the kth smallest dissimilarity of each query point independently.
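The brute force scan described above can be sketched sequentially in a few lines of Python. This is only an illustration of the definition, not the sequential algorithm evaluated in the paper; the function names are hypothetical, and the squared Euclidean dissimilarity is chosen here merely as a simple example of a dissimilarity that does not satisfy the triangle inequality.

```python
def squared_euclidean(x, y):
    """Illustrative non-metric dissimilarity (an assumption of this sketch)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kth_smallest_dissimilarity(q, S, k, d):
    """Scan all of S, sort the dissimilarities to q, and return the kth one."""
    if not 1 <= k <= len(S):
        raise ValueError("k must be between 1 and the number of points in S")
    values = sorted(d(q, s) for s in S)
    return values[k - 1]  # k is 1-based, list indices are 0-based


def multiple_kth_smallest(Q, S, k, d):
    """Answer one query per point of Q; each query is independent of the others."""
    return [kth_smallest_dissimilarity(q, S, k, d) for q in Q]


S = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0)]
Q = [(0.0, 0.0), (5.0, 0.0)]
print(multiple_kth_smallest(Q, S, 2, squared_euclidean))  # [1.0, 4.0]
```

Because no iteration of the loop over Q depends on another, each query can be assigned to its own worker, which is the property that makes the brute force approach attractive for parallel hardware.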

Graphics Processing Units (GPUs), due to their programmability and high computational rates, are nowadays a compelling platform for handling problems that can be parallelized. They are appropriate for computationally demanding tasks in which a large amount of data needs to be processed, provided that the data can be processed in parallel. The parallel processing capability of the GPU makes it possible to divide complex computing tasks into thousands of smaller tasks that can be run concurrently. This ability enables researchers to address many challenging computational problems faster than with conventional CPUs. GPUs have quickly become an industry standard that powers millions of desktops, notebooks, workstations and supercomputers around the world. Nowadays, one can take advantage of a GPU without a large financial outlay because, even though high-end GPUs are expensive, many affordable and efficient GPUs also exist. General-Purpose computing on the GPU (GPGPU), used to drastically decrease execution times, is capturing the attention of researchers in many computational fields, ranging from numeric computing operations and physical simulations to text clustering and facility location problems [5], [13], [16], [17], [18], [38], [46].

The problem of efficiently solving multiple kth smallest dissimilarity queries for metric and non-metric dissimilarities has not been previously addressed explicitly. However, it is closely related to the k-nearest neighbor query which has been widely studied. Given a set S, a point q and a dissimilarity function, a k-nearest neighbor query finds the points of S whose dissimilarity with respect to q is not greater than the kth smallest dissimilarity of q. The task of determining the k-nearest neighbors for several query points defining Q is known as a multiple k-nearest neighbor query, k-nearest neighbor join or all k-nearest neighbor query.
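The relation between the two queries can be illustrated with a short Python sketch that follows the definition given above: the k-nearest neighbors of q are the points of S whose dissimilarity does not exceed the kth smallest dissimilarity of q. The function names and the squared Euclidean dissimilarity are assumptions made for this example only.

```python
def squared_euclidean(x, y):
    """Illustrative dissimilarity (an assumption of this sketch)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kth_smallest_dissimilarity(q, S, k, d):
    """Brute force: sort all dissimilarities to q and take the kth (1-based)."""
    return sorted(d(q, s) for s in S)[k - 1]


def k_nearest_neighbors(q, S, k, d):
    """Points of S whose dissimilarity to q is at most the kth smallest one."""
    threshold = kth_smallest_dissimilarity(q, S, k, d)
    return [s for s in S if d(q, s) <= threshold]


S = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (6.0, 0.0)]

# Two points tie at dissimilarity 1.0 from q, so the 1-nearest neighbor
# query returns both of them.
print(k_nearest_neighbors((2.0, 0.0), S, 1, squared_euclidean))
# [(1.0, 0.0), (3.0, 0.0)]
```

Under this definition a query may return more than k points when several points tie at the kth smallest dissimilarity, whereas the kth smallest dissimilarity itself is always a single value.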

Most of the existing CPU algorithms for efficiently searching k-nearest neighbors with metric dissimilarities use index structures [1]. Generally, these algorithms build space partitions with tree-type structures and are based on either depth-first [7], [32], [33] or best-first [22] traversal paradigms. Skopal and Bustos [37] discuss several types of non-metric access methods for solving k-nearest neighbor queries. Several scan-based and index-based methods for efficiently solving all k-nearest neighbor queries have been proposed. In a scan-based solution, neither the data set nor the query set is organized in an index structure, and the complete sets have to be scanned. Index-based algorithms assume that at least the data set is organized in an index structure. Index-based algorithms for computing all k-nearest neighbor queries are provided in [2], [8], [12], [35], [42], [43], [44]. Algorithms for solving all k-nearest neighbor queries without index structures are found in [40], [45].

Basic brute force k-nearest neighbor search on the GPU is much faster than on the CPU and even compares favorably with CPU implementations that use index structures [6]. Many works use GPUs to accelerate the brute force k-nearest neighbor search. Some of them, dealing with metric dissimilarities, use index structures [3], [4], [6], [25]. Others, considering both metric and non-metric dissimilarities, either use sorting algorithms with some modification or customization [21], [23], [24], [26], or only maintain the k-nearest neighbors during the search [28], [36]. In [28], the all k-nearest neighbors problem for metric and non-metric dissimilarities is solved approximately on the GPU by partitioning an initial feature data set into several clusters.

Several reasons motivated us to design an efficient approach for solving multiple kth smallest dissimilarity queries with the GPU: the increasing use, in many fields, of multiple kth smallest dissimilarity queries for non-metric dissimilarities that do not satisfy the triangle inequality; the impossibility of using index structures to solve these queries; the curse of dimensionality; and the high degree of parallelization of the brute force algorithm for solving the problem.

Thus, in this work, we tackle the problem of solving many kth smallest dissimilarity queries in a d-dimensional space with respect to a set of points, considering non-metric dissimilarities that can be computed in O(d) time. We present three different GPU-parallel strategies that exactly solve multiple kth smallest dissimilarity queries when non-metric dissimilarities are used. ‘Strategy 1’ uses dissimilarity matrices, which have been widely used in the literature to solve k-nearest neighbor queries. ‘Strategy 2’ and ‘Strategy 3’ obtain the kth smallest dissimilarity value while the dissimilarity values are being computed, without using dissimilarity matrices. These last two strategies differ in the kind of memory used to obtain the kth smallest dissimilarity value. The presented strategies are analyzed theoretically and experimentally, and are compared with one another and with the most efficient sequential CPU approach to the problem. Two different non-metric dissimilarities are considered in the experimental results section: (i) the multiplicative weighted Euclidean distance, which is used in the facility location field; and (ii) the cosine dissimilarity, which is used to cluster text documents. Experimental results show that ‘Strategy 2’ is the best with the multiplicative weighted Euclidean distance, while ‘Strategy 1’ performs better with the cosine dissimilarity.
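The text available here does not detail how ‘Strategy 2’ and ‘Strategy 3’ select the kth value, so the following is only a sequential Python sketch of the general idea of selecting the kth smallest dissimilarity while the dissimilarities are being computed, without storing a full dissimilarity matrix. The bounded max-heap, the function names and the cosine dissimilarity formula are assumptions of this sketch and are not taken from the paper's CUDA kernels.

```python
import heapq
import math


def cosine_dissimilarity(x, y):
    """Assumed definition: 1 minus the cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def kth_smallest_streaming(q, S, k, d):
    """Keep only the k smallest dissimilarities seen so far while scanning S."""
    if k < 1:
        raise ValueError("k must be at least 1")
    heap = []  # max-heap emulated by storing negated values
    for s in S:
        value = d(q, s)
        if len(heap) < k:
            heapq.heappush(heap, -value)
        elif value < -heap[0]:
            # The new value beats the current kth smallest: replace it.
            heapq.heapreplace(heap, -value)
    if len(heap) < k:
        raise ValueError("k exceeds the number of points in S")
    return -heap[0]  # largest of the k smallest values, i.e. the kth smallest


S = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (-1.0, 1.0), (2.0, 1.0)]
q = (1.0, 0.2)
reference = sorted(cosine_dissimilarity(q, s) for s in S)
print(kth_smallest_streaming(q, S, 3, cosine_dissimilarity) == reference[2])  # True
```

The sketch needs O(k) working memory per query instead of one matrix row of |S| values, which is the kind of trade-off that separates matrix-based from matrix-free approaches.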

The remainder of this paper is organized as follows. In Section 2, the main definitions and terminology used in the paper are presented. In Sections 3.1–3.4, we present the three parallel strategies for solving multiple kth smallest dissimilarity queries. They are compared with one another and with a sequential algorithm, theoretically in Section 4 and experimentally in Section 5. Finally, Section 6 provides conclusions and further comments.

Section snippets

Background

In this section we define the notion of dissimilarity, present several non-metric dissimilarities and define formally the multiple kth smallest dissimilarity queries.

CUDA based algorithms

In this section we present three strategies to solve the multiple kth smallest dissimilarity queries problem, which is the focus of this paper. The presented algorithms run in parallel and are specifically designed to work with CUDA.

Theoretical comparison

In this section, we present the theoretical comparison among the different parallel strategies that we have provided. We also compare them with a sequential algorithm to analyze their parallel speedup.

Experimental results

In this section, we present the experimental results obtained from the implementation of our algorithms using two different dissimilarity measures applied to elements of spaces of very different dimensionality. We use the multiplicative weighted Euclidean distance in spaces of small dimension (from 2 to 10) and the cosine dissimilarity in spaces of high dimension (the vectors used have an average length of 935.4); more details are given in Section 5.2.2. See Table 1 for the definitions of

Conclusions and further comments

We have proposed and analyzed, theoretically and experimentally, three exact GPU-parallel strategies specifically designed to solve the multiple kth smallest dissimilarity problem with measures that do not satisfy the triangle inequality. They have been compared with one another and with an efficient sequential CPU algorithm for solving the problem. We have provided several experimental results considering the multiplicative weighted Euclidean distance, in low-dimensional spaces, and the cosine

Acknowledgments

We thank the reviewers for their comments, which helped us to considerably improve the experimental results section. Work partially supported by the Spanish Ministerio de Economía y Competitividad under Grant TIN2014-52211-C2-2-R. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.

References (46)

R. Cappelli et al., Large-scale fingerprint identification on GPU, Inf. Sci. (2015)
T. Emrich et al., Optimizing all-nearest-neighbor queries with trigonometric pruning, Proceedings of the Twenty-second International Conference on Scientific and Statistical Database Management (SSDBM) (2010)
M. Fort et al., Finding influential location regions based on reverse k-neighbor queries, Knowl. Based Syst. (2013)
M. Fort et al., Solving the k-influence region problem with the GPU, Inf. Sci. (2014)
M. Fort et al., Common influence region problems, Inf. Sci. (2015)
E. Gabrilovich, TechTC – Technion Repository of Text Categorization Datasets (2011) http://techtc.cs.technion.ac.il/...
NVIDIA Cuda Zone, GPU Accelerated Libraries, Thrust (2016) https://developer.nvidia.com/thrust (Accessed...
T. Skopal et al., On nonmetric similarity search problems in complex domains, ACM Comput. Surv. (2011)
B. Yao et al., K-nearest neighbor queries and KNN-joins in large relational databases (almost) for free, Proceedings of the Twenty-sixth International Conference on Data Engineering (ICDE) (2010)
C. Yu et al., Efficient index-based KNN join processing for high-dimensional data, Inf. Softw. Technol. (2007)
J. Zhang et al., All-nearest-neighbors queries in spatial databases, Proceedings of the Sixteenth International Conference on Scientific and Statistical Database Management (SSDBM) (2004)
C. Böhm et al., Searching in high dimensional spaces: Index structures for improving the performance of multimedia databases, ACM Comput. Surv. (2001)
C. Böhm et al., The k-nearest neighbor join: Turbo charging the KDD process, KAIS (2004)
R.J. Barrientos et al., KNN query processing in metric spaces using GPUs, Proceedings of the Seventeenth International European Conference on Parallel and Distributed Computing (Euro-Par’11) (2011)
S. Brown et al., GPU nearest neighbors using a minimal kd-tree, Proceedings of the Second Workshop on Massive Data Algorithmics (MASSIVE) (2010)
L. Cayton, A nearest neighbor data structure for graphics hardware, Proceedings of the First International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (VLDB-ADMS) (2010)
K.L. Cheung et al., Enhanced nearest neighbour search on the R-tree, SIGMOD (1998)
Y. Chen et al., Efficient evaluation of all-nearest-neighbor queries, Proceedings of the Twenty-third International Conference on Data Engineering (ICDE) (2007)
F. Chiclana et al., A statistical comparative study of different similarity measures of consensus in group decision making, Inf. Sci. (2013)
D. Davidov et al., Parameterized generation of labeled datasets for text categorization based on a hierarchical directory, Proceedings of the Twenty-seventh Annual International ACM SIGIR Conference (2004)
L. Duan et al., A local-density based spatial clustering algorithm with noise, Inf. Syst. (2007)
U. Erra et al., Approximate TF-IDF based on topic extraction from massive message stream using the GPU, Inf. Sci. (2015)
M. Fort et al., Common influence region queries, Proceedings of the Tenth International Symposium on Voronoi Diagrams in Science and Engineering (ISVD) (2013)
