GPU-based fast clustering via K-Centres and k-NN mode seeking for geospatial industry applications

doi:10.1016/j.compind.2020.103260

Computers in Industry

Volume 122, November 2020, 103260

https://doi.org/10.1016/j.compind.2020.103260

Abstract

The emerging trends in the data industry, particularly those related to the repeated processing of data streams, are pushing the limits of computer systems and processes. Among them, the near real-time clustering of geospatial location data is paradigmatic due to its scale, requirements and potential industrial applications. To deal with the large and continuous arrival of geospatial data, modern many-core (GPU-based) computer architectures are used for implementing fast and efficient clustering algorithms. This paper proposes GPU implementations of two different clustering algorithms, $K$-Centres and $k$-NN mode seeking, and compares them against their corresponding sequential implementations. Publicly available geospatial datasets have been used to exemplify the performance achieved using GPUs. Our main contribution is providing GPU implementations of the clustering algorithms that are feasible for near real-time problems. Results show speed-ups of up to $19$ and $135$ times, with the largest dataset, for $K$-Centres and $k$-NN mode seeking respectively. Important technical details of the sorting algorithms, required by the GPU implementation of $k$-NN mode seeking, are also highlighted.

Introduction

Data science, the study and application of methods aimed at extracting valuable knowledge from large datasets, has rapidly outgrown its original research environment and permeated nearly all aspects of daily life and the economy. As a result, a new concept defined as Data Industry (Tang, 2016) has emerged. This novel industry faces numerous technological challenges, such as how to cope with the pertinent continuous data streams (Nguyen et al., 2015) and the resulting overwhelming volume of data that pushes the boundaries of computer systems and techniques.

An intensive field within the data industry is the analysis of the massive amounts of data generated by Global Positioning System (GPS) receivers, which are now incorporated in omnipresent consumer devices such as smartphones and tablets and therefore allow the ubiquitous production of valuable geospatial information.

Among the applications that may profit from the analysis of geospatial data, the following can be highlighted: location-based business intelligence, transportation planning, models for emergency and disaster management, optimization of health care resources (Barik et al., 2019), studies for environmental policies, logistics for supply chains, and the efficient delivery of goods in commerce.

Due to the ever-growing nature of data streams, like the geospatial ones, powerful data analytic tools (Flath and Stein, 2018) are required, which must necessarily rely on fast and efficient data analysis algorithms. Data clustering (Wagstaff, 2012), the task of associating data into meaningful groups, is among the most widely applied and often most computationally demanding of such tools and has recently been increasingly used for the analysis of geospatial locations; see (Zhao et al., 2015; Boeing, 2018). The computational cost of clustering depends not only on the size of the incoming datasets and the complexity of the algorithms themselves, but also on the repetitions that are typically required to estimate the appropriate number of clusters. Moreover, frequent updates of the results are needed when the datasets' dynamics change rapidly over time, e.g. the current locations of moving objects on the streets.

Clustering is typically motivated by one or several of the following needs: detecting outliers, selecting prototypes, summarizing data, and discovering hidden homogeneous structures. A plethora of different clustering methods has been proposed, including classical and state-of-the-art ones that, according to their nature, are usually categorized into either partitional or agglomerative hierarchical ones. Yet another dichotomy consists of distinguishing between algorithms that model each cluster by selecting an object that is a member of the original dataset, and the ones that model clusters by using an average of its objects. The former methods are typically applied when valid representative examples (also known as prototypes) of each cluster are required; for example, when the interest is finding an originally existing geospatial location to represent each cluster instead of ending up with an interpolated location that might be physically unreachable.

Among the classical clustering methods, a paradigmatic and still widely employed one is the well-known $K$-means algorithm (Jain, 2010). Several strategies to accelerate the execution of $K$-means have been tried, ranging from graphics processing units (GPUs) (Li et al., 2013; Cuomo et al., 2019; Kohlhoff et al., 2011; Kohlhoff et al., 2013) and multicore architectures to hybrid architectures such as Xeon Phi coprocessors with Many Integrated Core and NVIDIA Kepler K20 (Jaros et al., 2017).

$K$-means represents each cluster using a mean vector that results from averaging all the feature vectors of the cluster objects. A variant of $K$-means that, in contrast, models each cluster by selecting an object from the original dataset is the so-called $K$-Medoids algorithm (Hastie et al., 2009, p. 515). As indicated by its name, this algorithm models each cluster by the medoid, that is, by the object whose sum of dissimilarities to the other objects in the cluster is minimal. $K$-Medoids has also been accelerated on GPUs; see (Kohlhoff et al., 2011; Wang et al., 2013). Other accelerations of $K$-Medoids, like the one in (Zhang et al., 2018), use P systems. Moreover, in (Song et al., 2017), the authors use the Hadoop platform and a CPU with 48 cores to improve its speed-up.
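To make the medoid criterion concrete, the following minimal Python sketch selects, from a precomputed dissimilarity matrix, the object with the smallest summed dissimilarity to the rest of its cluster. The function name `medoid_index` and the toy one-dimensional positions are assumptions made purely for illustration; they are not taken from any of the cited implementations.

```python
def medoid_index(dissim, members):
    """Return the member whose summed dissimilarity to all members is minimal."""
    return min(members, key=lambda i: sum(dissim[i][j] for j in members))


# Hypothetical toy data: five objects placed on a line.
positions = [0, 1, 2, 3, 10]
D = [[abs(a - b) for b in positions] for a in positions]

# Row sums are 16, 13, 12, 13 and 34, so the object at position 2 is the medoid.
medoid = medoid_index(D, list(range(len(positions))))
print(medoid)  # prints 2
```

Note that the outlier at position 10 barely shifts the medoid, because the sum criterion is dominated by the four nearby objects.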

Another algorithm that models each cluster with a valid object from the dataset is called $K$-Centres (Pěkalska et al., 2006a; Duin et al., 2007). Contrary to $K$-Medoids, $K$-Centres represents each cluster by choosing the object whose maximum dissimilarity to other objects in the cluster is the smallest one, i.e. the centre of the cluster. To the best of our knowledge and in contrast to $K$-means and $K$-Medoids, $K$-Centres has not been accelerated for multi-core or many-core architectures. However, we found a closely related algorithm called $K$-centers (Gonzalez, 1985; Dasgupta and Long, 2005), which maximizes the distances to the furthest objects associated with the centers. In spite of the similarity of the algorithm names, the approach used for selecting the objects that represent each cluster in $K$-Centres is different from the procedure used in $K$-centers. The $K$-centers algorithm has been accelerated for GPU in (Kohlhoff et al., 2011; Yutong et al., 2013).
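The centre criterion can be illustrated with a short Python sketch. It assumes a generic alternating scheme (assign every object to its nearest centre, then re-select each centre as the minimax object of its cluster); this excerpt does not confirm that the reference Matlab code or the GPU version proceeds exactly this way. The names `centre_index` and `k_centres`, the initial centres and the toy positions are hypothetical.

```python
def centre_index(dissim, members):
    """Return the member whose largest dissimilarity to the members is smallest."""
    return min(members, key=lambda i: max(dissim[i][j] for j in members))


def k_centres(dissim, initial_centres, max_iter=100):
    """Alternate nearest-centre assignment and minimax centre re-selection."""
    n = len(dissim)
    centres = list(initial_centres)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each object joins the cluster of its nearest centre.
        labels = [min(range(len(centres)), key=lambda c: dissim[i][centres[c]])
                  for i in range(n)]
        # Update step: the new centre minimises the maximum within-cluster dissimilarity.
        new_centres = []
        for c in range(len(centres)):
            members = [i for i in range(n) if labels[i] == c]
            # Keep the old centre if a cluster happens to be empty.
            new_centres.append(centre_index(dissim, members) if members else centres[c])
        if new_centres == centres:
            break  # centres are stable, so the labels will not change either
        centres = new_centres
    return centres, labels


# Hypothetical toy data: two groups of objects on a line.
positions = [0, 1, 2, 3, 10, 11, 12]
D = [[abs(a - b) for b in positions] for a in positions]
centres, labels = k_centres(D, initial_centres=[0, 1])
print(centres, labels)
```

Unlike the sum-based medoid, the minimax criterion is driven by the single farthest object in each cluster, so the selected centre moves towards outliers.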

Another popular procedure, originally based on the estimation of the probability density function (PDF) by a mixture of Gaussians, is clustering by mode seeking. As suggested by its name, mode seeking is aimed at modeling each cluster by a mode of the PDF, i.e. by one of its local maxima. There are different alternatives for estimating the PDF and finding its modes, particularly by (i) using the Parzen kernel to estimate the PDF and, afterward, following the gradient to associate the objects to the modes (this procedure is also known as mean shift) (Cheng, 1995a) or by (ii) applying the $k$-nearest neighbor ($k$-NN) rule such that both the estimation of the densities and the association of the objects to the modes are entirely based on pairwise dissimilarities (Pěkalska et al., 2006a; Duin et al., 2012). The authors of (Duin et al., 2012) showed that $k$-NN mode seeking is less costly than mean shift because the densities are computed as the inverse of the distances to the $k$th nearest neighbor, instead of by costly operations with kernels and gradients as done in mean shift. There are further improvements for $k$-NN mode seeking, for example to make it feasible for large-scale problems (Duin et al., 2012; Duin and Verzakov, 2017) or to introduce structural changes in the algorithm itself (Myhre et al., 2018). However, we restrict ourselves to the version implemented in (Duin et al., 2007).
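The $k$-NN variant can be sketched in a few lines of sequential Python. The sketch assumes that each neighbourhood includes the object itself, that each object points to the densest object in its neighbourhood, and that pointers are followed until an object points to itself (a mode); the tie-breaking by index is an assumption added here to rule out cycles. The function name `knn_mode_seeking` and the toy data are hypothetical, and the sketch is not the implementation of (Duin et al., 2007).

```python
def knn_mode_seeking(dissim, k):
    """Label each object with the index of the density mode it leads to."""
    n = len(dissim)
    # Sort every row of the dissimilarity matrix and keep the k nearest objects.
    # The secondary key puts the object itself first among zero-distance ties.
    neighbours = [sorted(range(n), key=lambda j: (dissim[i][j], j != i))[:k]
                  for i in range(n)]
    # Distance to the k-th nearest neighbour; the density is its inverse,
    # so a smaller radius means a higher density.
    radius = [dissim[i][neighbours[i][-1]] for i in range(n)]
    # Each object points to the densest object in its neighbourhood
    # (ties broken by index, which is an assumption of this sketch).
    pointer = [min(neighbours[i], key=lambda j: (radius[j], j)) for i in range(n)]
    labels = []
    for i in range(n):
        j = i
        while pointer[j] != j:  # follow pointers until a mode is reached
            j = pointer[j]
        labels.append(j)
    return labels


# Hypothetical toy data: two groups of three objects on a line.
positions = [0, 1, 2, 10, 11, 12]
D = [[abs(a - b) for b in positions] for a in positions]
labels = knn_mode_seeking(D, k=3)
print(labels)  # prints [1, 1, 1, 4, 4, 4]
```

The row-wise sort in the first step dominates the cost of this sketch, which is consistent with the attention the paper gives to sorting in the GPU version.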

We present GPU-based parallel versions of $K$-Centres and $k$-NN mode seeking and show their suitability for geospatial data analysis. GPUs have been chosen as platforms for the parallel algorithms due to their massive capacity for data processing, their increasing availability in modern general-purpose computers and, more importantly, because there has been an increasing interest in using them in industrial applications (Lopez et al., 2015). For the sake of research reproducibility as well as for exemplification under diverse scales, experiments are done with publicly available geospatial datasets of different sizes and compared against CPU sequential implementations.

The remainder of the paper is organized as follows. Notational conventions are given below. The sequential algorithms of $K$-Centres and $k$-NN mode seeking are presented in Section 2, along with a brief essential description of the GPU architecture. The proposed GPU algorithms are described in Section 3. Experimental results with public geospatial datasets are discussed in Section 4. Finally, our concluding remarks and suggestions for future work are given in Section 5.

Section snippets

Background

As a reference for the sequential implementations of both $K$-Centres and $k$-NN mode seeking, we considered the Matlab codes from (Duin et al., 2007), including the novel version of the second algorithm that was more recently proposed by the same first author in (Duin et al., 2012) and that is also described in (Duin and Verzakov, 2017). However, for a fair comparison against our GPU-based algorithms, we developed our own faster sequential versions in ANSI C.

GPU algorithms implementation

Both clustering algorithms require the computation of a square dissimilarity matrix $D_d$. Even though some particular dissimilarity measures might be appropriate for geospatial data (e.g. the geodesic distance (Schneider and Eberly, 2003)), here and without loss of generality, we restrict ourselves to the Euclidean distance. $k$-NN mode seeking also requires the repeated sorting of the distances to find the neighborhood used to estimate local densities.
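As a minimal sequential illustration of this first step, the sketch below fills a square Euclidean dissimilarity matrix for two-dimensional points. The function name `euclidean_matrix` and the sample coordinates are hypothetical, and treating locations as planar points is an assumption of this sketch. Each entry depends only on its own pair of points, which is the property that makes this step easy to parallelise; how the paper maps the work onto GPU threads is not shown in this excerpt.

```python
import math


def euclidean_matrix(points):
    """Return the square matrix of pairwise Euclidean distances."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # The matrix is symmetric, so each pair is computed only once.
            d = math.hypot(points[i][0] - points[j][0],
                           points[i][1] - points[j][1])
            dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical planar coordinates (for example, projected locations).
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = euclidean_matrix(points)
print(D[0][1], D[0][2], D[1][2])  # prints 5.0 10.0 5.0
```

The matrix needs memory quadratic in the number of objects, which fits the paper's concluding caveat that the objects to be labeled must fit completely in the GPU memory.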

Experimental setup

The experiments were carried out on a Dell PowerEdge-T630 with a GPU Tesla K40c. Table 3 shows the main characteristics of the GPU, CPU, and the software versions of the machine used to run the implementations.

Conclusions

In this paper, strategies to accelerate the $K$-Centres and $k$-NN mode seeking clustering algorithms on many-core architectures were presented. The proposed GPU implementations yield higher accelerations over the sequential versions, provided that the objects to be labeled fit completely in the GPU memory. The results obtained demonstrate that the GPU implementations of the algorithms are better suited for handling large amounts of data. Typically, larger datasets result

Authors’ contribution

Ana-Lorena Uribe-Hurtado: conceptualization, methodology, software developer – original draft preparation. Mauricio Orozco-Alzate: supervisor, reviewing, introduction writer and editing. Bernardete Ribeiro: reviewing, resources support. Noel Lopes: software development support, reviewing, validation.

Conflict of interest

None declared.

Acknowledgements

The first author acknowledges funding provided by Universidad Nacional de Colombia through “Convocatoria para la Movilidad Internacional de la Universidad Nacional de Colombia 2017-2018”. The Center of Informatics and Systems of the University of Coimbra (CISUC) is also acknowledged.

References (45)

S. Cuomo et al.
A GPU-accelerated parallel K-means algorithm
Comput. Electr. Eng.
(2019)
S. Dasgupta et al.
Performance guarantees for hierarchical clustering
J. Comput. Syst. Sci.
(2005)
C.M. Flath et al.
Towards a data science toolbox for industrial analytics applications
Comput. Ind.
(2018)
T.F. Gonzalez
Clustering to minimize the maximum intercluster distance
Theoret. Comput. Sci.
(1985)
M. Gowanlock et al.
A hybrid CPU/GPU approach for optimizing sorting throughput
Parallel Comput.
(2019)
A.K. Jain
Data clustering: 50 years beyond K-means
Pattern Recogn. Lett.
(2010)
M. Jaros et al.
Implementation of K-means segmentation algorithm on Intel Xeon Phi and GPU: application in medical imaging
Adv. Eng. Softw.
(2017)
Y. Li et al.
Speeding up k-Means algorithm by GPUs
J. Comput. Syst. Sci.
(2013)
F. Lopez et al.
Particle filtering on GPU architectures for manufacturing applications
Comput. Ind.
(2015)
E. Pěkalska et al.
Prototype selection for dissimilarity-based classifiers
Pattern Recogn.
(2006)
P.J. Schneider et al.
J. Parallel Distrib. Comput.
(2008)
Q. Zhao et al.
A grid-growing clustering algorithm for geo-spatial data
Pattern Recogn. Lett.
(2015)
R.K. Barik et al.
GeoFog4Health: a fog-based SDI framework for geospatial health big data analysis
J. Amb. Intell. Human. Comput.
(2019)
G. Boeing
Spatial Cluster Analysis With Python
(2016)
G. Boeing
Clustering to reduce spatial data set size
SocArXiv
(2018)
A. Cano
A survey on graphic processing unit computing for large-scale data mining
Wiley Interdisc. Rev.: Data Mining Knowl. Discov.
(2018)
T.C. Chen et al.
Simplified odd-even sort using multiple shift-register loops
Int. J. Comput. Inform. Sci.
(1978)
S. Chen et al.
An efficient sorting algorithm with CUDA
J. Chin. Inst. Engrs. Trans. Chin. Inst. Engrs. Ser. A/Chung-Kuo Kung Ch’eng Hsuch K’an
(2009)
J. Cheng et al.
CUDA programming model
Professional CUDA C Programming
(2014)
Y. Cheng
Mean shift, mode seeking, and clustering
IEEE Trans. Pattern Anal. Mach. Intell.
(1995)

