GPU-based fast clustering via K-Centres and k-NN mode seeking for geospatial industry applications
Introduction
Data science, the study and application of methods for extracting valuable knowledge from large datasets, has rapidly outgrown its original research environment and permeated nearly all aspects of daily life and the economy. As a result, a new concept called the Data Industry (Tang, 2016) has emerged. This novel industry faces numerous technological challenges, such as how to cope with continuous data streams (Nguyen et al., 2015) and with the resulting overwhelming volume of data, which pushes the boundaries of computer systems and techniques.
A particularly active field within the data industry is the analysis of the massive amounts of data generated by Global Positioning System (GPS) receivers, which are now built into omnipresent consumer devices such as smartphones and tablets and therefore allow the ubiquitous production of valuable geospatial information.
Among the applications that may profit from the analysis of geospatial data, the following can be highlighted: location-based business intelligence, transportation planning, models for emergency and disaster management, optimization of health care resources (Barik et al., 2019), studies for environmental policies, logistics for supply chains, and the efficient delivery of goods in commerce.
Because data streams such as the geospatial ones keep growing, powerful data analytic tools (Flath and Stein, 2018) are required, and these must rely on fast and efficient data analysis algorithms. Data clustering (Wagstaff, 2012), the task of associating data into meaningful groups, is among the most widely applied and often most computationally demanding of such tools, and it has recently been increasingly used for the analysis of geospatial locations; see (Zhao et al., 2015, Boeing, 2018). The computational cost of clustering depends not only on the size of the incoming datasets and the complexity of the algorithms themselves, but also on the repetitions that are typically required to estimate the appropriate number of clusters. Moreover, frequent updates of the results are needed when the data change rapidly over time, e.g. the current locations of moving objects on the streets.
Clustering is typically motivated by one or several of the following needs: detecting outliers, selecting prototypes, summarizing data, and discovering hidden homogeneous structures. A plethora of clustering methods has been proposed, both classical and state-of-the-art, and they are usually categorized as either partitional or agglomerative hierarchical. Yet another dichotomy distinguishes between algorithms that model each cluster by selecting an object that is a member of the original dataset and algorithms that model each cluster by an average of its objects. The former are typically applied when valid representative examples (also known as prototypes) of each cluster are required, for example when the goal is to find an existing geospatial location to represent each cluster instead of an interpolated location that might be physically unreachable.
Among the classical clustering methods, a paradigmatic and still widely employed one is the well-known K-means algorithm (Jain, 2010). Several strategies to accelerate the execution of K-means have been tried, ranging from graphics processing units (GPUs) (Li et al., 2013, Cuomo et al., 2019, Kohlhoff et al., 2011, Kohlhoff et al., 2013) and multicore architectures to hybrid architectures such as Xeon Phi coprocessors with Many Integrated Core and NVIDIA Kepler K20 (Jaros et al., 2017).
K-means represents each cluster by a mean vector obtained by averaging the feature vectors of all objects in the cluster. A variant that instead models each cluster by selecting an object from the original dataset is the so-called K-Medoids algorithm (Hastie et al., 2009, p. 515). As its name indicates, this algorithm models each cluster by its medoid, that is, the object whose sum of dissimilarities to the other objects in the cluster is minimal. K-Medoids has also been accelerated on GPUs; see (Kohlhoff et al., 2011, Wang et al., 2013). Other accelerations of K-Medoids, such as the one in (Zhang et al., 2018), use P systems. Moreover, in (Song et al., 2017), the authors use the Hadoop platform and a CPU with 48 cores to improve its speed-up.
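To make the difference between the two representatives concrete, the following minimal sketch computes both for a small cluster. It assumes Euclidean dissimilarities and uses hypothetical helper names (`cluster_mean`, `cluster_medoid`) and made-up coordinates; it is an illustration of the definitions above, not the implementation of any of the cited works.

```python
import math

def cluster_mean(points):
    """Average of the feature vectors (the K-means representative)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def cluster_medoid(points):
    """Member whose summed dissimilarity to all members is minimal."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

# Illustrative cluster with one distant object (made-up data).
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (10.0, 10.0)]

mean = cluster_mean(cluster)      # an interpolated location
medoid = cluster_medoid(cluster)  # always one of the original objects

print("mean:", mean, "is a member:", mean in cluster)
print("medoid:", medoid, "is a member:", medoid in cluster)
```

In this example the mean does not coincide with any object of the cluster, whereas the medoid is by construction one of them, which is the property that matters when a representative must be an existing location.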
Another algorithm that models each cluster with a valid object from the dataset is called K-Centres (Pěkalska et al., 2006a, Duin et al., 2007). In contrast to K-Medoids, K-Centres represents each cluster by the object whose maximum dissimilarity to the other objects in the cluster is the smallest, i.e. the centre of the cluster. To the best of our knowledge, and in contrast to K-means and K-Medoids, K-Centres has not been accelerated for multi-core or many-core architectures. However, we found a closely related algorithm called K-centers (Gonzalez, 1985, Dasgupta and Long, 2005), that maximizes the distances to the furthest objects associated with the centers. In spite of the similarity of the algorithm names, the approach used for selecting the objects that represent each cluster in K-Centres is different from the procedure used in K-centers. The K-centers algorithm has been accelerated for GPU in (Kohlhoff et al., 2011, Yutong et al., 2013).
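The centre criterion can be sketched in a few lines and compared with the medoid criterion on the same data. The sketch below is only an illustration under stated assumptions: it works on a precomputed Euclidean dissimilarity matrix, the function names (`dissimilarity_matrix`, `cluster_centre`, `cluster_medoid`) are hypothetical, and it shows the selection of a representative for one cluster, not the complete iterative algorithm of the cited references.

```python
import math

def dissimilarity_matrix(points):
    """Full square matrix of pairwise Euclidean distances."""
    return [[math.dist(p, q) for q in points] for p in points]

def cluster_centre(D, members):
    """Index of the member whose largest dissimilarity to the members is smallest."""
    return min(members, key=lambda i: max(D[i][j] for j in members))

def cluster_medoid(D, members):
    """Index of the member whose summed dissimilarity to the members is smallest."""
    return min(members, key=lambda i: sum(D[i][j] for j in members))

# Made-up coordinates; all four objects belong to the same cluster.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (10.0, 10.0)]
D = dissimilarity_matrix(points)
members = [0, 1, 2, 3]

centre_index = cluster_centre(D, members)
medoid_index = cluster_medoid(D, members)
print("centre:", points[centre_index], "medoid:", points[medoid_index])
```

For these points the two criteria select different objects: the centre is pulled towards the distant object because it minimizes the worst-case dissimilarity, while the medoid stays with the bulk of the cluster.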
Another popular procedure, originally based on the estimation of the probability density function (PDF) by a mixture of Gaussians, is clustering by mode seeking. As suggested by its name, mode seeking is aimed at modeling each cluster by a mode of the PDF, i.e. by one of its local maxima. There are different alternatives for estimating the PDF and finding its modes, particularly (i) using the Parzen kernel to estimate the PDF and then following the gradient to associate the objects to the modes (this procedure is also known as mean shift) (Cheng, 1995a), or (ii) applying the k-nearest neighbor (k-NN) rule such that both the estimation of the densities and the association of the objects to the modes are entirely based on pairwise dissimilarities (Pěkalska et al., 2006a, Duin et al., 2012). The authors of (Duin et al., 2012) showed that k-NN mode seeking is less costly than mean shift because the densities are computed as the inverse of the distances to the kth nearest neighbor, instead of by costly operations with kernels and gradients as done in mean shift. There are further improvements of k-NN mode seeking, for example to make it feasible for large-scale problems (Duin et al., 2012, Duin and Verzakov, 2017) or to introduce structural changes in the algorithm itself (Myhre et al., 2018). However, we restrict ourselves to the version implemented in (Duin et al., 2007).
We present GPU-based parallel versions of K-Centres and k-NN mode seeking and show their suitability for geospatial data analysis. GPUs have been chosen as platforms for the parallel algorithms due to their massive data processing capacity, their increasing availability in modern general-purpose computers and, more importantly, because there has been an increasing interest in using them in industrial applications (Lopez et al., 2015). For the sake of research reproducibility as well as for exemplification under diverse scales, experiments are done with publicly available geospatial datasets of different sizes and compared against CPU sequential implementations.
The remainder of the paper is organized as follows. Notational conventions are given below. The sequential algorithms of K-Centres and k-NN mode seeking are presented in Section 2, along with a brief essential description of the GPU architecture. The proposed GPU algorithms are described in Section 3. Experimental results with public geospatial datasets are discussed in Section 4. Finally, our concluding remarks and suggestions for future work are given in Section 5.
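The k-NN variant described above can be sketched sequentially as follows. This is a minimal illustration, not the code of (Duin et al., 2007): the function name `knn_mode_seeking` is hypothetical, Euclidean distances are assumed, the object itself is assumed not to count among its k neighbors, and the data are assumed to contain no duplicate objects (a zero kth-neighbor distance would make the density undefined).

```python
import math

def knn_mode_seeking(points, k):
    """Label each object with the index of the density mode it is attracted to."""
    n = len(points)
    D = [[math.dist(p, q) for q in points] for p in points]
    # Neighbors of each object sorted by distance; position 0 is the object itself.
    order = [sorted(range(n), key=lambda j: D[i][j]) for i in range(n)]
    # Density estimate: inverse of the distance to the k-th nearest neighbor.
    density = [1.0 / D[i][order[i][k]] for i in range(n)]
    # Each object points to the densest object in its own k-neighborhood.
    pointer = [max(order[i][:k + 1], key=lambda j: density[j]) for i in range(n)]
    labels = []
    for i in range(n):
        m = i
        while pointer[m] != m:  # follow pointers until a mode (fixed point) is reached
            m = pointer[m]
        labels.append(m)
    return labels

# Two well-separated groups of made-up points, each with a denser middle object.
group_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
group_b = [(x + 10.0, y + 10.0) for x, y in group_a]
labels = knn_mode_seeking(group_a + group_b, k=2)
print(labels)
```

Objects that end at the same mode share a label, so the number of clusters is a consequence of the neighborhood size k rather than an input. The sketch needs only pairwise dissimilarities and a sort per object, with no kernels or gradients.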
Background
As a reference for the sequential implementations of both K-Centres and k-NN mode seeking, we considered the Matlab codes from (Duin et al., 2007), including the novel version of the second algorithm that was more recently proposed by the same first author in (Duin et al., 2012) and that is also described in (Duin and Verzakov, 2017). However, for a fair comparison against our GPU-based algorithms, we developed our own faster sequential versions in ANSI C.
GPU algorithms implementation
Both clustering algorithms require the computation of a square dissimilarity matrix. Even though some particular dissimilarity measures might be appropriate for geospatial data (e.g. the geodesic distance (Schneider and Eberly, 2003)), here and without loss of generality, we restrict ourselves to the Euclidean distance. In addition, k-NN mode seeking requires the repeated sorting of the distances to find the neighborhoods used to estimate local densities.
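The two building blocks mentioned here can be illustrated with a small sequential sketch. It is written so that every matrix entry is computed independently from a linear index, which mimics a one-thread-per-entry decomposition; that mapping, the row-major flat storage, and the function names (`entry`, `dissimilarity_matrix_flat`, `kth_neighbour_distance`) are assumptions made for illustration and are not taken from the paper's GPU kernels.

```python
import math

def entry(points, tid):
    """Distance for the matrix entry addressed by the linear index tid (row-major)."""
    n = len(points)
    i, j = divmod(tid, n)
    return math.dist(points[i], points[j])

def dissimilarity_matrix_flat(points):
    """Square Euclidean dissimilarity matrix stored as a flat row-major list."""
    n = len(points)
    # Every entry is independent of the others, so this loop is trivially parallel.
    return [entry(points, tid) for tid in range(n * n)]

def kth_neighbour_distance(flat, n, i, k):
    """Distance from object i to its k-th nearest neighbor, found by sorting row i."""
    row = sorted(flat[i * n:(i + 1) * n])
    return row[k]  # row[0] is the zero distance of the object to itself

# Made-up collinear points.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
n = len(points)
flat = dissimilarity_matrix_flat(points)
first = kth_neighbour_distance(flat, n, 0, 1)
second = kth_neighbour_distance(flat, n, 0, 2)
print(first, second)
```

Sorting a full row per object is the simplest way to obtain the neighborhood; it also makes clear why sorting dominates the cost of k-NN mode seeking once the matrix is available.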
Experimental setup
The experiments were carried out on a Dell PowerEdge-T630 equipped with a Tesla K40c GPU. Table 3 shows the main characteristics of the GPU and the CPU, as well as the software versions of the machine used to run the implementations.
Conclusions
In this paper, strategies to accelerate the K-Centres and k-NN mode seeking clustering algorithms on many-core architectures were presented. The new computational proposals on GPU guarantee higher accelerations over the sequential versions, provided that the objects to be labeled fit completely in the GPU memory. The results obtained demonstrate that the GPU implementations of the algorithms are better suited for handling large amounts of data. Typically, larger datasets result
Authors’ contribution
Ana-Lorena Uribe-Hurtado: conceptualization, methodology, software developer – original draft preparation. Mauricio Orozco-Alzate: supervisor, reviewing, introduction writer and editing. Bernardete Ribeiro: reviewing, resources support. Noel Lopes: software development support, reviewing, validation.
Conflict of interest
None declared.
Acknowledgements
The first author acknowledges funding provided by Universidad Nacional de Colombia through “Convocatoria para la Movilidad Internacional de la Universidad Nacional de Colombia 2017-2018”. Center of Informatics and Systems of the University of Coimbra (CISUC) is also acknowledged.
References (45)
- et al. GPU-accelerated parallel k-means algorithm. Comput. Electr. Eng. (2019)
- et al. Performance guarantees for hierarchical clustering. J. Comput. Syst. Sci. (2005)
- et al. Towards a data science toolbox for industrial analytics applications. Comput. Ind. (2018)
- Clustering to minimize the maximum intercluster distance. Theoret. Comput. Sci. (1985)
- et al. A hybrid CPU/GPU approach for optimizing sorting throughput. Parallel Comput. (2019)
- Data clustering: 50 years beyond K-means. Pattern Recogn. Lett. (2010)
- et al. Implementation of K-means segmentation algorithm on Intel Xeon Phi and GPU: application in medical imaging. Adv. Eng. Softw. (2017)
- et al. Speeding up k-Means algorithm by GPUs. J. Comput. Syst. Sci. (2013)
- et al. Particle filtering on GPU architectures for manufacturing applications. Comput. Ind. (2015)
- et al. Prototype selection for dissimilarity-based classifiers. Pattern Recogn. (2006)
- Distance in 3D. Geometric Tools for Computer Graphics, Computer Graphics
- Fast parallel GPU-sorting using a hybrid algorithm. J. Parallel Distrib. Comput.
- A grid-growing clustering algorithm for geo-spatial data. Pattern Recogn. Lett.
- GeoFog4Health: a fog-based SDI framework for geospatial health big data analysis. J. Amb. Intell. Human. Comput.
- Spatial Cluster Analysis With Python
- Clustering to reduce spatial data set size. SocArXiv
- A survey on graphic processing unit computing for large-scale data mining. Wiley Interdisc. Rev.: Data Mining Knowl. Discov.
- Simplified odd-even sort using multiple shift-register loops. Int. J. Comput. Inform. Sci.
- An efficient sorting algorithm with CUDA. J. Chin. Inst. Engrs. Trans. Chin. Inst. Engrs. Ser. A/Chung-Kuo Kung Ch’eng Hsuch K’an
- CUDA programming model. Professional CUDA C Programming
- Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell.