Large-scale learning algorithms are essential for modern data collections that may have billions of data points. Here, we study the design of parallel \(k\)-clustering algorithms, which include the \(k\)-median, \(k\)-medoids, and \(k\)-means clustering problems. We design efficient parallel algorithms for these problems and prove that they still compute constant-factor approximations to the optimal solution for stable clustering instances. In addition to our theoretic results, we present computational experiments that show that our \(k\)-median and \(k\)-means algorithms work well in practice—we are able to find better clusterings than state-of-the-art coreset constructions using samples of the same size.
1 Introduction
Modern data collections require algorithms that can handle massive amounts of data. In many applications involving online content (e.g., videos, images, text posts), it is common to have billions of data points. Therefore, the design and development of large-scale learning algorithms has become a compelling research area. In particular, recently large-scale unsupervised learning algorithms have been developed for hierarchical clustering [13], graph partitioning [6], and weighted set cover [24].
Here we study \(k\)-clustering in a parallel computing environment. Given a set of points \(X\subset\mathbb{R}^{\mathfrak{m}}\) that we wish to cluster, a distance function (metric) \(d\), and some constant \(p{\gt}0\), our \(k\)-clustering problem is to find \(k\) centers \(c_{1},\ldots,c_{k}\) that minimize \(\sum_{x\in X}\min_{i}d(x,c_{i})^{p}\). We study two variations of this problem—where the centers must be chosen from \(X\), and where they may be any points in \(\mathbb{R}^{\mathfrak{m}}\). We refer to the former problem as constrained \(k\)-clustering, and to the latter problem as unconstrained \(k\)-clustering. Note that the former problem for \(p=1\) is equivalent to \(k\)-median clustering, and for \(p=2\) it is equivalent to \(k\)-medoids clustering. The latter problem for \(p=2\) is equivalent to \(k\)-means clustering.
The \(k\)-median and \(k\)-means clustering problems are very well-studied, and many algorithms have been proposed that are suited to different settings and datasets of different sizes (see Section 2 for an overview). Here we study the batch clustering setting, where we wish to cluster all \(n\) data points at once, and \(n\) may be quite large. Note that the batch setting is different from the online setting (where the data points arrive one at a time) and the streaming setting (where we are interested in limiting memory consumption rather than the overall run time).
A heuristic way to approach a batch clustering task of this scale is to first partition the data onto many machines using some application-specific criteria. For example, we may partition business listings based on geographic location or partition newspaper articles based on the publisher. We can then cluster the data on each machine in parallel. However, this means that the data points assigned to different machines cannot be in the same cluster. If our initial partition strategy is not consistent with the intended clustering, we then severely limit the quality of the clustering we can output. Another approach is to make strong assumptions about the structure of the data, which can give clustering instances that are easy to solve efficiently even for very large datasets (see Section 4).
In this work we design efficient \(k\)-clustering algorithms that easily scale to very large datasets. Our algorithms do not use any partitioning heuristics and make more realistic assumptions about the structure of the data. The first step of our approach is to sample a set of points \(S\subset X\) uniformly at random. We then compute a Voronoi decomposition of \(X\) with respect to the points in \(S\). Observe that the Voronoi decomposition may be computed in a distributed fashion in time \(tn/m\), where \(t=|S|\) is the size of the sample, \(n\) is the size of the dataset, and \(m\) is the number of machines. Moreover, in our analysis, we show that setting \(t=O(k\textrm{log}\frac{k}{\delta})\) is sufficient (here \(\delta\) is an approximation parameter). Given that \(m\) can easily be on the order of \(10^{4}\), the parallel Voronoi decomposition may then only take a few minutes even for very large datasets. We then compute the clustering by running an in-memory algorithm (on a single machine) that only considers the points in \(S\), whose run time is polynomial in \(t\). However, our algorithm is still able to approximate the objective value for the entire dataset using the information from the Voronoi cells.
We make subtle design choices regarding the information that we compute for each Voronoi cell. For each seed point \(s\in S\), let \(v(s)\) denote the points in its Voronoi cell. In the constrained setting (where the centers must be chosen from the dataset), it suffices to compute the size of \(v(s)\), and the sum of the distances between \(s\) and the points in \(v(s)\). In the unconstrained setting, we can further exploit the property that the centers may not be actual points, and also compute the sum of the points in \(v(s)\), which is later used to compute the average of the points in each potential cluster. Note that in each setting it is possible to compute these quantities in parallel.
1.1 Our Results
We design and develop parallel \(k\)-clustering algorithms with strong performance guarantees. Our general problem statement enables us to optimize any \(k\)-clustering objective, where the centers may or may not be restricted to the data points. This problem statement subsumes the \(k\)-median, \(k\)-medoids, and \(k\)-means clustering problems.
In particular, we prove that for stable clustering instances, with probability at least \(1-\delta\), our approach gives a 6-approximation to the optimal solution for the \(k\)-median objective function, a 40-approximation for the \(k\)-medoids objective, and a 10-approximation for the \(k\)-means objective. On a single machine our algorithms have a run time of \(O(nt+t^{3})\), where \(t=O(k\textrm{log}\frac{k}{\delta})\) is the size of the sample used by the algorithms. In a distributed computing environment with \(m\) machines, the algorithms can be implemented in time \(O(\frac{nt}{m}+t^{3})\). Our algorithms therefore give better approximation ratios than alternative solutions and/or naturally scale to much larger datasets (see Section 2 for a detailed comparison).
In our analysis we study the setting where the optimal clustering is stable—it is resilient to perturbations of the distances between the points and/or the cluster assignments have a certain margin. These assumptions have been studied in a variety of previous work (see, in particular Awasthi et al. [5], Balcan and Liang [12], Ben-David and Reyzin [14]). When the centers are restricted to the data points, we mildly strengthen the stability assumptions compared to the best known results in order to design much more scalable algorithms (see Section 2.4). When the centers are not restricted to the data points, we are in fact able to relax the stability assumptions in some cases (see Section 2.4), while also scaling to much larger datasets.
To complement our theoretic results, in Section 7, we present computational experiments that show that our \(k\)-median and \(k\)-means algorithms work well in practice. We show that we are able to compute clusterings with better objective value when compared to alternative solutions using samples of points of the same size.
2 Related Work
In this section, we survey the existing algorithms for \(k\)-median, \(k\)-means, and \(k\)-medoids clustering. Our survey includes more computationally intensive algorithms that do not scale well, as well as approaches that are designed to be more scalable. Here we also provide the relevant background for the stability properties considered in our work and discuss how our algorithms compare to the best known results in these settings.
2.1 K-Median Clustering
There are several constant-factor approximations for \(k\)-median. Charikar et al. [15] give a \(6\frac{2}{3}\)-approximation algorithm, and Jain and Vazirani [26] give a \(6\)-approximation. Both approaches use linear programming. Arya et al. [3] give a \((3+\epsilon)\)-approximation that uses local search. In particular, Arya et al. [3] present an algorithm that swaps \(p\) centers in each stage and gives an approximation ratio of \(3+2/p\). The most efficient version of this algorithm (swapping one center in each stage) gives a 5-approximation. The best known approximation algorithm for \(k\)-median by Li and Svensson [30] gives a \((1+\sqrt{3}+\epsilon)\)-approximation.
In order to handle more data several approaches have been proposed. One approach is to first sample a set of points \(S\subset X\), and then use one of these constant-factor approximation algorithms to cluster \(S\), which still gives a good approximation for \(X\) (see Mishra et al. [36], Czumaj and Sohler [20], Meyerson et al. [35]). The limitation of these sampling approaches is that we still need to compute a constant-factor approximation on the points in the sample. Such approximations are slow to compute and must run on a single machine. Given that these sampling approaches require fairly large sample sizes that may also depend on the data diameter, they become infeasible for very large datasets.
Another approach to handle large data is to use coresets—small subsets of representative points that enable approximating the objective value on the full dataset up to a multiplicative factor of \(1\pm\epsilon\). Feldman and Langberg [23] present a coreset of size \(t=O(\frac{kd}{\epsilon^{2}})\) that runs in time \(O(ndk+\textrm{log}^{2}n+k^{2}+t\ \textrm{log}\ n)\), where \(n\) is the number of data points and \(d\) is the dimension of the data. Their algorithm takes a \(k\)-median solution as input, and samples points based on their contribution to the objective in this solution. However, the analysis requires first computing a constant-factor approximation for the entire dataset, which is infeasible for large datasets. In Section 7, we compare the performance of our \(k\)-median algorithm with the algorithm of Feldman and Langberg [23]. In our implementation of the construction of Feldman and Langberg [23], we use a heuristic to compute the initial \(k\)-median solution.
Balcan et al. [10] improve on the limitations of Feldman and Langberg [23] by showing how to construct the coreset in a distributed fashion on \(m\) machines, where each machine computes a constant-factor approximation on a subset of the data of size \(n/m\). The size of the resulting coreset is \(O(\frac{kd}{\epsilon^{2}}+mk)\). However, when the number of machines is large (say \(m=10^{4}\)), this still gives a coreset that is too large, given that we must compute a constant-factor approximation on the points in the coreset in order to preserve the approximation ratio on the full dataset. On the other hand, choosing a smaller \(m\) is problematic because then \(n/m\) (the size of each subset) becomes too large, and we will not be able to compute a constant-factor approximation on each subset.
Recently, a lot of effort has gone into designing coresets that are as small as possible, establishing near-optimal lower bounds on the coreset size (see Cohen-Addad et al. [17]), and then providing algorithms that almost match these bounds. In particular, Cohen-Addad et al. [18] and Huang et al. [25] give coresets of size \(\tilde{O}(\textrm{min}(k^{4/3}\cdot\epsilon^{-2},k\cdot\epsilon^{-4}))\), where \(\tilde{O}(x)\) hides multiplicative factors that are polylogarithmic in \(x\). However, we still need to cluster the points in the coreset, which adds to the overall complexity of these approaches. In particular, if we wish to preserve the very good approximation factor given by the coreset, we need to use a constant-factor approximation algorithm to cluster the coreset, which increases the overall complexity of the solution. Note that we also have to ensure that the coreset construction itself has an efficient implementation.
Ene et al. [22] present a MapReduce algorithm for \(k\)-median clustering. It uses a weighted \(k\)-median \(c\)-approximation algorithm as a subroutine and then outputs a (\(10c+3\))-approximation to the optimal solution. Using the \((1+\sqrt{3}+\epsilon)\)-approximation algorithm from Li and Svensson [30] as a subroutine, this construction at best gives a 30-approximation.
The online \(k\)-median algorithm from Lattanzi and Vassilvitskii [29] optimizes for clustering consistency in the online setting, where points \(x_{1},x_{2},\ldots x_{n}\) arrive one at a time, rather than the overall run time for clustering \(n\) points. Their construction requires calling a constant-factor approximation algorithm for \(k\)-median \(n\) times. Similarly, the streaming algorithm for \(k\)-median from Charikar et al. [16] optimizes for memory efficiency, rather than the overall run time. Its construction requires solving a facility-location problem in each phase, where the number of phases is bounded by the total number of points.
Cohen-Addad et al. [19] design a parallel clustering algorithm for the hierarchical \(k\)-median problem. For any fixed value of \(k\) their algorithm gives an \(O(\textrm{min}(d,\textrm{log}\ n)\textrm{log}\ \Delta)\)-approximation to the optimal solution in expectation, where \(\Delta\) is the ratio between the maximum and minimum pairwise distance of the input data points.
Voevodski [39] presents a scalable \(k\)-median clustering algorithm. The theoretic analysis considers the same stability assumptions that are studied here in the constrained setting. Voevodski [39] also presents strong experimental results when compared to the algorithm of Feldman and Langberg [23]. The work presented here extends the results of Voevodski [39] to other clustering objectives. In particular, here we generalize the clustering objective, stability assumptions, and the theoretic analysis presented in Voevodski [39]. In this work, we also design and develop a new, more specialized \(k\)-means algorithm (see Section 6) and evaluate its performance empirically (see Section 7).
2.2 K-Means Clustering
The well-known \(k\)-means++ algorithm from Arthur and Vassilvitskii [2] computes an \(O(\textrm{log}\ k)\)-approximation to the optimal solution. The center initialization of the algorithm is inherently sequential and requires \(O(kn)\) time, where \(n\) is the number of data points. This approach does not scale to very large datasets where both \(n\) and \(k\) may be large. In order to address this limitation, Bahmani et al. [8] present a parallel version of the algorithm that samples \(O(k)\) points in each round and runs for \(O(\textrm{log}\ n)\) rounds. The sampling in each round may be done in parallel. Their analysis shows that if we use a \(c\)-approximation algorithm to cluster the selected points, we get an \(O(c)\)-approximation to the optimal solution.
Ailon et al. [1] present a streaming \(k\)-means algorithm that opens \(O(k\ \textrm{log}\ k)\) centers and gives an \(O(1)\)-approximation to the optimal solution. Their algorithm runs a batch clustering algorithm on small inputs that fit into memory; the outputs are then combined in a hierarchical manner.
Bachem et al. [7] present a parallel coreset construction for \(k\)-means clustering. The coreset enables approximating the \(k\)-means objective on the full dataset up to a multiplicative factor of \(1\pm\epsilon\). The size of the coreset is \(O((dk\ \textrm{log}\ k)/\epsilon^{2})\), where \(d\) is the dimension of the data. Their coreset construction samples each point proportionally to the squared distance to the mean of the dataset, while also ensuring that each point has some uniform probability of being selected (see Bachem et al. [7]). In Section 7, we compare our \(k\)-means algorithm with the algorithm of Bachem et al. [7].
Cohen-Addad et al. [18] and Huang et al. [25] significantly improve the size of the coreset, giving coresets of size \(\tilde{O}(\textrm{min}(k^{3/2}\cdot\epsilon^{-2},k\cdot\epsilon^{-4}))\), where \(\tilde{O}(x)\) hides multiplicative factors that are polylogarithmic in \(x\). Note that the dependence on \(d\) is also removed completely, which is a major advancement. However, as before, we still need to add the cost of clustering the points in the coreset, as well as the cost of computing the coreset itself, to the overall complexity of the solution.
Peng et al. [37] design a neural network implementation of Lloyd’s \(k\)-means algorithm, restating its steps using differentiable operations. They also present favorable experimental results on well-known datasets.
2.3 Other Clustering Objectives
The best known approximation algorithm for the \(k\)-medoids problem gives a \((9+\epsilon)\)-approximation to the optimal solution [27]. The run time of this algorithm is polynomial in the size of the dataset, and it does not scale very well to large datasets.
Large-scale algorithms have also been studied in the context of \(k\)-center clustering [34] and Gaussian mixture models [33]. The coreset for Gaussian mixture models from Lucic et al. [33] uses \(k\)-means++ as a subroutine, and the overall construction is very similar to the centralized coreset construction of Feldman and Langberg [23] for \(k\)-median.
Recently there have also been efforts to restate clustering as a supervised learning problem. However, doing this requires synthetically generating labels from unlabeled data. One approach is “data augmentation”—creating new data points that are modified copies of the original data. Positive labels are then assigned to pairs of points that originate from the same point (see Li et al. [31] and Li et al. [32]). Another approach is to assign positive labels to pairs of points that are nearest neighbors in an embedded space (see Van Gansbeke et al. [38]). These supervised formulations are then inherently more scalable than traditional clustering methods. In particular, a classifier may be trained on these labels using parallel batch gradient descent.
2.4 Stability Properties
The \(\alpha\)-perturbation stability and \(\alpha\)-center proximity properties considered here are also studied by Awasthi et al. [5] and Balcan and Liang [12]. Awasthi et al. [5] give a polynomial-time algorithm for \(\alpha\)-perturbation stable instances for any center-based clustering objective in a metric space. Their algorithm works for \(\alpha\geq 3\) when the centers are restricted to only the data points. Balcan and Liang [12] improve this bound to \(\alpha\geq 1+\sqrt{2}\) using a more sophisticated algorithm and also study a noisy version of the problem where only a subset of the data points satisfy \(\alpha\)-perturbation stability.
The algorithms of Awasthi et al. [5] and Balcan and Liang [12] compute a hierarchical clustering of the data points and then use dynamic programming to find the best \(k\)-pruning of the hierarchical clustering tree. Given the stability assumptions, they return the optimal clustering, but their run time is infeasible for large datasets. It takes \(O(n^{3})\) time to compute a hierarchical clustering, where \(n\) is the number of data points. Computing the dynamic programming solution also requires \(O(n^{2})\) time just to compute the cost of a single cluster for each sub-tree. In our work we mildly strengthen the assumptions on \(\alpha\) in this setting in order to design much more scalable algorithms (see Section 5).
The complexity of perturbation stability is studied by Ben-David and Reyzin [14]. They prove that finding the optimal \(k\)-median clustering is NP-hard if the data satisfies \(\alpha\)-perturbation stability for \(\alpha{\lt}2\). On the other hand, they show that if \(\alpha\geq 2+\sqrt{3}\), then the optimal clustering satisfies strict separation (see Definition 3.8). When this is the case, Balcan et al. [9] show that we can construct a hierarchical clustering of the data using single-linkage, and some pruning of this tree must be equivalent to the optimal clustering. As before, this pruning may be computed using dynamic programming. However, this algorithm is still not feasible for large-scale data.
Clustering instances satisfying \(\alpha\)-center proximity are known to be harder when the centers may be any points in \(\mathbb{R}^{\mathfrak{m}}\) (as opposed to only the data points). Here we refer to this setting as unconstrained \(k\)-clustering. Awasthi et al. [5] show that in this setting finding the optimal \(k\)-median clustering is NP-hard for \(\alpha{\lt}3\), and present an algorithm that is able to find the optimal clustering only for \(\alpha\geq 2+\sqrt{3}\). We are in fact able to relax the assumptions on \(\alpha\) in this setting in some cases, depending on how the points are distributed around the centers of the optimal clustering (see Section 6), while also scaling to much larger datasets.
The \(\alpha\)-center proximity property is closely related to the \(\gamma\)-margin property from Ashtiani et al. [4]. This property is studied by Ashtiani et al. [4] in the context of supervised \(k\)-means. The \(\alpha\)-perturbation stability property is also studied by Balcan et al. [11] in the context of \(k\)-center clustering.
3 Preliminaries
Suppose we are given a set of \(n\) points \(X\subset\mathbb{R}^{\mathfrak{m}}\) and a metric \(d\). Given some constant \(p{\gt}0\), our \(k\)-clustering problem is to find \(k\) centers \(c_{1},\ldots,c_{k}\) that minimize the sum of the \(p\)th powers of the distances of each point to its nearest center: \(\sum_{x\in X}\min_{i}d(x,c_{i})^{p}\). We use \(C=C_{1},\ldots,C_{k}\) to refer to the corresponding clusters (where each point is assigned to the nearest center).
We consider two variations of the \(k\)-clustering problem based on the domain that the centers are chosen from. We consider the setting where the centers must be chosen from \(X\), which we refer to as constrained \(k\)-clustering. We also consider the setting where the centers may be any points in \(\mathbb{R}^{\mathfrak{m}}\), which we refer to as unconstrained \(k\)-clustering. In the constrained setting, we assume that we are given a metric space \((X,d)\), while in the unconstrained setting, we assume that \(d\) is a general metric on \(\mathbb{R}^{\mathfrak{m}}\). For the \(k\)-means objective function, which corresponds to the unconstrained setting for \(p=2\), \(d\) is the Euclidean distance.
We use \(c^{\ast}_{1},\ldots c^{\ast}_{k}\) to refer to the centers that optimize our objective function, and use \(C^{\ast}=C^{\ast}_{1},\ldots C^{\ast}_{k}\) to refer to the corresponding clustering. We use \(OPT\) to refer to the optimal objective value.
In our work, we consider two related notions of stability. In the constrained setting we assume that the clustering instance is resilient to perturbations of the distances between the points—the optimal clustering remains unchanged for small multiplicative perturbations. We next formally define this property.
Definition 3.1.
Given a metric space \((X,d)\) and a parameter \(\alpha{\gt}1\), an \(\alpha\)-perturbation of \(d\) is a function \(d^{\prime}:X\times X\rightarrow\mathbb{R}_{\geq 0}\), such that for any pair of points \(x,y\in X\), we have \(d(x,y)\leq d^{\prime}(x,y)\leq\alpha d(x,y)\). We say that a clustering instance satisfies \(\alpha\)-perturbation stability for objective function \(\Phi\) if the unique optimal clustering for \((X,d)\) under \(\Phi\) is the same as the unique optimal clustering for \((X,d^{\prime})\) under \(\Phi\), where \(d^{\prime}\) is any \(\alpha\)-perturbation of \(d\).
In the unconstrained setting, perturbations of the distances between the points are not well-defined because \(d\) is a general metric on \(\mathbb{R}^{\mathfrak{m}}\) (rather than a finite metric defined on pairs of points from \(X\)). In this setting, we consider a related stability property regarding the proximity of the points to the cluster centers in the optimal clustering. We define this property below.
Definition 3.2.
We say that a clustering instance satisfies \(\alpha\)-center proximity for the \(k\)-clustering objective function if for any point \(x\in C^{\ast}_{i}\) and any other cluster \(C^{\ast}_{j\neq i}\), we have \(d(x,c^{\ast}_{j}){\gt}\alpha d(x,c^{\ast}_{i})\).
Note that \(\alpha\)-center proximity describes how “sure” each point is about its cluster assignment in the optimal clustering. This definition is similar to the notion of margin in supervised learning—for example, for a linear classifier the margin of a point is its distance to the separating hyperplane.
We can verify that \(\alpha\)-center proximity follows from \(\alpha\)-perturbation stability, which is formalized in Proposition 3.3. The proof of Proposition 3.3 may be found in Awasthi et al. [5]. Therefore it suffices to only consider \(\alpha\)-center proximity in our analysis in both the constrained and unconstrained settings.
Proposition 3.3
Suppose a clustering instance satisfies \(\alpha\)-perturbation stability for the \(k\)-clustering objective function. Then it must also satisfy \(\alpha\)-center proximity for the \(k\)-clustering objective function.
We next state another observation regarding \(\alpha\)-center proximity, which is used in our analysis. Its proof may be found in Awasthi et al. [5].
Proposition 3.4
Suppose a clustering instance satisfies \(\alpha\)-center proximity for the \(k\)-clustering objective function. Then for any pair of points \(x\in C^{\ast}_{i}\) and \(y\in C^{\ast}_{j\neq i}\), we must have \(d(x,y){\gt}(\alpha-1)d(x,c^{\ast}_{i})\).
In our analysis we use \(r_{i}\) to denote the radius of cluster \(C_{i}\), which is defined as the maximum distance between \(c_{i}\) and any other point in \(C_{i}\): \(r_{i}=\max_{x\in C_{i}}d(x,c_{i})\). Similarly, we use \(r^{\ast}_{i}\) to denote the radius of cluster \(C^{\ast}_{i}\). We use \(B(x,r)\) to denote the ball of radius \(r\) around \(x\): \(B(x,r)=\{y\in X:d(x,y)\leq r\}\).
We observe that in the optimal clustering many of the points in \(C^{\ast}_{i}\) may lie close to the center \(c^{\ast}_{i}\). We formalize this observation with the following definition.
Definition 3.5.
We say that a clustering \(C=C_{1},\ldots C_{k}\) satisfies \(\gamma\)-center density if there is a constant \(0{\lt}c{\lt}1\), and a constant \(0{\lt}\gamma{\lt}1\), such that for each \(C_{i}\in C\), we have \(|B(c_{i},\gamma r_{i})|\geq c|C_{i}|\). We say that a \(k\)-clustering instance satisfies \(\gamma\)-center density if the optimal clustering \(C^{\ast}\) satisfies this property.
Our algorithm computes a hierarchical clustering tree using minimax distance, which is defined as follows.
Definition 3.6.
For two sets of points \(S_{1},S_{2}\subset X\), we use \(d_{m}(S_{1},S_{2})\) to denote the minimax distance between \(S_{1}\) and \(S_{2}\), which is defined as \(d_{m}(S_{1},S_{2})=\min_{x\in S_{1}\cup S_{2}}\max_{y\in S_{1}\cup S_{2}}d(x,y)\).
In our analysis, we show that the hierarchical clustering computed by our algorithm is consistent with the optimal clustering \(C^{\ast}\). Our consistency property is defined as follows.
Definition 3.7.
Let \(T\) be a hierarchical clustering tree of a set of points \(S\subseteq X\), and let the clustering \(C\) be defined as \(C^{\ast}\) restricted to the points in \(S\): \(C_{i}=C^{\ast}_{i}\cap S\). We say that \(T\) is consistent with \(C^{\ast}\) if for each node \(N\) in \(T\), and each cluster \(C_{i}\in C\), we either have \(N\cap C_{i}=\emptyset\), \(N\subseteq C_{i}\), or \(C_{i}\subseteq N\).
Note that for each \(C^{\ast}_{i}\in C^{\ast}\), this property requires that all points from \(C^{\ast}_{i}\cap S\) are merged together before any of them is merged with points from \(C^{\ast}_{j\neq i}\).
In our work, we consider instances where \(C^{\ast}\) satisfies additional structural properties. In particular, we consider the strict separation and strict threshold separation properties of a clustering, which are defined as follows.
Definition 3.8.
We say that a clustering \(C\) satisfies strict separation if for all \(x,y\in C_{i}\) and \(z\in C_{j\neq i}\) we have \(d(x,y){\lt}d(x,z)\).
Definition 3.9.
We say that a clustering \(C\) satisfies strict threshold separation if there exists a threshold \(t\), such that for all \(x,y\in C_{i}\) we have \(d(x,y)\leq t\), and for all \(x\in C_{i}\) and \(y\in C_{j\neq i}\) we have \(d(x,y){\gt}t\).
Our objective function considers pairwise distances raised to some \(p\)th power. In our analysis, we use the following relaxed triangle inequality in this setting.
Proposition 3.10
Suppose we are given a metric \(d\) on \(\mathbb{R}^{\mathfrak{m}}\), and any \(p\geq 1\). For any three points \(x,y,z\in\mathbb{R}^{\mathfrak{m}}\), we must have \(d(x,y)^{p}\leq 2^{p-1}(d(x,z)^{p}+d(z,y)^{p})\).
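For completeness, here is a short derivation of this bound (ours, not part of the original text); it uses the triangle inequality together with the convexity of \(t\mapsto t^{p}\) for \(p\geq 1\).

```latex
% Sketch: relaxed triangle inequality for p >= 1.
\begin{align*}
d(x,y)^{p}
  &\leq \big(d(x,z)+d(z,y)\big)^{p}
     && \text{(triangle inequality; } t\mapsto t^{p}\text{ is nondecreasing)}\\
  &= 2^{p}\Big(\tfrac{d(x,z)+d(z,y)}{2}\Big)^{p}\\
  &\leq 2^{p}\cdot\tfrac{d(x,z)^{p}+d(z,y)^{p}}{2}
     && \text{(convexity of } t\mapsto t^{p}\text{ for } p\geq 1)\\
  &= 2^{p-1}\big(d(x,z)^{p}+d(z,y)^{p}\big).
\end{align*}
```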
4 Simple Algorithm for Strong Clustering Structure
In order to motivate the design of our large-scale algorithms, we observe that the difficulty of the problem depends on the structure of the optimal clustering. In particular, if the optimal clustering has very strong structure, such as strict threshold separation (see Definition 3.9), then we can recover it with a very simple and efficient algorithm. This observation is formalized in Proposition 4.1. Note that the run time of this algorithm enables it to easily scale to very large datasets.
Proposition 4.1.
Suppose a clustering \(C=C_{1},\ldots C_{k}\) satisfies strict threshold separation and each cluster in \(C\) has size \(\Omega(n/k)\). Then there exists an algorithm that outputs \(C\) with probability at least \(1-\delta\) and runs on a single machine in time \(O(k^{2}\textrm{log}\frac{k}{\delta}+kn)\). In a parallel computing environment with \(m\) machines, the algorithm has a run time of \(O(k^{2}\textrm{log}\frac{k}{\delta}+\frac{kn}{m})\).
Proof.
It suffices for the algorithm to sample \(t=O(k\textrm{log}\frac{k}{\delta})\) points from \(X\) uniformly at random. Let \(S\) refer to the sampled points. We can verify that with probability at least \(1-\delta\), \(S\) contains a point from each cluster in \(C\). When this is the case, it suffices to run \(k\) iterations of farthest-first traversal on the points in \(S\)—in each iteration we select the point in \(S\) that has the largest min-distance to the points we already selected. The first point in the traversal may be selected in any manner. This traversal may be implemented in \(O(kt)\) time. We then output the clustering corresponding to the \(k\) centers selected by the traversal (by assigning each point to the nearest center). This requires \(O(kn)\) time on a single machine, and \(O(kn/m)\) time in a parallel implementation with \(m\) machines. We can verify that if \(C\) satisfies strict threshold separation, the output clustering must be equivalent to \(C\). □
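To make the procedure concrete, below is a minimal single-machine sketch (the function name, the numpy dependency, and Euclidean distance are our assumptions, not specifics from the text).

```python
import numpy as np

def farthest_first_clustering(X, k, t, rng=None):
    """Sketch of the procedure from Proposition 4.1 (assumes t >= k and Euclidean d)."""
    rng = np.random.default_rng(rng)
    S = X[rng.choice(len(X), size=t, replace=False)]   # uniform sample of t points

    # Farthest-first traversal on the sample: greedily pick the point
    # with the largest distance to the centers chosen so far.
    centers = [S[0]]
    min_dist = np.linalg.norm(S - centers[0], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(min_dist))
        centers.append(S[idx])
        min_dist = np.minimum(min_dist, np.linalg.norm(S - S[idx], axis=1))
    centers = np.stack(centers)

    # Assign every point in X to its nearest center (parallelizable over X).
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return centers, np.argmin(dists, axis=1)
```

The final assignment of each point depends only on the \(k\) selected centers, which is why this last step parallelizes trivially across machines.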
Note that Proposition 4.1 applies to any clustering \(C\), and clearly also applies to the optimal \(k\)-median and/or \(k\)-means clustering if it satisfies these structural assumptions. However, such assumptions are very strong and are not likely to hold in practice. In the next sections, we present our algorithms that work under more realistic assumptions about the data, where we are still able to exploit the structure of the optimal clustering to approximate the optimal solution.
5 Algorithm for Constrained K-Clustering
Our algorithm first selects a subset of seed points \(S\subset X\) uniformly at random. We then compute a Voronoi decomposition of \(X\) with respect to the points in \(S\). For each \(s\in S\), let \(v(s)\) denote the points in its Voronoi cell, and let \(c(s)\) denote the cost of \(s\), which is defined as \(c(s)=\sum_{x\in v(s)}d(x,s)^{p}\). We compute \(|v(s)|\) and \(c(s)\) for each seed point \(s\in S\). This information is used to approximate the objective value of the clusterings considered by our algorithm.
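A minimal sequential sketch of this step is shown below (Euclidean distance and numpy are assumed; in the distributed implementation each machine would run the same computation on its shard of \(X\), and the per-seed counts and costs would simply be summed).

```python
import numpy as np

def voronoi_statistics(X, S, p):
    """Sketch: assign every point of X to its nearest seed in S, and return
    for each seed s the cell size |v(s)| and the cost c(s) = sum_{x in v(s)} d(x, s)^p."""
    dists = np.linalg.norm(X[:, None, :] - S[None, :, :], axis=2)      # shape (n, t)
    cell = np.argmin(dists, axis=1)                                    # nearest seed per point
    nearest = dists[np.arange(len(X)), cell]                           # d(x, s_x)

    sizes = np.bincount(cell, minlength=len(S))                        # |v(s)|
    costs = np.bincount(cell, weights=nearest ** p, minlength=len(S))  # c(s)
    return cell, sizes, costs
```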
Our algorithm then computes a hierarchical clustering of \(S\) in a bottom-up fashion using minimax distance (see Definition 3.6). In each iteration, two nodes with the smallest minimax distance are merged, until only one node is left. Note that if we start with \(t=|S|\) points, we will have \(2t-1\) nodes in total. We will use \(N_{1},N_{2},\ldots N_{2t-1}\) to refer to these nodes indexed in depth-first order.
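The following is a naive (not asymptotically optimal) sketch of this bottom-up construction under the minimax distance of Definition 3.6; the names are ours, and the \(O(t^{3})\) bound mentioned in the analysis presumably relies on maintaining pairwise node distances incrementally rather than recomputing them as done here.

```python
import numpy as np

def minimax_linkage_tree(S):
    """Sketch: bottom-up clustering of the seed points under the minimax
    distance of Definition 3.6; returns the sequence of merges (a binary tree)."""
    D = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)   # pairwise seed distances
    clusters = {i: [i] for i in range(len(S))}                  # active node id -> member indices
    merges, next_id = [], len(S)

    def d_minimax(a, b):
        members = clusters[a] + clusters[b]
        # min over candidate "centers" x of the max distance to the merged set
        return min(max(D[x, y] for y in members) for x in members)

    while len(clusters) > 1:
        keys = list(clusters)
        a, b = min(((u, v) for i, u in enumerate(keys) for v in keys[i + 1:]),
                   key=lambda uv: d_minimax(*uv))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, next_id))
        next_id += 1
    return merges
```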
We then apply dynamic programming to find the best \(k\)-pruning of the hierarchical clustering tree. Our dynamic programming only considers clusterings of the seed points, but approximates the objective value for the entire dataset using the information from the Voronoi cell of each seed. Figure 1 gives a high-level description of the components of our algorithm, and we follow with a more exact description and its analysis.
Fig. 1. A high-level description of the algorithm components.
We will use \(S_{i}\) to refer to the seed points in node \(N_{i}\). In our analysis we will also use \(P_{i}=\{x\in X:x\in v(s)\textrm{ for some }s\in S_{i}\}\) to refer to the points in the corresponding Voronoi cells. In order to define the table of solutions constructed by the dynamic programming, we will use cost\((i,j)\) to refer to the cost of clustering \(N_{i}\) using \(j\) clusters.
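Concretely, for an internal node \(N_{i}\) with children \(N_{i_{1}}\) and \(N_{i_{2}}\), the table can be filled bottom-up with the standard pruning recurrence sketched below (our notation; cost\((i,1)\) is the one-clustering cost of \(N_{i}\) estimated from the Voronoi-cell statistics).

```latex
% Pruning recurrence for j >= 2 (sketch).
cost(i,j) \;=\; \min_{1\leq j'\leq j-1}\Big(cost(i_{1},j') + cost(i_{2},\,j-j')\Big)
```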
The algorithm is described in more detail in Algorithm 1. For simplicity, we omit the details regarding the initialization of the table for leaf nodes and nodes with less than two children. In order to output the best clustering, it suffices to store the centers corresponding to the clusterings in the table of solutions (these details are also omitted).
Algorithm 1 Large Scale K-Clustering(\(X,d,p,k,\delta\))
• Let \(t=O(k\textrm{log}\frac{k}{\delta})\)
• Sample \(t\) points \(S\subset X\) uniformly at random
• Compute a Voronoi decomposition of \(X\) w.r.t. \(S\)
• For each \(s\in S\), let \(v(s)\) denote the points in its Voronoi cell
• For each \(s\in S\), let \(c(s)=\sum_{x\in v(s)}d(x,s)^{p}\)
• Let \(T\) be the hierarchical clustering tree of \(S\) computed using minimax distance
• Run Dynamic-Programming(\(T,k\))
• Let \(N_{r}\) be the root of \(T\)
• Output clustering corresponding to \(cost(r,k)\)
Algorithm 2 Dynamic-Programming(\(T,k\))
Let \(N_{1},N_{2},\ldots N_{2t-1}\) be a depth-first traversal of \(T\)
for \(i=1\) to \(2t-1\) do
for \(j=1\) to \(k\) do
Let \(cost(i,j)=\) Cluster\((N_{i},j)\)
end for
end for
Algorithm 3 Cluster\((N_{i},j)\)
if \(j=1\) then
Let \(S_{i}\) be the seed points in node \(N_{i}\)
We state the theorem regarding the performance of our algorithm in Theorem 5.1.
Theorem 5.1.
Let us suppose that a constrained \(k\)-clustering instance satisfies \(\alpha\)-perturbation stability and \(\gamma\)-center density for \(\alpha\geq 3+2\gamma\). Suppose each cluster in the optimal clustering has size \(\Omega(n/k)\). Then with probability at least \(1-\delta\), Algorithm 1 outputs a clustering with expected objective value at most \(2^{2p-1}\cdot(2^{p}+1)\cdot OPT\), and runs in time \(O(nt+t^{3})\), where \(t=O(k\textrm{log}\frac{k}{\delta})\) is the size of the sample used by the algorithm. In a parallel computing environment with \(m\) machines, Algorithm 1 has a run time of \(O(\frac{nt}{m}+t^{3})\).
Note that we are most interested in settings where \(p=1\) and/or \(p=2\). Observe that applying Theorem 5.1 for \(p=1\) gives us a 6-approximation to the \(k\)-median clustering problem, and applying it for \(p=2\) gives us a 40-approximation to the \(k\)-medoids clustering problem. These observations are formalized in Corollary 5.2 and Corollary 5.3.
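For reference, instantiating the bound \(2^{2p-1}\cdot(2^{p}+1)\) of Theorem 5.1 gives:

```latex
p=1:\quad 2^{1}\cdot(2^{1}+1)=2\cdot 3=6,
\qquad
p=2:\quad 2^{3}\cdot(2^{2}+1)=8\cdot 5=40.
```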
Corollary 5.2.
Let us suppose that a \(k\)-median clustering instance satisfies \(\alpha\)-perturbation stability and \(\gamma\)-center density for \(\alpha\geq 3+2\gamma\). Suppose each cluster in the optimal clustering has size \(\Omega(n/k)\). Then with probability at least \(1-\delta\), Algorithm 1 with \(p=1\) outputs a clustering with expected objective value at most \(6\cdot OPT\), and runs in time \(O(nt+t^{3})\), where \(t=O(k\textrm{log}\frac{k}{\delta})\) is the size of the sample used by the algorithm. In a parallel computing environment with \(m\) machines, Algorithm 1 has a run time of \(O(\frac{nt}{m}+t^{3})\).
Corollary 5.3.
Let us suppose that a \(k\)-medoids clustering instance satisfies \(\alpha\)-perturbation stability and \(\gamma\)-center density for \(\alpha\geq 3+2\gamma\). Suppose each cluster in the optimal clustering has size \(\Omega(n/k)\). Then with probability at least \(1-\delta\), Algorithm 1 with \(p=2\) outputs a clustering with expected objective value at most \(40\cdot OPT\), and runs in time \(O(nt+t^{3})\), where \(t=O(k\textrm{log}\frac{k}{\delta})\) is the size of the sample used by the algorithm. In a parallel computing environment with \(m\) machines, Algorithm 1 has a run time of \(O(\frac{nt}{m}+t^{3})\).
In order to prove Theorem 5.1, we next state and prove several observations regarding the structure of the clustering instance, and the performance of the individual components of our algorithm. Lemma 5.4 shows that given \(\alpha\)-center proximity, any pair of points \(x\in C^{\ast}_{i}\) and \(y\in C^{\ast}_{j}\) must be well-separated with respect to the radius of \(C^{\ast}_{i}\).
Lemma 5.4
Suppose that a \(k\)-clustering instance satisfies \(\alpha\)-center proximity for \(\alpha\geq 3+2\gamma\). For any pair of points \(x\in C^{\ast}_{i}\) and \(y\in C^{\ast}_{j\neq i}\), we must have \(d(x,y){\gt}(1+\gamma)r^{\ast}_{i}\).
Proof.
Let \(z\) be the point in \(C^{\ast}_{i}\) that is farthest from \(c^{\ast}_{i}\) (ties broken arbitrarily). In other words, we have \(d(z,c^{\ast}_{i})=r^{\ast}_{i}\). We show that if \(d(x,y)\leq(1+\gamma)r^{\ast}_{i}\), then \(z\) is too close to \(c^{\ast}_{j}\), violating our \(\alpha\)-center proximity assumption. To see this, observe that by the triangle inequality we have \(d(z,c^{\ast}_{j})\leq d(z,c^{\ast}_{i})+d(c^{\ast}_{i},x)+d(x,y)+d(y,c^{\ast}_{j})=r^{\ast}_{i}+d(c^{\ast}_{i},x)+d(x,y)+d(y,c^{\ast}_{j})\). Then by Proposition 3.4, given that \(\alpha\geq 3\), we must have \(d(x,c^{\ast}_{i}){\lt}d(x,y)/(\alpha-1)\leq d(x,y)/2\), and similarly \(d(y,c^{\ast}_{j}){\lt}d(x,y)/(\alpha-1)\leq d(x,y)/2\). Then if we have \(d(x,y)\leq(1+\gamma)r^{\ast}_{i}\), we must have \(d(z,c^{\ast}_{j}){\lt}r^{\ast}_{i}+2d(x,y)\leq(3+2\gamma)r^{\ast}_{i}\). On the other hand, \(\alpha\)-center proximity requires \(d(z,c^{\ast}_{j}){\gt}\alpha d(z,c^{\ast}_{i})=\alpha r^{\ast}_{i}\geq(3+2\gamma)r^{\ast}_{i}\), a contradiction. □
Continuing our analysis, let \(T_{S}\) be the hierarchical clustering tree of the seed points \(S\) that is computed by our algorithm. Let \(T_{X}\) be the corresponding hierarchical clustering tree of \(X\), which is given by the Voronoi cells of the seed points. In other words, each node \(N_{i}\) in \(T_{S}\) corresponds to a node \(N^{\prime}_{i}\) in \(T_{X}\), which contains the points \(P_{i}\). Note that we only use \(T_{X}\) in our analysis; our algorithm does not compute it. Lemma 5.5 shows that given our assumptions, both \(T_{S}\) and \(T_{X}\) must be consistent with \(C^{\ast}\), per our definition of consistency in Definition 3.7.
Lemma 5.5.
Suppose that a \(k\)-clustering instance satisfies \(\alpha\)-center proximity and \(\gamma\)-center density for \(\alpha\geq 3+2\gamma\). Suppose each cluster in the optimal clustering has size \(\Omega(n/k)\). Let \(T_{S}\) be the hierarchical clustering tree of the seed points \(S\) computed by our algorithm. Let \(T_{X}\) be the corresponding hierarchical clustering tree of \(X\). Then with probability at least \(1-\delta\), \(T_{S}\) and \(T_{X}\) are consistent with the optimal clustering \(C^{\ast}\).
Proof.
For each cluster \(C^{\ast}_{i}\), let us use \(s^{\ast}_{i}\) to refer to the closest seed point to \(c^{\ast}_{i}\) (ties broken arbitrarily). If our clustering instance satisfies \(\gamma\)-center density, and each cluster in the optimal clustering has size \(\Omega(n/k)\), we can verify that if we select \(O(k\textrm{log}\frac{k}{\delta})\) seed points, with probability at least \(1-\delta\), for each cluster \(C^{\ast}_{i}\), we will have \(d(s^{\ast}_{i},c^{\ast}_{i})\leq\gamma r^{\ast}_{i}\).
Whenever this is the case, we prove that \(T_{S}\) must be consistent with \(C^{\ast}\) by induction on the construction of \(T_{S}\). Per Definition 3.7, let the clustering \(C\) be defined as \(C^{\ast}\) restricted to the points in \(S\): \(C_{i}=C^{\ast}_{i}\cap S\). At the start, for each cluster \(A\) we clearly have \(A\subseteq C_{i}\) for some \(C_{i}\in C\). We now verify that if \(A\subset C_{i}\), and \(A^{\prime}\subseteq C_{j\neq i}\) or \(A^{\prime}\) is the union of two or more other clusters in \(C\), there must be another cluster \(B\subset C_{i}\) such that \(d_{m}(A,B){\lt}d_{m}(A,A^{\prime})\).
To verify this, let us consider the cluster assignment of \(s^{\ast}_{i}\). If \(s^{\ast}_{i}\in A\), then we may choose \(B\) to be any other cluster that is a strict subset of \(C_{i}\). Otherwise, we choose \(B\) to be the cluster that contains \(s^{\ast}_{i}\). Using the triangle inequality, we can verify that \(d_{m}(A,B)\leq(1+\gamma)r^{\ast}_{i}\). On the other hand, Lemma 5.4 shows that the distance between any pair of points \(x\in A\) and \(y\in A^{\prime}\) must be more than \((1+\gamma)r^{\ast}_{i}\), which implies that \(d_{m}(A,A^{\prime}){\gt}(1+\gamma)r^{\ast}_{i}\).
Finally, we prove that for any point \(x\in C^{\ast}_{i}\) in the Voronoi cell of a seed point \(s\in S\), we must also have \(s\in C^{\ast}_{i}\). To see this, consider that by Lemma 5.4, the distance between \(x\) and any seed point \(s^{\prime}\in C^{\ast}_{j\neq i}\) must satisfy \(d(x,s^{\prime}){\gt}(1+\gamma)r^{\ast}_{i}\). On the other hand, the distance between \(x\) and \(s^{\ast}_{i}\) must satisfy \(d(x,s^{\ast}_{i})\leq d(x,c^{\ast}_{i})+d(c^{\ast}_{i},s^{\ast}_{i})\leq r^{ \ast}_{i}+\gamma r^{\ast}_{i}=(1+\gamma)r^{\ast}_{i}\). Clearly, \(x\) will then be in the Voronoi cell of \(s^{\ast}_{i}\), or another seed point from \(C^{\ast}_{i}\). It follows that whenever \(T_{S}\) is consistent with \(C^{\ast}\), \(T_{X}\) must be consistent with \(C^{\ast}\) as well. □
We next state and prove a lemma regarding the selection of the seed points. Lemma 5.6 shows that in expectation they give a \(2^{p}\)-approximation with respect to the cost of the optimal centers.
Lemma 5.6.
Let \(C_{i}\) be a cluster with center \(c_{i}\), and let \(x\) be a point in \(C_{i}\) chosen uniformly at random. Let us define \(cost(y)=\sum_{z\in C_{i}}d(z,y)^{p}\). Then we have \(\mathbb{E}[cost(x)]\leq 2^{p}\cdot cost(c_{i})\).
Proof.
By the triangle inequality, we have \(cost(x)\leq\sum_{z\in C_{i}}2^{p-1}(d(z,c_{i})^{p}+d(c_{i},x)^{p})=2^{p-1} \cdot(\sum_{z\in C_{i}}d(z,c_{i})^{p}+|C_{i}|d(c_{i},x)^{p})=2^{p-1}\cdot(cost (c_{i})+|C_{i}|d(c_{i},x)^{p})\). Also, observe that for \(x\) chosen uniformly at random from \(C_{i}\), we have \(\mathbb{E}[d(c_{i},x)^{p}]=(1/|C_{i}|)\cdot\sum_{z\in C_{i}}d(z,c_{i})^{p}=(1/ |C_{i}|)\cdot cost(c_{i})\). Combining these two observations, by linearity of expectation we have \(\mathbb{E}[cost(x)]\leq 2^{p-1}(2\cdot cost(c_{i}))=2^{p}\cdot cost(c_{i})\). □
Finally, Lemma 5.7 and Lemma 5.8 bound the cost of a center estimated by Algorithm 4 (using only the seed points) with respect to its actual cost.
Lemma 5.7.
For node \(N_{i}\), let \(S_{i}\) refer to the seed points in \(N_{i}\), and let \(P_{i}=\{x\in X:x\in v(s)\textrm{ for some }s\in S_{i}\}\) refer to the points in the corresponding Voronoi cells. Suppose that \(P_{i}\) is exactly equivalent to some cluster \(C^{\ast}_{j}\in C^{\ast}\). For any point \(x\), let \(\tilde{cost}(x)\) be the output of One-Clustering-Cost(\(x,S_{i}\)), and let \(cost(x)=\sum_{y\in P_{i}}d(y,x)^{p}\). Then we must have \(\mathbb{E}[\tilde{cost}(x)]\leq(2^{p}+1)\cdot cost(x)\).
Proof.
For any point \(y\in P_{i}\), let \(s_{y}\) be the seed point \(s\in S_{i}\) such that \(y\) is in the Voronoi cell of \(s\). Observe that by construction we have \(\tilde{cost}(x)=\sum_{y\in P_{i}}(d(y,s_{y})^{p}+d(s_{y},x)^{p})\). By the triangle inequality, we have \(\tilde{cost}(x)=\sum_{y\in P_{i}}(d(y,s_{y})^{p}+d(s_{y},x)^{p})\leq\sum_{y\in P _{i}}(2^{p-1}(d(y,x)^{p}+d(x,s_{y})^{p})+d(s_{y},x)^{p})=2^{p-1}\cdot cost(x)+ 2^{p-1}\cdot\sum_{y\in P_{i}}d(x,s_{y})^{p}+\sum_{y\in P_{i}}d(s_{y},x)^{p}\). Also, observe that given that the seed points are chosen uniformly at random from \(P_{i}=C^{\ast}_{j}\), we have \(\mathbb{E}[d(x,s_{y})^{p}]=\mathbb{E}[d(s_{y},x)^{p}]=cost(x)/|P_{i}|\). It follows that \(\mathbb{E}[\tilde{cost}(x)]\leq 2^{p-1}\cdot cost(x)+2^{p-1}\cdot cost(x)+cost (x)=(2^{p}+1)\cdot cost(x)\). □
Lemma 5.8.
For node \(N_{i}\), let \(S_{i}\) refer to the seed points in \(N_{i}\), and let \(P_{i}=\{x\in X:x\in v(s)\textrm{ for some }s\in S_{i}\}\) refer to the points in the corresponding Voronoi cells. For any point \(x\), let \(\tilde{cost}(x)\) be the output of One-Clustering-Cost(\(x,S_{i}\)), and let \(cost(x)=\sum_{y\in P_{i}}d(y,x)^{p}\). Then we must have \(cost(x)\leq 2^{p-1}\cdot\tilde{cost}(x)\).
Proof.
For any point \(y\in P_{i}\), let \(s_{y}\) be the seed point \(s\in S_{i}\) such that \(y\) is in the Voronoi cell of \(s\). Observe that by construction we have \(\tilde{cost}(x)=\sum_{y\in P_{i}}(d(y,s_{y})^{p}+d(s_{y},x)^{p})\). By the triangle inequality, we then have \(cost(x)=\sum_{y\in P_{i}}d(y,x)^{p}\leq\sum_{y\in P_{i}}2^{p-1}(d(y,s_{y})^{p} +d(s_{y},x)^{p})=2^{p-1}\sum_{y\in P_{i}}(d(y,s_{y})^{p}+d(s_{y},x)^{p})=2^{p- 1}\cdot\tilde{cost}(x)\). □
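For concreteness, the estimate used in the two lemmas above can be rewritten as \(\tilde{cost}(x)=\sum_{s\in S_{i}}\big(c(s)+|v(s)|\,d(s,x)^{p}\big)\), so it is computable from the per-cell statistics alone. A minimal sketch follows (the function and parameter names are ours, not the paper's Algorithm 4).

```python
def one_clustering_cost(x, seed_points, sizes, costs, d, p):
    """Sketch: estimate of sum_{y in P_i} d(y, x)^p using only per-cell statistics,
    i.e., the sum over seeds s in S_i of c(s) + |v(s)| * d(s, x)^p."""
    return sum(costs[j] + sizes[j] * d(seed_points[j], x) ** p
               for j in range(len(seed_points)))
```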
Proof of Theorem 5.1.
By Proposition 3.3, if a \(k\)-clustering instance satisfies \(\alpha\)-perturbation stability, then it must also satisfy \(\alpha\)-center proximity. Then by Lemma 5.5, there must be a pruning of \(T_{X}\) that is equivalent to \(C^{\ast}\). Consider any \(C^{\ast}_{i}\in C^{\ast}\). For any point \(x\in C^{\ast}_{i}\) define \(cost(x)=\sum_{y\in C^{\ast}_{i}}d(y,x)^{p}\). By Lemma 5.6, for each seed point \(s\in C^{\ast}_{i}\), we have \(\mathbb{E}[cost(s)]\leq 2^{p}\cdot cost(c^{\ast}_{i})\). Then by Lemma 5.7, we have \(\mathbb{E}[\tilde{cost}(s)]\leq(2^{p}+1)\cdot cost(s)\), where \(\tilde{cost}(s)\) is the cost of \(s\) that is estimated by our algorithm. Combining these observations, we see that in expectation we have \(\tilde{cost}(s)\leq(2^{p}+1)\cdot 2^{p}\cdot cost(c^{\ast}_{i})\). Therefore in expectation the objective value for \(C^{\ast}_{i}\) that is estimated by our algorithm must be within \((2^{p}+1)\cdot 2^{p}\cdot cost(c^{\ast}_{i})\). Considering that this statement holds for each cluster in \(C^{\ast}\), in expectation the estimated objective value of this pruning of the hierarchical clustering tree (which is equivalent to \(C^{\ast}\)) must be at most \((2^{p}+1)\cdot 2^{p}\cdot OPT\).
The dynamic programming solution will therefore return a clustering with expected estimated objective value at most \((2^{p}+1)\cdot 2^{p}\cdot OPT\). However, note that the actual objective value may be larger than the estimated objective value. But by Lemma 5.8, the actual objective value of any single cluster may be at most \(2^{p-1}\) times larger than the estimated objective value. Therefore, in expectation our dynamic programming solution must return a clustering with actual objective value at most \(2^{p-1}\cdot(2^{p}+1)\cdot 2^{p}\cdot OPT\).
To verify the run time, the Voronoi decomposition can be computed in \(O(nt)\) time on a single machine, given that it requires \(t\) comparisons for each of the \(n\) points. In a parallel computing environment, we may partition the \(n\) points into \(m\) equal-sized sets, assign each set to a different machine, and then compute the Voronoi decomposition for each set in parallel, reducing the run time by a factor of \(m\). In particular, in a parallel implementation, the \(t\) seed points are passed to each machine; each machine then computes the quantities \(|v(s)|\) and \(c(s)\) for the points assigned to it, which are then added together to compute these quantities for the entire dataset.
It takes \(O(t^{3})\) time to compute the hierarchical clustering tree of the \(t\) seed points. The dynamic programming table is then constructed iteratively for each of the \(2t-1\) nodes in the tree. For each node, it takes \(O(t^{2})\) time to compute the one-clustering cost, and \(O(k^{2})\) to compute the \(j\)-clustering cost for \(2\leq j\leq k\). Given that we choose a setting of \(t\) that is larger than \(k\), the dynamic programming table is then computed in \(O(t^{3})\) time.
In order to output the actual clustering, it suffices to store the centers corresponding to the clusterings in the table of solutions. Assigning each data point to the nearest center takes asymptotically less time than the Voronoi decomposition—\(O(nk)\) for the in-memory implementation, and \(O(nk/m)\) for the parallel implementation. □
6 Algorithm for Unconstrained K-Clustering
Observe that the analysis in the previous section also applies to unconstrained \(k\)-clustering instances that satisfy \(\alpha\)-center proximity and \(\gamma\)-center density—it does not matter whether the centers are actual data points or any points in \(\mathbb{R}^{\mathfrak{m}}\). This observation is formalized in Corollary 6.1.
Corollary 6.1.
Let us suppose that an unconstrained \(k\)-clustering instance satisfies \(\alpha\)-center proximity and \(\gamma\)-center density for \(\alpha\geq 3+2\gamma\). Suppose each cluster in the optimal clustering has size \(\Omega(n/k)\). Then with probability at least \(1-\delta\), Algorithm 1 outputs a clustering with expected objective value at most \(2^{2p-1}\cdot(2^{p}+1)\cdot OPT\), and runs in time \(O(nt+t^{3})\), where \(t=O(k\textrm{log}\frac{k}{\delta})\) is the size of the sample used by the algorithm. In a parallel computing environment with \(m\) machines, Algorithm 1 has a run time of \(O(\frac{nt}{m}+t^{3})\).
Our main motivation for studying the unconstrained setting is to design a large-scale algorithm for the \(k\)-means objective function, which is one of the most widely used clustering objectives in practice. We first observe that applying Corollary 6.1 for \(p=2\) already gives us a 40-approximation to the \(k\)-means objective. However, it is natural to ask whether we can further improve this rather large approximation ratio. Reviewing the design of Algorithm 1, we observe that for the \(k\)-means objective indeed it is possible to improve how we compute the cluster centers. These improvements enable us to compute the actual means of the optimal clusters \(C^{\ast}_{1},\ldots C^{\ast}_{k}\) during the execution of the algorithm, further improving the approximation ratio by a factor of 4. We next describe this algorithm in more detail.
As before, our large-scale algorithm for \(k\)-means first selects a subset of seed points \(S\subset X\) uniformly at random. We then compute \(|v(s)|\) and \(c(s)\) for each seed point \(s\in S\). For the \(k\)-means objective now we additionally compute the sum of the points in each Voronoi cell, which we denote by \(\bar{v}(s)\). In other words, we have \(\bar{v}(s)=\sum_{x\in v(s)}x\). Note that it is still possible to compute this quantity in parallel. We then use this information to construct the mean of each cluster when computing the 1-clustering cost. The algorithm is described in more detail in Algorithm 5.
Algorithm 5 Large Scale K-Means(\(X,d,k,\delta\))
• Let \(t=O(k\textrm{log}\frac{k}{\delta})\)
• Sample \(t\) points \(S\subset X\) uniformly at random
• Compute a Voronoi decomposition of \(X\) w.r.t. \(S\)
• For each \(s\in S\), let \(v(s)\) denote the points in its Voronoi cell
• For each \(s\in S\), let \(c(s)=\sum_{x\in v(s)}d(x,s)^{2}\)
• For each \(s\in S\), let \(\bar{v}(s)=\sum_{x\in v(s)}x\)
• Let \(T\) be the hierarchical clustering tree of \(S\) computed using minimax distance
• Run Dynamic-Programming(\(T,k\))
• Let \(N_{r}\) be the root of \(T\)
• Output clustering corresponding to \(cost(r,k)\)
Algorithm 6 Dynamic-Programming(\(T,k\))
Let \(N_{1},N_{2},\ldots N_{2t-1}\) be a depth-first traversal of \(T\)
for \(i=1\) to \(2t-1\) do
for \(j=1\) to \(k\) do
Let \(cost(i,j)=\) Cluster\((N_{i},j)\)
end for
end for
Algorithm 7 Cluster\((N_{i},j)\)
if \(j=1\) then
Let \(S_{i}\) be the seed points in node \(N_{i}\)
Let \(n_{i}=\sum_{s\in S_{i}}|v(s)|\)
Let \(x=1/n_{i}\cdot\sum_{s\in S_{i}}\bar{v}(s)\)
return One-Clustering-Cost(\(x,S_{i}\))
end if
Let \(i_{1}\) be the index of the left child of \(N_{i}\)
Let \(i_{2}\) be the index of the right child of \(N_{i}\)
Theorem 6.2.
Suppose that a \(k\)-means clustering instance satisfies \(\alpha\)-center proximity and \(\gamma\)-center density for \(\alpha\geq 3+2\gamma\). Suppose each cluster in the optimal clustering has size \(\Omega(n/k)\). Then with probability at least \(1-\delta\), Algorithm 5 outputs a clustering with expected objective value at most \(10\cdot OPT\), and runs in time \(O(nt+t^{3})\), where \(t=O(k\textrm{log}\frac{k}{\delta})\) is the size of the sample used by the algorithm. In a parallel computing environment with \(m\) machines, Algorithm 5 has a run time of \(O(\frac{nt}{m}+t^{3})\).
Proof.
As in the analysis in the previous section, let \(T_{S}\) be the hierarchical clustering tree of the seed points \(S\) that is computed by our algorithm. Let \(T_{X}\) be the corresponding hierarchical clustering tree of \(X\), which is given by the Voronoi cells of the seed points.
By Lemma 5.5, there must be a pruning of \(T_{X}\) that is equivalent to \(C^{\ast}\). As before, consider any \(C^{\ast}_{i}\in C^{\ast}\). Observe that by construction, now our algorithm is able to compute the actual mean of \(C^{\ast}_{i}\). For any point \(x\), let \(cost(x)=\sum_{y\in C^{\ast}_{i}}d(y,x)^{p}\), where we now fix \(p=2\). By Lemma 5.7, we have \(\mathbb{E}[\tilde{cost}(x)]\leq(2^{p}+1)\cdot cost(x)=5\cdot cost(x)\), where \(\tilde{cost}(x)\) is the cost of \(x\) that is estimated by our algorithm. Therefore in expectation the objective value for \(C^{\ast}_{i}\) that is estimated by our algorithm must be within \(5\cdot cost(c^{\ast}_{i})\). Considering that this statement holds for each cluster in \(C^{\ast}\), in expectation the estimated objective value of this pruning of the hierarchical clustering tree (which is equivalent to \(C^{\ast}\)) must be at most \(5\cdot OPT\).
As before, the dynamic programming solution will therefore return a clustering with expected estimated objective value at most \(5\cdot OPT\). The actual objective value may be larger than the estimated objective value. But by Lemma 5.8, the actual objective value of any single cluster may be at most \(2^{p-1}=2\) times larger than the estimated objective value. Therefore in expectation our dynamic programming solution must return a clustering with actual objective value at most \(2\cdot 5\cdot OPT\).
The asymptotic run time analysis remains unchanged. Note that as before, in a parallel implementation of the algorithm, the \(t\) seed points are passed to each machine; each machine then computes the quantities \(|v(s)|\), \(c(s)\), and \(\bar{v}(s)\) for the points assigned to it, which are then added together to compute these quantities for the entire dataset. □
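As an illustration of the per-machine step just described, the sketch below computes the local statistics \(|v(s)|\), \(c(s)\), and \(\bar{v}(s)\) for one shard of the data; the reduce step is plain element-wise addition (numpy and Euclidean distance are assumed, and the names are ours).

```python
import numpy as np

def shard_statistics(X_shard, S):
    """Sketch of the map step: Voronoi-assign one shard of X to the seeds and return
    the local |v(s)|, c(s) (squared distances, p = 2), and vector sums v_bar(s)."""
    dists = np.linalg.norm(X_shard[:, None, :] - S[None, :, :], axis=2)
    cell = np.argmin(dists, axis=1)
    nearest = dists[np.arange(len(X_shard)), cell]
    t, dim = len(S), X_shard.shape[1]

    sizes = np.bincount(cell, minlength=t)                        # |v(s)|
    costs = np.bincount(cell, weights=nearest ** 2, minlength=t)  # c(s)
    sums = np.zeros((t, dim))
    np.add.at(sums, cell, X_shard)                                # v_bar(s)
    return sizes, costs, sums

def combine(shard_stats):
    """Sketch of the reduce step: add the per-shard statistics together."""
    sizes, costs, sums = zip(*shard_stats)
    return sum(sizes), sum(costs), sum(sums)
```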
7 Experiments
We evaluate the performance of our \(k\)-median and \(k\)-means algorithms on the Covertype dataset from the UCI Machine Learning Repository [21], as well as the Quantum Physics and Protein Homology datasets from KDD Cup [28]. Each dataset contains \(n\) points in \(\mathbb{R}^{d}\). Table 1 gives the values of \(n\) and \(d\) for each dataset.
Table 1. The number of points (\(n\)) and the number of dimensions (\(d\)) for each dataset.

Dataset             Points (\(n\))    Dimensions (\(d\))
Covertype           581,012           54
Quantum Physics     50,000            78
Protein Homology    145,751           74
For each dataset, we normalize the data in each dimension such that all entries are in [0,1] (where 0 corresponds to the minimum value and 1 corresponds to the maximum value). For the Quantum Physics dataset we also discard the eight dimensions that have missing data (see KDD Cup [28]). The distances between the points are computed using Euclidean distance. For each objective function we compare our approach with several alternative algorithms. In order to further motivate our \(\gamma\)-center density property (see Definition 3.5), we also evaluate this property on the clusterings computed in our study.
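The per-dimension normalization can be done as in the short sketch below (numpy assumed; dimensions with a single value are left at 0).

```python
import numpy as np

def minmax_normalize(X):
    """Rescale each dimension of X to [0, 1]; constant dimensions map to 0."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span
```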
7.1 K-Median Experiments
For the \(k\)-median objective function we evaluate the performance of our Algorithm 1 with \(p=1\). Recall that this setting is equivalent to optimizing the \(k\)-median objective. Here we term this algorithm Large-Scale K-Median (LSKM). We compare performance with two alternative \(k\)-median algorithms. The first is simply Lloyd’s algorithm run on a sample of points chosen uniformly at random, which we term Lloyd’s. The initial centers for Lloyd’s are chosen uniformly at random. The second algorithm is the centralized \(k\)-median coreset construction from Feldman and Langberg [23]. Their construction takes a \(k\)-median solution as input and computes the coreset by sampling points based on their contribution to the objective in this solution. We compute the initial \(k\)-median solution using Lloyd’s. We then use the algorithm of Feldman and Langberg [23] to compute the coreset. Finally, we compute the clustering of the coreset using the algorithm of Arya et al. [3], swapping one center per stage. We refer to the entire procedure as Coreset.
We run each algorithm using a sample of points and/or coreset of the same size and record the objective value of the output clustering (specified by a set of centers). Note that the objective value is computed with respect to the entire dataset. We rerun each algorithm 10 times and report the average objective value. We also vary the number of clusters (\(k\)), and report the results for \(k=50\) and \(k=100\). Figure 2 displays the experimental results for the Covertype dataset. Figures 3 and 4 display the results for the Quantum Physics and Protein Homology datasets, respectively.
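For reference, the evaluation protocol can be summarized as follows: given the centers returned by an algorithm, the \(k\)-median objective is computed over all \(n\) points and averaged over the 10 reruns. In the sketch below, `run_algorithm` is a hypothetical placeholder standing in for LSKM, Lloyd's, or Coreset.

```python
import numpy as np

def kmedian_objective(X, centers):
    """Objective over the entire dataset: sum of Euclidean distances from each
    point to its nearest center (p = 1). Distances are computed one center at a
    time to keep memory usage linear in n."""
    best = np.full(len(X), np.inf)
    for c in centers:
        best = np.minimum(best, np.linalg.norm(X - c, axis=1))
    return float(best.sum())

def average_objective(X, run_algorithm, k, sample_size, reps=10):
    """Evaluation protocol (sketch): rerun the algorithm `reps` times and report
    the mean full-dataset objective; `run_algorithm` must return k centers."""
    values = [kmedian_objective(X, run_algorithm(X, k, sample_size))
              for _ in range(reps)]
    return float(np.mean(values))
```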
Fig. 2. Performance comparison for the \(k\)-median objective function for the Covertype dataset. Left: the results for \(k=50\). Right: the results for \(k=100\). The algorithms compared are Lloyd’s run on a sample of points chosen uniformly at random (Lloyd’s), the \(k\)-median coreset construction of Feldman and Langberg [23] (Coreset), and Algorithm 1 with \(p=1\), termed LSKM.
Fig. 3. Performance comparison for the \(k\)-median objective function for the Quantum Physics dataset. Left: the results for \(k=50\). Right: the results for \(k=100\). The algorithms compared are Lloyd’s run on a sample of points chosen uniformly at random (Lloyd’s), the \(k\)-median coreset construction of Feldman and Langberg [23] (Coreset), and Algorithm 1 with \(p=1\), termed LSKM.
Fig. 4. Performance comparison for the \(k\)-median objective function for the Protein Homology dataset. Left: the results for \(k=50\). Right: the results for \(k=100\). The algorithms compared are Lloyd’s run on a sample of points chosen uniformly at random (Lloyd’s), the \(k\)-median coreset construction of Feldman and Langberg [23] (Coreset), and Algorithm 1 with \(p=1\), termed LSKM.
In addition to evaluating the clustering quality, we also test our \(\gamma\)-center density property on the clusterings in our study (see Definition 3.5). In particular, we take the best \(k\)-median clustering computed by our algorithm (the one with the lowest objective value), and consider the distances of the points in each cluster to the cluster center. For \(k=50\) we find that \(\gamma\)-center density holds for \(c=0.1\) and \(\gamma=0.29,0.59,0.62\) for the Covertype, Quantum Physics, and Protein Homology datasets, respectively. For \(k=100\) we find that the property holds for \(c=0.1\) and \(\gamma=0.32,0.70,0.75\). These results are summarized in Table 2. Assuming that the optimal clustering \(C^{\ast}\) is structurally similar to our best clustering (and thus satisfies \(\gamma\)-center density for a similar setting of \(\gamma\)), these moderate settings of \(\gamma\) improve the bound on \(\alpha\) in our theoretic analysis (see Corollary 5.2).
Table 2. Testing the \(\gamma\)-Center Density Property: The Observed Setting of \(\gamma\) for the Best \(k\)-Median Clustering Computed by Our Algorithm for Each Dataset

  Dataset             \(\gamma\) for \(\boldsymbol{k=50}\)    \(\gamma\) for \(\boldsymbol{k=100}\)
  Covertype           0.29                                    0.32
  Quantum Physics     0.59                                    0.70
  Protein Homology    0.62                                    0.75

Results shown for \(k=50\) and \(k=100\) clusters.
7.2 K-Means Experiments
For the \(k\)-means objective function we evaluate the performance of our Algorithm 5. We compare performance with two alternative \(k\)-means algorithms. The first is Lloyd’s algorithm run on a sample of points chosen uniformly at random, which we term Lloyd’s. The initial centers for Lloyd’s are chosen uniformly at random. The other algorithm is the \(k\)-means coreset construction from Bachem et al. [7]. We compute the coreset as described in Bachem et al. [7] and then cluster these points using \(k\)-means\(++\). We refer to the entire procedure as Coreset.
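A minimal sketch of the uniform-sample Lloyd's baseline is given below; the iteration count, the uniformly random initialization, and the handling of empty clusters are our assumptions and may differ from the exact implementation used in the experiments.

```python
import numpy as np

def lloyds_on_sample(X, k, sample_size, iters=25, seed=None):
    """Uniform-sample Lloyd's baseline (a sketch): draw a uniform sample, pick k
    initial centers uniformly at random from it, and run standard Lloyd's
    iterations on the sample only."""
    rng = np.random.default_rng(seed)
    S = X[rng.choice(len(X), size=sample_size, replace=False)]
    centers = S[rng.choice(len(S), size=k, replace=False)].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(S[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                  # assignment step
        for j in range(k):
            members = S[labels == j]
            if len(members) > 0:                   # keep old center if cluster is empty
                centers[j] = members.mean(axis=0)  # update step: cluster mean
    return centers
```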
As before, we run each algorithm using a sample of points and/or coreset of the same size and record the objective value of the output clustering (specified by a set of centers). The objective value is computed with respect to the entire dataset. We rerun each algorithm 10 times and report the average objective value as well as its 95% confidence interval. We also vary the number of clusters, and report the results for \(k=50\) and \(k=100\). Figure 5 displays the experimental results for the Covertype dataset. Figures 6 and 7 display the results for the Quantum Physics and Protein Homology datasets, respectively.
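For reference, the reported intervals can be computed from the 10 per-run objective values as a mean plus or minus a confidence half-width; the use of a Student-\(t\) critical value (rather than a normal approximation) is an assumption for this sketch.

```python
import numpy as np
from scipy import stats

def mean_and_ci95(values):
    """Mean and 95% confidence-interval half-width over repeated runs (here, 10
    reruns), using the Student-t critical value with n - 1 degrees of freedom."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    sem = values.std(ddof=1) / np.sqrt(len(values))
    half_width = stats.t.ppf(0.975, df=len(values) - 1) * sem
    return mean, half_width
```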
Fig. 5. Performance comparison for the \(k\)-means objective function for the Covertype dataset. Left: the results for \(k=50\). Right: the results for \(k=100\). The algorithms compared are Lloyd’s run on a sample of points chosen uniformly at random (Lloyd’s), the \(k\)-means coreset construction of Bachem et al. [7] (Coreset), and Algorithm 5, termed LSKM. The error bars represent 95% confidence intervals.
Fig. 6. Performance comparison for the \(k\)-means objective function for the Quantum Physics dataset. Left: the results for \(k=50\). Right: the results for \(k=100\). The algorithms compared are Lloyd’s run on a sample of points chosen uniformly at random (Lloyd’s), the \(k\)-means coreset construction of Bachem et al. [7] (Coreset), and Algorithm 5, termed LSKM. The error bars represent 95% confidence intervals.
Fig. 7. Performance comparison for the \(k\)-means objective function for the Protein Homology dataset. Left: the results for \(k=50\). Right: the results for \(k=100\). The algorithms compared are Lloyd’s run on a sample of points chosen uniformly at random (Lloyd’s), the \(k\)-means coreset construction of Bachem et al. [7] (Coreset), and Algorithm 5, termed LSKM. The error bars represent 95% confidence intervals.
We also test our \(\gamma\)-center density property in this context. As before, we take the best \(k\)-means clustering computed by our algorithm (the one with the lowest objective value), and consider the distances of the points in each cluster to the cluster center. For \(k=50\) we find that \(\gamma\)-center density holds for \(c=0.1\) and \(\gamma=0.58,0.47,0.58\) for the Covertype, Quantum Physics, and Protein Homology datasets, respectively. For \(k=100\) we find that the property holds for \(c=0.1\) and \(\gamma=0.63,0.61,0.68\). These results are summarized in Table 3. As before, these moderate settings of \(\gamma\) give us a fairly good bound on \(\alpha\) in our theoretic analysis (see Theorem 6.2).
Table 3. Testing the \(\gamma\)-Center Density Property: The Observed Setting of \(\gamma\) for the Best \(k\)-Means Clustering Computed by Our Algorithm for Each Dataset

  Dataset             \(\gamma\) for \(\boldsymbol{k=50}\)    \(\gamma\) for \(\boldsymbol{k=100}\)
  Covertype           0.58                                    0.63
  Quantum Physics     0.47                                    0.61
  Protein Homology    0.58                                    0.68

Results shown for \(k=50\) and \(k=100\) clusters.
7.3 Algorithm Runtime
In this section we investigate the runtime of our algorithms and discuss how well they scale to larger datasets. We focus on the largest dataset in our study (Covertype) and on our \(k\)-means algorithm (Algorithm 5), since Algorithm 5 and Algorithm 1 have the same asymptotic runtime.
Recall that our algorithm has a component that may be implemented in parallel (computing the Voronoi decomposition with respect to the sampled points), and another component that must be implemented in memory on a single machine (computing the hierarchical clustering tree and the corresponding dynamic programming table of the clustering solutions).
We first discuss the runtime of the in-memory component; Figure 8 displays the runtime in seconds as a function of the sample size. We see that the actual runtime is consistent with our asymptotic analysis, which shows that this part of the algorithm takes \(O(t^{3})\) time, where \(t\) is the number of sampled points. We run the code on an Intel Xeon 2.20 GHz CPU; we suspect the runtime may be improved by a factor of 4–10 on a more powerful machine.
Fig. 8. The in-memory runtime of Algorithm 5 on the Covertype dataset. Left: the runtime for \(k=50\). Right: the runtime for \(k=100\).
Further recall that in our theoretic analysis we set \(t=O(k\textrm{log}\frac{k}{\delta})\), where \(\delta\) is an approximation parameter (see Theorem 6.2). However, our experiments show that simply setting \(t=4\cdot k\) already works quite well. Still, because the in-memory component takes \(O(t^{3})\) time, in practice we may not be able to set \(k\) much larger than \(10^{4}\).
We next investigate the parallel part of the algorithm (the Voronoi decomposition). We start by recording the runtime on a single machine (the same Intel Xeon 2.20 GHz CPU). Figure 9 displays the runtime in seconds as a function of the sample size. As expected, the runtime is linear in the sample size. Note that here the runtime depends only on the dataset size and the number of sampled points; therefore it is the same for \(k=50\) and \(k=100\) clusters.
Fig. 9. The Voronoi decomposition runtime for the Covertype dataset on a single machine.
We then observe that on a parallel computing cluster with \(m\) machines, the dataset size may be scaled by a factor of \(m\) while achieving roughly the same runtime (depending on the exact processor speeds of the machines involved). In addition, there is a constant overhead for automatically managing the parallel computation (bringing up the worker jobs, monitoring them, and then bringing them down), which may be around 5–10 minutes on a modern computing cluster. Given that the Covertype dataset has size on the order of \(10^{5}\), and \(m\) may be set to \(10^{3}\) or \(10^{4}\), it follows that a parallel Voronoi decomposition can still run in minutes for a dataset of size \(10^{10}\) (provided the sample size does not get much larger).
Finally, we note that the baselines in our experiments also have very efficient parallel implementations. In particular, for both coreset constructions in our experiments, it is possible to compute the information needed for sampling the points in the coreset in parallel. Our Lloyd’s heuristic also has a trivial parallel implementation. The advantages of our approach are then in the combination of clustering quality and scalability; our algorithms are not necessarily faster than the baselines in the experiments.
8 Discussion
We design parallel \(k\)-clustering algorithms that easily scale to very large data sets. We prove that for stable clustering instances our algorithms are still able to compute clusterings with low approximation ratios for the \(k\)-median, \(k\)-medoids, and \(k\)-means objective functions. Our experiments show that our algorithms work well in practice—we compute much better \(k\)-median and \(k\)-means clusterings than alternative methods using samples of the same size.
The work presented here may be extended in several directions. It may be possible to further relax the stability assumptions on the data. In particular, for \(\alpha\)-perturbation stability and/or \(\alpha\)-center proximity we may focus on improving the bound on \(\alpha\). A related direction is to consider a relaxed setting where only some of the points satisfy these properties (this setting was also considered in Balcan and Liang [12]). The general algorithmic approach used in this work, which solves an in-memory problem using a limited amount of information computed in parallel from a much larger dataset, may be of independent interest as well. It may be relevant for other tasks in unsupervised and supervised learning.
References
[1] Nir Ailon, Ragesh Jaiswal, and Claire Monteleoni. 2009. Streaming K-Means Approximation. In NIPS, 10–18.
Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. 2001. Local Search Heuristics for K-Median and Facility Location Problems. In STOC, 21–29.
Pranjal Awasthi, Avrim Blum, and Or Sheffet. 2012. Center-Based Clustering under Perturbation Stability. Inform. Process. Lett. 112, 1–2 (2012), 49–54.
Moses Charikar, Sudipto Guha, Eva Tardos, and David B. Shmoys. 2002. A Constant-Factor Approximation Algorithm for the K-Median Problem. J. Comput. System Sci. 65, 1 (2002), 129–149.
Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, and Chris Schwiegelshohn. 2022. Towards Optimal Lower Bounds for K-Median and K-Means Coresets. In STOC, 1038–1051.
Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn, and Ali Sheikh-Omar. 2022. Improved Coresets for Euclidean K-Means. In NIPS, 2679–2694.
Vincent Cohen-Addad, Silvio Lattanzi, Ashkan Norouzi-Fard, Christian Sohler, and Ola Svensson. 2021. Parallel and Efficient Hierarchical K-Median Clustering. In NIPS, 20333–20345.
Lingxiao Huang, Jian Li, and Xuan Wu. 2024. On Optimal Coreset Construction for Euclidean \((k,z)\)-Clustering. arxiv:2211.11923. Retrieved from https://arxiv.org/abs/2211.11923
Kamal Jain and Vijay V. Vazirani. 2001. Approximation Algorithms for Metric Facility Location and K-Median Problems Using the Primal-Dual Schema and Lagrangian Relaxation. J. ACM 48, 2 (2001), 274–296.
Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. 2004. A Local Search Approximation Algorithm for K-Means Clustering. Comput. Geometry 28, 2–3 (2004), 89–112.
Mario Lucic, Matthew Faulkner, Andreas Krause, and Dan Feldman. 2018. Training Gaussian Mixture Models at Scale via Coresets. J. Machine Learning Res. 18, 160 (2018), 1–25.
Gustavo Malkomes, Matt J. Kusner, Wenlin Chen, Kilian Q. Weinberger, and Benjamin Moseley. 2015. Fast Distributed K-Center Clustering with Outliers on Massive Data. In NIPS, 1063–1071.
Adam Meyerson, Liadan O’Callaghan, and Serge Plotkin. 2004. A K-Median Algorithm with Running Time Independent of Data Size. Machine Learning 56, 1 (2004), 61–87.
Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Marc Proesmans, and Luc Van Gool. 2020. SCAN: Learning to Classify Images Without Labels. In ECCV, 268–285.