Abstract
The min-sum k-clustering problem is to partition a metric space \((P,d)\) into k clusters \(C_1, \ldots, C_k \subseteq P\) such that \(\sum_{i=1}^k \sum_{p,q \in C_i} d(p,q)\) is minimized. We show the first efficient construction of a coreset for this problem. Our coreset construction is based on a new adaptive sampling algorithm. Using our coresets we obtain three main algorithmic results.
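As a small illustration of the objective defined above, the following sketch evaluates the min-sum cost of a given partition. It is not the paper's algorithm, only the cost function. The names `min_sum_cost` and `line_metric` are hypothetical, and the sketch assumes the inner sum runs over unordered pairs; summing over ordered pairs would double every value without changing which partition is optimal.

```python
from itertools import combinations


def min_sum_cost(clusters, d):
    """Sum of d(p, q) over all unordered pairs {p, q} inside each cluster.

    Assumption: pairs are unordered; an ordered-pair convention doubles the cost.
    """
    return sum(d(p, q) for cluster in clusters for p, q in combinations(cluster, 2))


def line_metric(p, q):
    """Toy metric for the example: points on the real line."""
    return abs(p - q)


# Two clusters on the line: {0, 1, 2} and {10, 11}.
example_clusters = [[0, 1, 2], [10, 11]]
example_cost = min_sum_cost(example_clusters, line_metric)
print(example_cost)
```

Evaluating the cost this way takes time quadratic in the cluster sizes, which is the kind of overhead a small-space representation is meant to avoid.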
The first result is a sublinear time (4 + ε)-approximation algorithm for the min-sum k-clustering problem in metric spaces. The running time of this algorithm is \(\widetilde{O}(n)\) for any constant k and ε, and it is \(o(n^2)\) for all \(k = o(\log n/\log\log n)\). Since the description size of the input is \(\Theta(n^2)\), this is sublinear in the input size.
Our second result is the first pass-efficient data streaming algorithm for min-sum k-clustering in the distance oracle model, i.e., an algorithm that uses \({\mathit{poly}}(\log n, k)\) space and makes 2 passes over the input point set arriving as a data stream.
Our third result is a sublinear-time polylogarithmic-factor-approximation algorithm for the min-sum k-clustering problem for arbitrary values of k.
To develop the coresets, we introduce the concept of α-preserving metric embeddings. Such an embedding satisfies the properties that (a) the distance between any pair of points does not decrease, and (b) the cost of an optimal solution for the considered problem on input (P,d′) is within a constant factor of the optimal solution on input (P,d). In other words, the idea is to find a metric embedding into a (structurally simpler) metric space that approximates the original metric up to a factor of α with respect to a certain problem. We believe that this concept is an interesting generalization of coresets.
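The two properties can be checked by brute force on a tiny instance. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the cost counts unordered pairs, and the optimum is found by enumerating all labelings, which is exponential and has nothing to do with the sublinear algorithms of the paper.

```python
from itertools import combinations, product


def min_sum_cost(clusters, d):
    """Min-sum cost of a partition, counting each unordered pair once (assumption)."""
    return sum(d(p, q) for cluster in clusters for p, q in combinations(cluster, 2))


def is_non_contracting(points, d, d_prime):
    """Property (a): no pairwise distance decreases when moving from d to d_prime."""
    return all(d_prime(p, q) >= d(p, q) for p, q in combinations(points, 2))


def optimal_cost(points, k, d):
    """Brute-force optimum over all assignments of points to k labels (tiny inputs only)."""
    best = float("inf")
    for labels in product(range(k), repeat=len(points)):
        clusters = [[p for p, lab in zip(points, labels) if lab == i] for i in range(k)]
        best = min(best, min_sum_cost(clusters, d))
    return best


def optimal_cost_ratio(points, k, d, d_prime):
    """Property (b): ratio of the optimal cost under d_prime to the optimal cost under d."""
    return optimal_cost(points, k, d_prime) / optimal_cost(points, k, d)


def d(p, q):
    """Toy original metric: points on the real line."""
    return abs(p - q)


def d_prime(p, q):
    """Toy second metric: every distance doubled, so nothing contracts."""
    return 2 * abs(p - q)


points = [0, 1, 2, 10, 11]
print(is_non_contracting(points, d, d_prime))
print(optimal_cost_ratio(points, 2, d, d_prime))
```

Here the second metric is just a scaled copy of the first, so the ratio of optimal costs equals the scaling factor. An embedding of practical interest would map into a structurally simpler metric while keeping that ratio small.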
Research supported in part by NSF ITR grant CCR-0313219, by EPSRC grant EP/D063191/1, and by DFG grant Me 872/8-3.
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Czumaj, A., Sohler, C. (2007). Small Space Representations for Metric Min-Sum k-Clustering and Their Applications. In: Thomas, W., Weil, P. (eds) STACS 2007. Lecture Notes in Computer Science, vol 4393. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-70918-3_46
DOI: https://doi.org/10.1007/978-3-540-70918-3_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-70917-6
Online ISBN: 978-3-540-70918-3
eBook Packages: Computer Science, Computer Science (R0)