Abstract
We define the notion of coresets for probabilistic clustering problems and propose the first $(k,\varepsilon)$-coreset constructions for the probabilistic k-median problem in the metric and Euclidean case. The coresets are of size $\mathrm{poly}(\varepsilon^{-1}, k, \log(W/(w_{\min} \cdot p_{\min} \cdot \delta)))$, where $W$ is the expected total weight of the weighted probabilistic input points, $w_{\min}$ is the minimum weight of a probabilistic input point, $p_{\min}$ is the minimum realization probability, and $\delta$ is the error probability of the construction. We show how to maintain our coreset for Euclidean spaces in data streams.
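The abstract does not spell out the cost function or the coreset guarantee, so the following is only a minimal sketch under stated assumptions. It assumes that a probabilistic point is a weight together with a discrete distribution over locations, that each point is assigned as a whole to the center with the smallest expected distance, and that a coreset should preserve the expected cost up to a factor of (1 ± ε) for a given set of centers. The names `expected_cost` and `within_coreset_bound` are hypothetical and do not come from the paper.

```python
import math
from typing import List, Sequence, Tuple

Location = Tuple[float, ...]
# Assumed model: (weight, [(location, realization probability), ...]).
# The probabilities of one point may sum to less than 1 (point may not appear).
ProbPoint = Tuple[float, List[Tuple[Location, float]]]


def expected_cost(points: Sequence[ProbPoint], centers: Sequence[Location]) -> float:
    """Weighted expected k-median cost, assigning each point to one center."""
    total = 0.0
    for weight, realizations in points:
        # Expected distance of this point to each candidate center.
        best = min(
            sum(p * math.dist(x, c) for x, p in realizations)
            for c in centers
        )
        total += weight * best
    return total


def within_coreset_bound(
    full: Sequence[ProbPoint],
    coreset: Sequence[ProbPoint],
    centers: Sequence[Location],
    eps: float,
) -> bool:
    """Check |cost(coreset) - cost(full)| <= eps * cost(full) for these centers."""
    full_cost = expected_cost(full, centers)
    core_cost = expected_cost(coreset, centers)
    return abs(full_cost - core_cost) <= eps * full_cost


# Two identical unit-weight points are summarized by one point of weight 2.
realizations = [((0.0, 0.0), 0.5), ((2.0, 0.0), 0.5)]
full = [(1.0, realizations), (1.0, realizations)]
coreset = [(2.0, realizations)]
centers = [(0.0, 0.0), (10.0, 0.0)]
print(expected_cost(full, centers), expected_cost(coreset, centers))
```

A real coreset must satisfy the bound for every choice of k centers simultaneously; the check above only tests a single candidate set, which is useful as a sanity test but not as a proof.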
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lammersen, C., Schmidt, M., Sohler, C. (2013). Probabilistic k-Median Clustering in Data Streams. In: Erlebach, T., Persiano, G. (eds) Approximation and Online Algorithms. WAOA 2012. Lecture Notes in Computer Science, vol 7846. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38016-7_7
DOI: https://doi.org/10.1007/978-3-642-38016-7_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38015-0
Online ISBN: 978-3-642-38016-7
eBook Packages: Computer Science, Computer Science (R0)