Probabilistic k-Median Clustering in Data Streams

  • Conference paper
Approximation and Online Algorithms (WAOA 2012)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7846)

Abstract

We define the notion of coresets for probabilistic clustering problems and propose the first $(k,\varepsilon)$-coreset constructions for the probabilistic k-median problem in the metric and Euclidean case. The coresets are of size $\mathrm{poly}(\varepsilon^{-1}, k, \log(W/(w_{\min} \cdot p_{\min} \cdot \delta)))$, where $W$ is the expected total weight of the weighted probabilistic input points, $w_{\min}$ is the minimum weight of a probabilistic input point, $p_{\min}$ is the minimum realization probability, and $\delta$ is the error probability of the construction. We show how to maintain our coreset for Euclidean spaces in data streams.
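
To make the coreset notion concrete, the following is a minimal sketch, not the paper's construction. It assumes a cost model the abstract does not spell out: each weighted probabilistic point is assigned to the single center that minimizes its expected Euclidean distance, and a point that is not realized contributes zero cost. The function names `expected_cost` and `within_coreset_bound` are hypothetical. The second function checks the (1 ± ε) guarantee for one given set of centers only, whereas a coreset must satisfy it for every set of k centers.

```python
import math

def expected_cost(points, centers):
    """Weighted expected k-median cost under the ASSUMED assigned model.

    points:  list of (weight, [(prob, location), ...]). The probabilities of
             one point may sum to less than 1; with the remaining probability
             the point is not realized and contributes zero cost.
    centers: list of candidate center locations (tuples of floats).
    """
    total = 0.0
    for weight, realizations in points:
        # Assign the whole probabilistic point to the center with the
        # smallest expected distance over its realizations.
        best = min(
            sum(p * math.dist(x, c) for p, x in realizations)
            for c in centers
        )
        total += weight * best
    return total

def within_coreset_bound(full, coreset, centers, eps):
    """Check |cost(coreset) - cost(full)| <= eps * cost(full) for ONE center set."""
    full_cost = expected_cost(full, centers)
    core_cost = expected_cost(coreset, centers)
    return abs(core_cost - full_cost) <= eps * full_cost

# Two identical probabilistic points of weight 1 can be replaced by one
# copy of weight 2 without changing the cost for any center set.
point = [(0.5, (0.0, 0.0)), (0.5, (2.0, 0.0))]
full_input = [(1.0, point), (1.0, point)]
merged = [(2.0, point)]
centers = [(0.0, 0.0)]
ok = within_coreset_bound(full_input, merged, centers, eps=0.1)
```

The merged example is a trivial, lossless case. The size bound quoted in the abstract concerns the harder situation in which many distinct probabilistic points are summarized by a small weighted set with error ε.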

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lammersen, C., Schmidt, M., Sohler, C. (2013). Probabilistic k-Median Clustering in Data Streams. In: Erlebach, T., Persiano, G. (eds) Approximation and Online Algorithms. WAOA 2012. Lecture Notes in Computer Science, vol 7846. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38016-7_7

  • DOI: https://doi.org/10.1007/978-3-642-38016-7_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38015-0

  • Online ISBN: 978-3-642-38016-7

  • eBook Packages: Computer Science, Computer Science (R0)
