Improvements on approximation algorithms for clustering probabilistic data

  • Regular Paper
Knowledge and Information Systems

Abstract

Uncertainty arises in the data of many real-world applications, and an important question is how to manage and analyze such data and how to solve optimization problems over it. Clustering is an important tool for data analysis. When a data set is uncertain, it can be modeled as a set of probabilistic points, each given by a probability distribution function that describes the possible locations of the point. In this paper, we study the k-center problem for probabilistic points in a general metric space. First, we present a fast greedy approximation algorithm that builds k centers using a farthest-first traversal in k iterations. This algorithm improves the previous approximation factor of the unrestricted assigned k-center problem from 10 (see [1]) to 6. Next, we restrict the centers to be selected from the probabilistic locations of the given points, and we show that an optimal solution for this restricted setting is a 2-approximation of an optimal solution of the assigned k-center problem with expected distance assignment. Using this idea, we improve the approximation factor of the unrestricted assigned k-center problem to 4 at the cost of a longer running time. This algorithm runs in polynomial time when k is a constant. Additionally, we implement our algorithms on three real data sets. The experimental results show that, on these data sets, the approximation factors achieved in practice are better than the theoretical bounds. We also compare the results of our algorithm with previous work and discuss the outcomes. Finally, we present our theoretical results for probabilistic k-median clustering.
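
To make the two ideas in the abstract concrete, the following is a minimal Python sketch, not the paper's algorithm. It assumes each probabilistic point is a finite list of (location, probability) pairs in the Euclidean plane, picks as the representative of a point the one of its own locations with the smallest expected distance to it, and scores a solution by the largest expected distance from any point to its nearest center. The scoring function is a simplified surrogate for illustration and may differ from the objective analyzed in the paper. All function names are hypothetical. `farthest_first_centers` follows the classical farthest-first traversal (see [17]) in k iterations, and `best_restricted_centers` enumerates every k-subset of the given locations, which is polynomial only when k is a constant.

```python
import math
from itertools import combinations

# Assumption: a probabilistic point is a list of (location, probability)
# pairs whose probabilities sum to 1; locations are 2-D tuples.

def expected_distance(point, center):
    """Expected Euclidean distance from a probabilistic point to a center."""
    return sum(prob * math.dist(loc, center) for loc, prob in point)

def surrogate_cost(points, centers):
    """Largest expected distance from any point to its nearest center.

    Illustrative surrogate only; each point is assigned to the center
    that minimizes its expected distance.
    """
    return max(min(expected_distance(p, c) for c in centers) for p in points)

def representative(point):
    """Own location of the point with the smallest expected distance to it."""
    return min((loc for loc, _ in point),
               key=lambda loc: expected_distance(point, loc))

def farthest_first_centers(points, k):
    """Greedy farthest-first traversal: k iterations, one new center each."""
    centers = [representative(points[0])]
    while len(centers) < k:
        # Pick the point that is currently served worst by the chosen centers.
        farthest = max(
            points,
            key=lambda p: min(expected_distance(p, c) for c in centers),
        )
        centers.append(representative(farthest))
    return centers

def best_restricted_centers(points, k):
    """Exhaustive search over k-subsets of all given locations."""
    candidates = sorted({loc for p in points for loc, _ in p})
    return min(combinations(candidates, k),
               key=lambda cs: surrogate_cost(points, cs))

# Toy instance (made up for illustration).
points = [
    [((0, 0), 0.5), ((0, 2), 0.5)],
    [((10, 0), 0.5), ((10, 2), 0.5)],
    [((0, 1), 1.0)],
]
greedy = farthest_first_centers(points, 2)
exhaustive = best_restricted_centers(points, 2)
print(greedy, surrogate_cost(points, greedy))
print(exhaustive, surrogate_cost(points, exhaustive))
```

The exhaustive search evaluates every k-subset of the candidate locations, so its running time grows roughly as the number of locations raised to the power k, while the greedy traversal needs only k passes over the points. This mirrors the trade-off between running time and approximation quality described in the abstract.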

Notes

  1. https://www.kaggle.com/kveykva/sf-bay-area-pokemon-go-spawns.

  2. https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew.

  3. https://www.kaggle.com/foenix/slc-crime.

References

  1. Alipour S, Jafari A (2018) Improvements on the k-center problem for uncertain data. In: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI symposium on principles of database systems, Houston, TX, USA, 10–15 Jun, 2018, pp 425–433. [Online]. Available: https://doi.org/10.1145/3196959.3196969

  2. Guha S, Munagala K (2009) Exceeding expectations and clustering uncertain data. In: Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, PODS 2009, Jun 19–Jul 1, 2009, Providence, Rhode Island, USA, pp 269–278. [Online]. Available: https://doi.org/10.1145/1559795.1559836

  3. Megiddo N (1984) On the complexity of some common geometric location problems. SIAM J Comput 13(1):182–196. https://doi.org/10.1137/0213014

  4. Wang H, Zhang J (2015) One-dimensional k-center on uncertain data. Theor Comput Sci 602:114–124. https://doi.org/10.1016/j.tcs.2015.08.017

  5. Arya V, Garg N, Khandekar R, Meyerson A, Munagala K, Pandit V (2004) Local search heuristics for k-median and facility location problems. SIAM J Comput 33(3):544–562. https://doi.org/10.1137/S0097539702416402

  6. Badoiu M, Har-Peled S, Indyk P (2002) Approximate clustering via core-sets. In: Proceedings on 34th annual ACM symposium on theory of computing, 19–21 May, 2002, Montréal, Québec, Canada, pp 250–257. [Online]. Available: https://doi.org/10.1145/509907.509947

  7. Har-Peled S, Mazumdar S (2004) On coresets for k-means and k-median clustering. In: Proceedings of the 36th annual ACM symposium on theory of computing, Chicago, IL, USA, 13–16 Jun, 2004, pp 291–300. [Online]. Available: https://doi.org/10.1145/1007352.1007400

  8. Dyer ME (1986) On a multidimensional search technique and its application to the Euclidean one-centre problem. SIAM J Comput 15(3):725–738. https://doi.org/10.1137/0215052

  9. Lee DT, Wu Y (1986) Complexity of some location problems. Algorithmica 1(2):193–211. https://doi.org/10.1007/BF01840442

  10. Megiddo N (1983) Linear-time algorithms for linear programming in \(R^3\) and related problems. SIAM J Comput 12(4):759–776. https://doi.org/10.1137/0212052

  11. Chandrasekaran R, Tamir A (1990) Algebraic optimization: the Fermat-Weber location problem. Math Program 46:219–224. https://doi.org/10.1007/BF01585739

  12. Chandrasekaran R, Tamir A (1982) Polynomially bounded algorithms for locating p-centers on a tree. Math Program 22(1):304–315. https://doi.org/10.1007/BF01581045

  13. Frederickson GN (1991) Parametric search and locating supply centers in trees. In: Proceedings of algorithms and data structures, 2nd workshop WADS ’91, Ottawa, Canada, 14–16 Aug, 1991, pp 299–319. [Online]. Available: https://doi.org/10.1007/BFb002827

  14. Megiddo N, Tamir A (1983) New results on the complexity of p-center problems. SIAM J Comput 12(4):751–758. https://doi.org/10.1137/0212051

  15. Drezner Z, Hamacher HW (2002) Facility location—applications and theory. Springer [Online] Available: http://www.springer.com/computer/swe/book/978-3-540-42172-6

  16. Megiddo N, Tamir A, Zemel E, Chandrasekaran R (1981) An \(O(n \log^2 n)\) algorithm for the k-th longest path in a tree with applications to location problems. SIAM J Comput 10(2):328–337. https://doi.org/10.1137/0210023

  17. Gonzalez TF (1985) Clustering to minimize the maximum intercluster distance. Theor Comput Sci 38:293–306. https://doi.org/10.1016/0304-3975(85)90224-5

  18. Feder T, Greene DH (1988) Optimal algorithms for approximate clustering. In: Proceedings of the 20th annual ACM symposium on theory of computing, 2–4 May, 1988, Chicago, Illinois, USA, pp 434–444. [Online]. Available: https://doi.org/10.1145/62212.62255

  19. Agarwal PK, Procopiuc CM (2002) Exact and approximation algorithms for clustering. Algorithmica 33(2):201–226. https://doi.org/10.1007/s00453-001-0110-y

  20. Kumar P, Kumar P (2010) Almost optimal solutions to k-clustering problems. Int J Comput Geom Appl 20(4):431–447. https://doi.org/10.1142/S0218195910003372

  21. Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining (KDD-96), Portland, Oregon, USA, pp 226–231. [Online]. Available: http://www.aaai.org/Library/KDD/1996/kdd96-037.php

  22. Kriegel H, Pfeifle M (2005) Density-based clustering of uncertain data. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery and data mining, Chicago, Illinois, USA, 21–24 Aug, 2005, pp 672–677. [Online]. Available: https://doi.org/10.1145/1081870.1081955

  23. Kriegel H, Pfeifle M (2005) Hierarchical density-based clustering of uncertain data. In: Proceedings of the 5th IEEE international conference on data mining (ICDM 2005), 27–30 Nov, 2005, Houston, Texas, USA, pp 689–692. [Online]. Available: https://doi.org/10.1109/ICDM.2005.75

  24. Xu H, Li G (2008) Density-based probabilistic clustering of uncertain data. In: International conference on computer science and software engineering, CSSE 2008, Volume 4: embedded programming/database technology / neural networks and applications/other applications, 12–14 Dec, 2008, Wuhan, China, pp 474–477. [Online]. Available: https://doi.org/10.1109/CSSE.2008.968

  25. Aggarwal CC, Yu PS (2009) A survey of uncertain data algorithms and applications. IEEE Trans Knowl Data Eng 21(5):609–623. [Online]. Available: https://doi.org/10.1109/TKDE.2008.190

  26. Cormode G, McGregor A (2008) Approximation algorithms for clustering uncertain data. In: Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, PODS 2008, 9–11 Jun, 2008, Vancouver, BC, Canada, pp 191–200. [Online]. Available: https://doi.org/10.1145/1376916.1376944

  27. Munteanu A, Sohler C, Feldman D (2014) Smallest enclosing ball for probabilistic data. In: 30th annual symposium on computational geometry, SOCG’14, Kyoto, Japan, 08–11 Jun, 2014, p 214. [Online]. Available: https://doi.org/10.1145/2582112.2582114

  28. Huang L, Li J (2017) Stochastic k-center and j-flat-center problems. In: Proceedings of the twenty-eighth annual ACM-SIAM symposium on discrete algorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, 16–19 Jan, 2017, pp 110–129. [Online]. Available: https://doi.org/10.1137/1.9781611974782.8

  29. Charikar M, Guha S (1999) Improved combinatorial algorithms for the facility location and k-median problems. In: 40th annual symposium on foundations of computer science, FOCS ’99, 17–18 Oct, 1999, New York, NY, USA, pp 378–388. [Online]. Available: https://doi.org/10.1109/SFFCS.1999.814609

  30. Charikar M, Guha S, Tardos É, Shmoys DB (1999) A constant-factor approximation algorithm for the k-median problem (extended abstract). In: Proceedings of the thirty-first annual ACM symposium on theory of computing, 1–4 May, 1999, Atlanta, Georgia, USA, pp 1–10. [Online]. Available: https://doi.org/10.1145/301250.301257

  31. Jain K, Vazirani VV (2001) Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. J ACM 48(2):274–296. https://doi.org/10.1145/375827.375845

  32. Kolliopoulos SG, Rao S (1999) A nearly linear-time approximation scheme for the Euclidean k-median problem. In: Algorithms—ESA ’99, Proceedings of the 7th annual European symposium, Prague, Czech Republic, pp 378–389. https://doi.org/10.1007/3-540-48481-7_33

  33. Alipour S (2020) Approximation algorithms for probabilistic k-center clustering. In: 20th IEEE international conference on data mining, ICDM 2020, Sorrento, Italy, November 17–20, 2020, pp 1–11. [Online]. Available: https://doi.org/10.1109/ICDM50108.2020.00009

Author information

Correspondence to Sharareh Alipour.

Cite this article

Alipour, S. Improvements on approximation algorithms for clustering probabilistic data. Knowl Inf Syst 63, 2719–2740 (2021). https://doi.org/10.1007/s10115-021-01601-4
