Abstract
Uncertain data arise in many real-world applications, and an important issue is how to manage and analyze such data and how to solve optimization problems over them. Clustering is an important tool for data analysis. When a data set is uncertain, we can model it as a set of probabilistic points, each given by a probability distribution function that describes the possible locations of the point. In this paper, we study the k-center problem for probabilistic points in a general metric space. First, we present a fast greedy approximation algorithm that builds k centers using a farthest-first traversal in k iterations. This algorithm improves the previous approximation factor of the unrestricted assigned k-center problem from 10 (see [1]) to 6. Next, we restrict the centers to be selected from the set of all probabilistic locations of the given points, and we show that an optimal solution for this restricted setting is a 2-approximation of an optimal solution of the assigned k-center problem with expected distance assignment. Using this idea, we improve the approximation factor of the unrestricted assigned k-center problem to 4 at the cost of a longer running time; the resulting algorithm still runs in polynomial time when k is a constant. Additionally, we implement our algorithms on three real data sets. The experimental results show that, on these data sets, the approximation factors achieved in practice are better than the theoretical bounds. We also compare the results of our algorithms with previous work and discuss the results. Finally, we present our theoretical results for probabilistic k-median clustering.
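As an illustration of the farthest-first idea mentioned in the abstract, the following is a minimal sketch and not the paper's algorithm. It assumes discrete probabilistic points in Euclidean space, represents each point by the one of its own locations that minimizes the expected distance to the point, and measures "farthest" by expected distance to the nearest chosen center. The names `ProbPoint`, `expected_distance`, `representative`, and `farthest_first_centers` are hypothetical, and no approximation guarantee is claimed for this sketch.

```python
import math
from typing import List, Tuple

Location = Tuple[float, ...]
# Assumed model: a probabilistic point is a list of (location, probability)
# pairs whose probabilities sum to 1.
ProbPoint = List[Tuple[Location, float]]


def expected_distance(p: ProbPoint, q: Location) -> float:
    """Expected Euclidean distance from probabilistic point p to location q."""
    return sum(prob * math.dist(loc, q) for loc, prob in p)


def representative(p: ProbPoint) -> Location:
    """Location of p with the smallest expected distance to p (assumed choice)."""
    return min((loc for loc, _ in p), key=lambda loc: expected_distance(p, loc))


def farthest_first_centers(points: List[ProbPoint], k: int) -> List[Location]:
    """Greedy farthest-first traversal over representatives, at most k iterations."""
    if not points or k <= 0:
        return []
    reps = [representative(p) for p in points]
    centers = [reps[0]]  # arbitrary first center
    # nearest[i] = expected distance from point i to its closest chosen center
    nearest = [expected_distance(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        # Pick the point that is currently worst served by the chosen centers.
        far = max(range(len(points)), key=nearest.__getitem__)
        centers.append(reps[far])
        nearest = [
            min(nearest[i], expected_distance(points[i], reps[far]))
            for i in range(len(points))
        ]
    return centers
```

Calling `farthest_first_centers(points, k)` returns up to k candidate centers; each point can then be assigned to the center with the smallest expected distance, which corresponds to the expected distance assignment named in the abstract.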
References
Alipour S, Jafari A (2018) Improvements on the k-center problem for uncertain data. In: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI symposium on principles of database systems, Houston, TX, USA, 10–15 Jun, 2018, pp 425–433. [Online]. Available: https://doi.org/10.1145/3196959.3196969
Guha S, Munagala K (2009) Exceeding expectations and clustering uncertain data. In: Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, PODS 2009, Jun 19–Jul 1, 2009, Providence, Rhode Island, USA, pp 269–278. [Online]. Available: https://doi.org/10.1145/1559795.1559836
Megiddo N (1984) On the complexity of some common geometric location problems. SIAM J Comput 13(1):182–196. https://doi.org/10.1137/0213014
Wang H, Zhang J (2015) One-dimensional k-center on uncertain data. Theor Comput Sci 602:114–124. https://doi.org/10.1016/j.tcs.2015.08.017
Arya V, Garg N, Khandekar R, Meyerson A, Munagala K, Pandit V (2004) Local search heuristics for k-median and facility location problems. SIAM J Comput 33(3):544–562. https://doi.org/10.1137/S0097539702416402
Badoiu M, Har-Peled S, Indyk P (2002) Approximate clustering via core-sets. In: Proceedings of the 34th annual ACM symposium on theory of computing, 19–21 May, 2002, Montréal, Québec, Canada, pp 250–257. [Online]. Available: https://doi.org/10.1145/509907.509947
Har-Peled S, Mazumdar S (2004) On coresets for k-means and k-median clustering. In: Proceedings of the 36th annual ACM symposium on theory of computing, Chicago, IL, USA, 13–16 Jun, 2004, pp 291–300. [Online]. Available: https://doi.org/10.1145/1007352.1007400
Dyer ME (1986) On a multidimensional search technique and its application to the Euclidean one-centre problem. SIAM J Comput 15(3):725–738. https://doi.org/10.1137/0215052
Lee DT, Wu Y (1986) Complexity of some location problems. Algorithmica 1(2):193–211. https://doi.org/10.1007/BF01840442
Megiddo N (1983) Linear-time algorithms for linear programming in R\(^{3}\) and related problems. SIAM J Comput 12(4):759–776. https://doi.org/10.1137/0212052
Chandrasekaran R, Tamir A (1990) Algebraic optimization: the Fermat-Weber location problem. Math Program 46:219–224. https://doi.org/10.1007/BF01585739
Chandrasekaran R, Tamir A (1982) Polynomially bounded algorithms for locating p-centers on a tree. Math Program 22(1):304–315. https://doi.org/10.1007/BF01581045
Frederickson GN (1991) Parametric search and locating supply centers in trees. In: Proceedings of algorithms and data structures, 2nd workshop WADS ’91, Ottawa, Canada, 14–16 Aug, 1991, pp 299–319. [Online]. Available: https://doi.org/10.1007/BFb002827
Megiddo N, Tamir A (1983) New results on the complexity of p-center problems. SIAM J Comput 12(4):751–758. https://doi.org/10.1137/0212051
Drezner Z, Hamacher HW (2002) Facility location: applications and theory. Springer. [Online]. Available: http://www.springer.com/computer/swe/book/978-3-540-42172-6
Megiddo N, Tamir A, Zemel E, Chandrasekaran R (1981) An O(n log\(^{2}\) n) algorithm for the k-th longest path in a tree with applications to location problems. SIAM J Comput 10(2):328–337. https://doi.org/10.1137/0210023
Gonzalez TF (1985) Clustering to minimize the maximum intercluster distance. Theor Comput Sci 38:293–306. https://doi.org/10.1016/0304-3975(85)90224-5
Feder T, Greene DH (1988) Optimal algorithms for approximate clustering. In: Proceedings of the 20th annual ACM symposium on theory of computing, 2–4 May, 1988, Chicago, Illinois, USA, pp 434–444. [Online]. Available: https://doi.org/10.1145/62212.62255
Agarwal PK, Procopiuc CM (2002) Exact and approximation algorithms for clustering. Algorithmica 33(2):201–226. https://doi.org/10.1007/s00453-001-0110-y
Kumar P, Kumar P (2010) Almost optimal solutions to k-clustering problems. Int J Comput Geom Appl 20(4):431–447. https://doi.org/10.1142/S0218195910003372
Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining (KDD-96), Portland, Oregon, USA, pp 226–231. [Online]. Available: http://www.aaai.org/Library/KDD/1996/kdd96-037.php
Kriegel H, Pfeifle M (2005) Density-based clustering of uncertain data. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery and data mining, Chicago, Illinois, USA, 21–24 Aug, 2005, pp 672–677. [Online]. Available: https://doi.org/10.1145/1081870.1081955
Kriegel H, Pfeifle M (2005) Hierarchical density-based clustering of uncertain data. In: Proceedings of the 5th IEEE international conference on data mining (ICDM 2005), 27–30 Nov, 2005, Houston, Texas, USA, pp 689–692. [Online]. Available: https://doi.org/10.1109/ICDM.2005.75
Xu H, Li G (2008) Density-based probabilistic clustering of uncertain data. In: International conference on computer science and software engineering, CSSE 2008, Volume 4: embedded programming/database technology / neural networks and applications/other applications, 12–14 Dec, 2008, Wuhan, China, pp 474–477. [Online]. Available: https://doi.org/10.1109/CSSE.2008.968
Aggarwal CC, Yu PS (2009) A survey of uncertain data algorithms and applications. IEEE Trans Knowl Data Eng 21(5):609–623. [Online]. Available: https://doi.org/10.1109/TKDE.2008.190
Cormode G, McGregor A (2008) Approximation algorithms for clustering uncertain data. In: Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, PODS 2008, 9–11 Jun, 2008, Vancouver, BC, Canada, pp 191–200. [Online]. Available: https://doi.org/10.1145/1376916.1376944
Munteanu A, Sohler C, Feldman D (2014) Smallest enclosing ball for probabilistic data. In: 30th annual symposium on computational geometry, SOCG’14, Kyoto, Japan, 08–11 Jun, 2014, p 214. [Online]. Available: https://doi.org/10.1145/2582112.2582114
Huang L, Li J (2017) Stochastic k-center and j-flat-center problems. In: Proceedings of the twenty-eighth annual ACM-SIAM symposium on discrete algorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, 16–19 Jan, 2017, pp 110–129. [Online]. Available: https://doi.org/10.1137/1.9781611974782.8
Charikar M, Guha S (1999) Improved combinatorial algorithms for the facility location and k-median problems. In: 40th annual symposium on foundations of computer science, FOCS ’99, 17–18 Oct, 1999, New York, NY, USA, pp 378–388. [Online]. Available: https://doi.org/10.1109/SFFCS.1999.814609
Charikar M, Guha S, Tardos É, Shmoys DB (1999) A constant-factor approximation algorithm for the k-median problem (extended abstract). In: Proceedings of the thirty-first annual ACM symposium on theory of computing, 1–4 May, 1999, Atlanta, Georgia, USA, pp 1–10. [Online]. Available: https://doi.org/10.1145/301250.301257
Jain K, Vazirani VV (2001) Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. J ACM 48(2):274–296. https://doi.org/10.1145/375827.375845
Kolliopoulos SG, Rao S (1999) A nearly linear-time approximation scheme for the Euclidean kappa-median problem. In: Algorithms - ESA ’99, Proceedings of the 7th annual European symposium, Prague, Czech Republic, pp 378–389. https://doi.org/10.1007/3-540-48481-7_33
Alipour S (2020) Approximation algorithms for probabilistic k-center clustering. In: 20th IEEE international conference on data mining, ICDM 2020, Sorrento, Italy, 17–20 Nov, 2020, pp 1–11. [Online]. Available: https://doi.org/10.1109/ICDM50108.2020.00009
Cite this article
Alipour, S. Improvements on approximation algorithms for clustering probabilistic data. Knowl Inf Syst 63, 2719–2740 (2021). https://doi.org/10.1007/s10115-021-01601-4