Abstract
Center-based clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the k-means problem, which, given a set P of points from a metric space and a parameter \(k<|P|\), requires finding a subset \(S \subset P\) of k points, dubbed centers, which minimizes the sum of all squared distances of points in P from their closest center. A more general formulation, introduced to deal with noisy datasets, features a further parameter z and allows up to z points of P (outliers) to be disregarded when computing the aforementioned sum. We present a distributed coreset-based 3-round approximation algorithm for k-means with z outliers for general metric spaces, using MapReduce as a computational model. Our distributed algorithm requires sublinear local memory per reducer, and yields a solution whose approximation ratio is an additive term \(O(\gamma )\) away from the one achievable by the best known polynomial-time sequential (possibly bicriteria) approximation algorithm, where \(\gamma \) can be made arbitrarily small. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by its doubling dimension D. To the best of our knowledge, no previous distributed approaches were able to attain similar quality-performance tradeoffs for general metrics.
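The objective described above can be made concrete with a minimal sketch. It only evaluates the cost of a given candidate set of centers and is not the distributed algorithm of the paper. The function name `kmeans_outliers_cost` and the `dist` parameter are illustrative assumptions, as is the toy one-dimensional metric in the usage lines.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def kmeans_outliers_cost(
    points: Sequence[T],
    centers: Sequence[T],
    z: int,
    dist: Callable[[T, T], float],
) -> float:
    """Cost of `centers` for k-means with z outliers (hypothetical helper).

    Each point contributes the squared distance to its closest center,
    and the z largest contributions (the outliers) are disregarded.
    `dist` is any metric supplied by the caller; nothing Euclidean is assumed.
    """
    if z < 0:
        raise ValueError("z must be non-negative")
    # Squared distance of every point to its closest center, ascending.
    squared = sorted(min(dist(p, c) for c in centers) ** 2 for p in points)
    # Keep all but the z largest terms (an empty sum if z >= number of points).
    kept = squared[: max(len(squared) - z, 0)]
    return sum(kept)


# Toy usage on the real line with the absolute difference as the metric.
P = [0, 1, 2, 10, 100]
S = [1, 10]  # centers are chosen among the points of P
line_metric = lambda a, b: abs(a - b)
cost_no_outliers = kmeans_outliers_cost(P, S, 0, line_metric)
cost_one_outlier = kmeans_outliers_cost(P, S, 1, line_metric)
```

In this toy instance the point 100 dominates the cost when z = 0, while allowing a single outlier removes its contribution entirely, which is the motivation for the outlier parameter on noisy datasets.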
Acknowledgements
This work was supported, in part, by MUR of Italy, under Projects PRIN 20174LF3T8 (AHeAD: Efficient Algorithms for HArnessing Networked Data), and PNRR CN00000013 (National Centre for HPC, Big Data and Quantum Computing), and by the University of Padova under Project SID 2020 (RATED-X: Resource-Allocation TradEoffs for Dynamic and eXtreme data).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Dandolo, E., Pietracaprina, A., Pucci, G. (2023). Distributed k-Means with Outliers in General Metrics. In: Cano, J., Dikaiakos, M.D., Papadopoulos, G.A., Pericàs, M., Sakellariou, R. (eds) Euro-Par 2023: Parallel Processing. Euro-Par 2023. Lecture Notes in Computer Science, vol 14100. Springer, Cham. https://doi.org/10.1007/978-3-031-39698-4_32
DOI: https://doi.org/10.1007/978-3-031-39698-4_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-39697-7
Online ISBN: 978-3-031-39698-4
eBook Packages: Computer Science, Computer Science (R0)