Differentially Private K-Means Clustering and a Hybrid Approach to Private Optimization

Published: 26 October 2017

Abstract

k-means clustering is a widely used clustering analysis technique in machine learning. In this article, we study the problem of differentially private k-means clustering. Several state-of-the-art methods follow the single-workload approach, which adapts an existing machine-learning algorithm by making each step private. However, most of them do not have satisfactory empirical performance. In this work, we develop techniques to analyze the empirical error behaviors of one of the state-of-the-art single-workload approaches, DPLloyd, which is a differentially private version of the Lloyd algorithm for k-means clustering. Based on the analysis, we propose an improvement of DPLloyd. We also propose a new algorithm for k-means clustering from the perspective of the noninteractive approach, which publishes a synopsis of the input dataset and then runs k-means on synthetic data generated from the synopsis. We denote this approach by EUGkM. After analyzing the empirical error behaviors of EUGkM, we further propose a hybrid approach that combines our DPLloyd improvement and EUGkM. Results from extensive and systematic experiments support our analysis and demonstrate the effectiveness of the DPLloyd improvement, EUGkM, and the hybrid approach.
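To make the single-workload idea concrete, the following is a minimal sketch of a Lloyd-style iteration in which each step is made private by adding Laplace noise to the per-cluster counts and coordinate sums. It is an illustration under stated assumptions, not the article's DPLloyd implementation: the function name `dp_lloyd_sketch`, the even split of the privacy budget across iterations, the data-independent random initialization, and the requirement that the data lie in [-1, 1]^d are all choices made here for the example.

```python
import numpy as np

def dp_lloyd_sketch(X, k, epsilon, iters=5, seed=0):
    """Hypothetical noisy Lloyd iterations; rows of X must lie in [-1, 1]^d."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Assumption: adding or removing one record changes one cluster count by 1
    # and one sum vector by at most d in L1 norm, so each iteration has L1
    # sensitivity d + 1. Splitting epsilon evenly over `iters` iterations
    # gives a Laplace scale of (d + 1) * iters / epsilon.
    scale = (d + 1) * iters / epsilon
    # Data-independent initialization, so it consumes no privacy budget.
    centroids = rng.uniform(-1.0, 1.0, size=(k, d))
    for _ in range(iters):
        # Assign each point to its nearest current centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            noisy_count = len(members) + rng.laplace(0.0, scale)
            noisy_sum = members.sum(axis=0) + rng.laplace(0.0, scale, size=d)
            # Update only when the noisy count is usable; otherwise keep
            # the previous centroid. Clipping is post-processing.
            if noisy_count >= 1.0:
                centroids[j] = np.clip(noisy_sum / noisy_count, -1.0, 1.0)
    return centroids
```

Because the noise scale grows with both the dimension and the number of iterations, a sketch like this degrades quickly on small datasets or with tight budgets, which is consistent with the abstract's remark that single-workload methods often perform poorly in practice.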

      Published In

      ACM Transactions on Privacy and Security, Volume 20, Issue 4
      November 2017
      150 pages
      ISSN:2471-2566
      EISSN:2471-2574
      DOI:10.1145/3143524
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 26 October 2017
      Accepted: 01 August 2017
      Revised: 01 August 2017
      Received: 01 October 2016
      Published in TOPS Volume 20, Issue 4

      Author Tags

      1. k-means clustering
      2. Differential privacy
      3. private data publishing

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • United States National Science Foundation
