DOI: 10.1145/3357384.3357969
Research Article

Privacy Preserving Approximate K-means Clustering

Published: 03 November 2019

ABSTRACT

Privacy preserving computation is of utmost importance in a cloud computing environment, where a client often needs to send sensitive data to servers offering computing services over untrusted networks. Eavesdropping on the network or malware at the server may leak sensitive information from the data. To prevent this, we propose to encode the input data in such a way that, firstly, it is difficult to decode it back to the true data, and secondly, the computational results obtained from the encoded data do not differ substantially from those obtained with the true data. Specifically, the computation that we focus on is K-means clustering, which is widely used for many data mining tasks. Our proposed variant of the K-means algorithm preserves privacy in the sense that it requires as input only binary encoded data and is not allowed to access the true data vectors at any stage of the computation. During the intermediate stages of the K-means computation, our algorithm effectively processes the inputs with this incomplete information, seeking to yield outputs close to those of the complete-information (non-encoded) case. Evaluations on real datasets show that the proposed method yields clustering effectiveness comparable to the standard K-means algorithm on image clustering (MNIST-8M dataset), and in fact outperforms standard K-means on text clustering (ODPtweets dataset).
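
The abstract does not specify the binary encoder or the modified cluster-update rule, so the sketch below is only a minimal illustration of the general idea rather than the authors' actual algorithm: it assumes a SimHash-style sign random projection as the encoder, and a K-means variant that assigns points by Hamming distance and updates centroids by bitwise majority vote. The function names, the 64-bit code length, and the update rule are all assumptions introduced for illustration.

# Illustrative sketch only; the exact encoding and K-means modification used in the
# paper are not given in this abstract.
import numpy as np

def binary_encode(X, n_bits=64, seed=0):
    """Encode real-valued rows of X as n_bits binary codes via sign random projections."""
    rng = np.random.default_rng(seed)
    hyperplanes = rng.standard_normal((X.shape[1], n_bits))
    # Only these binary codes would leave the client; X itself stays private.
    return (X @ hyperplanes >= 0).astype(np.uint8)

def hamming_kmeans(codes, k, n_iter=20, seed=0):
    """K-means over binary codes: Hamming-distance assignment, majority-vote centroids."""
    rng = np.random.default_rng(seed)
    centroids = codes[rng.choice(len(codes), size=k, replace=False)].copy()
    labels = np.zeros(len(codes), dtype=int)
    for _ in range(n_iter):
        # Assign each code to the centroid with the smallest Hamming distance.
        dists = (codes[:, None, :] != centroids[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid bit by majority vote over its assigned codes.
        for j in range(k):
            members = codes[labels == j]
            if len(members) > 0:
                centroids[j] = (members.mean(axis=0) >= 0.5).astype(np.uint8)
    return labels, centroids

# Example: the server clusters 64-bit codes without ever seeing the raw vectors X.
X = np.random.randn(1000, 32)            # stand-in for sensitive client-side data
codes = binary_encode(X, n_bits=64)      # computed on the client
labels, centroids = hamming_kmeans(codes, k=5)

In this illustrative setup only the binary codes ever leave the client, which matches the privacy-preserving intent described above; the paper's actual encoding and clustering steps may differ.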


      • Published in

        CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management
        November 2019
        3373 pages
ISBN: 9781450369763
DOI: 10.1145/3357384

        Copyright © 2019 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 3 November 2019


        Qualifiers

        • research-article

        Acceptance Rates

CIKM '19 Paper Acceptance Rate: 202 of 1,031 submissions, 20%. Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%.
