ABSTRACT
Privacy-preserving computation is of utmost importance in a cloud computing environment, where a client often needs to send sensitive data over untrusted networks to servers offering computing services. Eavesdropping on the network or malware on the server may leak sensitive information from the data. To prevent this, we propose to encode the input data in such a way that, first, it is difficult to decode back to the true data and, second, the computational results obtained with the encoded data are not substantially different from those obtained with the true data. Specifically, the computational activity we focus on is K-means clustering, which is widely used for many data mining tasks. Our proposed variant of the K-means algorithm preserves privacy in the sense that it requires only binary encoded data as input and is not allowed to access the true data vectors at any stage of the computation. During the intermediate stages of the K-means computation, our algorithm is able to process inputs with incomplete information effectively, seeking to yield outputs relatively close to those of the complete-information (non-encoded) case. Evaluation on real datasets shows that the proposed method yields clustering effectiveness comparable to that of the standard K-means algorithm on image clustering (MNIST-8M dataset), and in fact outperforms standard K-means on text clustering (ODPtweets dataset).
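To make the idea of clustering on binary codes concrete, here is a minimal sketch. The abstract does not specify the encoding or the centroid update, so this example assumes sign random projections (one bit per random hyperplane) and majority-vote centroids under Hamming distance. The function names `encode_binary` and `hamming_kmeans`, the bit count, and the toy data are all hypothetical, and this should not be read as the paper's algorithm.

```python
import numpy as np

def encode_binary(X, n_bits, rng):
    """Assumed encoding: sign of random projections, one bit per hyperplane."""
    planes = rng.standard_normal((X.shape[1], n_bits))
    return (X @ planes >= 0).astype(np.uint8)

def hamming_distances(B, centroids):
    """Count differing bits between every code and every centroid."""
    return (B[:, None, :] != centroids[None, :, :]).sum(axis=2)

def hamming_kmeans(B, k, n_iter=20, rng=None):
    """Lloyd-style loop that only ever reads the binary codes B."""
    rng = np.random.default_rng() if rng is None else rng
    centroids = B[rng.choice(len(B), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = hamming_distances(B, centroids).argmin(axis=1)
        for j in range(k):
            members = B[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                # Majority vote per bit keeps the centroid binary.
                centroids[j] = (members.mean(axis=0) >= 0.5).astype(np.uint8)
    # Final assignment against the last centroids.
    labels = hamming_distances(B, centroids).argmin(axis=1)
    return labels, centroids

# Toy usage: the client encodes, the server clusters the codes only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+5, 1, (50, 8)), rng.normal(-5, 1, (50, 8))])
B = encode_binary(X, n_bits=32, rng=rng)
labels, centroids = hamming_kmeans(B, k=2, rng=rng)
```

In this sketch the server-side function `hamming_kmeans` never receives `X`, which mirrors the constraint that the true data vectors are not accessed at any stage. How well the resulting clusters match those of standard K-means depends on the number of bits and on the data.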