A generic and distributed privacy preserving classification method with a worst-case privacy guarantee

Banerjee, Madhushri; Chen, Zhiyuan; Gangopadhyay, Aryya

doi:10.1007/s10619-013-7126-6

A generic and distributed privacy preserving classification method with a worst-case privacy guarantee

Published: 01 May 2013

Volume 32, pages 5–35, (2014)
Cite this article

Distributed and Parallel Databases Aims and scope Submit manuscript

Madhushri Banerjee¹,
Zhiyuan Chen¹ &
Aryya Gangopadhyay¹

455 Accesses
4 Citations
Explore all metrics

Abstract

It is often necessary for organizations to perform data mining tasks collaboratively without giving up their own data. This necessity has led to the development of privacy preserving distributed data mining. Several protocols exist which deal with data mining methods in a distributed scenario but most of these methods handle a single data mining task. Therefore, if the participating parties are interested in more than one classification methods they will have to go through a series of distributed protocols every time, thus increasing the overhead substantially. A second significant drawback with existing methods is that they are often quite expensive due to the use of encryption operations. In this paper a method has been proposed that addresses both these issues and provides a generic approach to efficient privacy preserving classification analysis in a distributed setting with a worst-case privacy guarantee. The experimental results demonstrate the effectiveness of this method.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A novel cryptographic protocol for privacy preserving classification over distributed encrypted databases

Article 25 May 2022

Distributed Privacy-Preserving Minimal Distance Classification

A New Method for Constructing Ensemble Classifier in Privacy-Preserving Distributed Environment

References

Aggarwal, C.C., Yu, P.S.: A condensation approach to privacy preserving data mining. In: 9th International Conference on Extending Database Technology, Heraklion, Crete, Greece (2004)
Google Scholar
Aggarwal, C.C., Yu, P.S.: Privacy-Preserving Data Mining: Models and Algorithms. Springer, Berlin (2008)
Book Google Scholar
Agrawal, D., Aggarwal, C.C.: On the design and quantification of privacy preserving data mining algorithms. In: 20th ACM PODS, Santa Barbara, CA, pp. 247–255 (2001)
Google Scholar
Agrawal, S., Haritsa, J.R.: A framework for high-accuracy privacy-preserving mining. In: ICDE (2005)
Google Scholar
Agrawal, R., Srikant, R.: Privacy preserving data mining. In: 2000 ACM SIGMOD, Dallas, TX, May 2000, pp. 439–450 (2000)
Google Scholar
Blum, A., Dwork, C., McSherry, F., Nissim, K.: Practical privacy: the sulq framework. In: PODS (2005)
Google Scholar
Blum, A., Ligett, K., Roth, A.: A learning theory approach to non-interactive database privacy. In: Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC’08, pp. 609–618. ACM, New York (2008). http://doi.acm.org/10.1145/1374376.1374464
Google Scholar
Caragea, D., Silvescu, A., Honavar, V.: Decision tree induction from distributed, heterogeneous, autonomous data sources. In: Conference on Intelligent Systems Design and Applications (2003)
Google Scholar
Chen, K., Liu, L.: A random rotation perturbation approach to privacy-preserving data classification. In: ICDM 2005, Houston, TX, November 2005
Google Scholar
Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X., Zhu, M.: Tools for privacy preserving distributed data mining. ACM SIGKDD Explor. 4, 28–34 (2002)
Article Google Scholar
Dalenius, T., Reiss, S.P.: Data-swapping: a technique for disclosure control. J. Stat. Plan. Inference 6, 73–85 (1982)
Article MATH MathSciNet Google Scholar
Du, W., Zhan, Z.: Building decision tree classifier on private data. In: IEEE International Conference on Privacy, Security and Data Mining, Maebashi City, Japan, December 2002, pp. 1–8 (2002)
Google Scholar
Du, W., Zhan, Z.: Using randomized response techniques for privacy preserving data mining. In: 9th ACM SIGKDD, Washington, DC, August 2003, pp. 505–510 (2003)
Google Scholar
Dwork, C.: Differential privacy. In: ICALP, pp. 1–12 (2006)
Google Scholar
Dwork, C.: Differential privacy: a survey of results. In: Proceedings of the 5th International Conference on Theory and Applications of Models of Computation, TAMC’08, pp. 1–19. Springer, Berlin (2008). http://dl.acm.org/citation.cfm?id=1791834.1791836
Chapter Google Scholar
Dwork, C., Mcsherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Proceedings of the 3rd Theory of Cryptography Conference, pp. 265–284. Springer, Berlin (2006)
Chapter Google Scholar
Dwork, C., Naor, M., Reingold, O., Rothblum, G.N., Vadhan, S.: On the complexity of differentially private data release: efficient algorithms and hardness results. In: Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC’09, pp. 381–390. ACM, New York (2009). http://doi.acm.org/10.1145/1536414.1536467
Chapter Google Scholar
Evfimevski, A., Gehrke, J., Srikant, R.: Limiting privacy breaches in privacy preserving data mining. In: 22nd ACM PODS, San Diego, CA, June 2003, pp. 211–222 (2003)
Google Scholar
Feldman, D., Fiat, A., Kaplan, H., Nissim, K.: Private coresets. In: Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC’09, pp. 361–370. ACM, New York (2009). http://doi.acm.org/10.1145/1536414.1536465
Chapter Google Scholar
Fienberg, S.E., McIntyre, J.: Data-swapping: variations on a theme by Dalenius and Reiss. Tech. rep., National Institute of Statistical Sciences (2003)
Frank, A., Asuncion, A.: UCI machine learning repository (2010). http://archive.ics.uci.edu/ml
Gal, T., Chen, Z., Gangopadhyay, A.: A privacy protection model for patient data with multiple sensitive attributes. Int. J. Inf. Secur. Priv. 2(3), 28–44 (2008)
Article Google Scholar
Giannella, C., Liu, K., Olsen, T., Kargupta, H.: Communication efficient construction of decision trees over heterogeneously distributed data. In: Fourth IEEE International Conference on Data Mining (2004)
Google Scholar
Goethals, B., Laur, S., Lipmaa, H., Mielikainen, T.: On secure scalar product computation for privacy-preserving data mining. In: The 7th Annual International Conf. in Information Security and Cryptology (2004)
Google Scholar
Huang, Z., Du, W., Chen, B.: Deriving private information from randomized data. In: SIGMOD 2005, Baltimore, MD, June 2005, pp. 37–48 (2005)
Google Scholar
Jagannathan, G., Wright, R.N.: Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. In: SIGKDD’05, Chicago, IL, pp. 593–599 (2005)
Google Scholar
Kantarcioglu, M., Clifton, C.: Privacy-preserving distributed mining of association rules on horizontally partitioned data. IEEE Trans. Knowl. Data Eng. 16(9), 1026–1037 (2004)
Article Google Scholar
Kantarcioglu, M., Vaidya, J.: Privacy preserving naïve Bayes classifier for horizontally partitioned data. In: IEEE ICDM Workshop on Privacy Preserving Data Mining, Melbourne, FL, November 2003, pp. 3–9 (2003)
Google Scholar
Kargupta, H., Park, B.H.: A Fourier spectrum-based approach to represent decision trees for mining data streams in mobile environments. IEEE Trans. Knowl. Data Eng. 16(2), 216–229 (2004)
Article Google Scholar
Kargupta, H., Datta, S., Wang, Q., Sivakumar, K.: On the privacy preserving properties of random data perturbation techniques. In: ICDM, pp. 99–106 (2003)
Google Scholar
Kim, J.J., Winkler, W.E.: Multiplicative noise for masking continuous data. Tech. rep. 2003-01, Statistical Research Division, U.S. Bureau of the Census, April 2003
Kim, D., Chen, Z., Gangopadhyay, A.: Optimizing privacy-accuracy tradeoff for privacy preserving distance-based classification. Int. J. Inf. Secur. Priv. 6(2), 16–33 (2012)
Article Google Scholar
Li, C., Hay, M., Rastogi, V., Miklau, G., McGregor, A.: Optimizing linear counting queries under differential privacy. In: Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS’10, pp. 123–134. ACM, New York (2010). http://doi.acm.org/10.1145/1807085.1807104
Chapter Google Scholar
Lin, X., Clifton, C., Zhu, Y.: Privacy preserving clustering with distributed em mixture modeling. Int. J. Knowl. Inf. Syst. 8(1), 68–81 (2005)
Article Google Scholar
Lindell, Y., Pinkas, B.: Privacy preserving data mining. In: Advances in Cryptology (CRYPTO’00). Lecture Notes in Computer Science, vol. 180, pp. 36–53 (2000)
Google Scholar
Liu, K., Giannella, C., Kargupta, H.: An attacker’s view of distance preserving maps for privacy preserving data mining. In: The 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD’06) (2006)
Google Scholar
Liu, K., Kargupta, H., Ryan, J.: Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Trans. Knowl. Data Eng. 18(1), 92–106 (2006)
Article Google Scholar
Ma, D., Sivakumar, K., Kargupta, H.: Privacy sensitive Bayesian network parameter learning. In: 4th IEEE International Conference on Data Mining (ICDM’04), Brighton, UK, November 2004, pp. 487–490 (2004)
Google Scholar
Mcsherry, F.: Mechanism design via differential privacy. In: Proceedings of the 48th Annual Symposium on Foundations of Computer Science (2007)
Google Scholar
Mukherjee, S., Chen, Z., Gangopadhyay, A.: A privacy preserving technique for Euclidean distance-based mining algorithms using Fourier-related transforms. VLDB J. 15(4), 292–315 (2006)
Article Google Scholar
Mukherjee, S., Banerjee, M., Chen, Z., Gangopadhyay, A.: A privacy preserving technique for distance-based classification with worst case privacy guarantees. Data Knowl. Eng. 66(2), 264–288 (2008)
Article Google Scholar
Mukherjee, S., Chen, Z., Gangopadhyay, A.: A fuzzy programming approach for data reduction and privacy in distance based mining. Int. J. Inf. Comput. Secur. (in press)
Oliveira, S., Zaïane, O.R.: Privacy preserving clustering by data transformation. In: 18th Brazilian Symposium on Databases, pp. 304–318 (2003)
Google Scholar
Paillier, P.: Public-key cryptosystems based on composite degree residuosity classes. In: EUROCRYPT, pp. 223–238. Springer, Berlin (1999)
Google Scholar
Roth, A., Roughgarden, T.: Interactive privacy via the median mechanism. In: Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC’10, pp. 765–774. ACM, New York (2010). http://doi.acm.org/10.1145/1806689.1806794
Chapter Google Scholar
Sweeney, L.: K-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5), 557–570 (2002)
Article MATH MathSciNet Google Scholar
Vaidya, J.: Towards a holistic approach to privacy-preserving data analysis. In: Workshop on Secure Knowledge Management (2008)
Google Scholar
Vaidya, J.S., Clifton, C.: Privacy preserving association rule mining in vertically partitioned data. In: 8th ACM SIGKDD, Edmonton, Canada, July 2002, pp. 639–644 (2002)
Google Scholar
Vaidya, J.S., Clifton, C.: Privacy-preserving k-means clustering over vertically partitioned data. In: 9th ACM SIGKDD, Washington, DC, August 2003, pp. 206–215 (2003)
Google Scholar
Vaidya, J., Clifton, C.: Privacy-preserving decision trees over vertically partitioned data. In: Proceedings of the IFIP WG 11.3 International Conference on Data and Applications Security, pp. 139–152. Springer, Berlin (2005)
Chapter Google Scholar
Vaidya, J., Clifton, C., Zhu, M.: Privacy Preserving Data Mining. Springer, Berlin (2005)
Google Scholar
Vaidya, J., Kantarcioglu, M., Clifton, C.: Privacy-preserving naïve Bayes classification. VLDB J. 17(4), 879–898 (2008)
Article Google Scholar
Warner, S.: Randomized response: a survey technique for eliminating evasive answer bias. J. Am. Stat. Assoc. 60(309), 63–69 (1965)
Article Google Scholar
Wright, R., Yang, Z.: Privacy-preserving Bayesian network structure computation on distributed heterogeneous data. In: 10th ACM SIGKDD Conference (SIGKDD’04), Seattle, WA, August 2004, pp. 713–718 (2004)
Google Scholar
Xiao, X., Bender, G., Hay, M., Gehrke, J.: Ireduct: differential privacy with reduced relative errors. In: Proceedings of the 2011 International Conference on Management of Data, SIGMOD’11, pp. 229–240. ACM, New York (2011). http://doi.acm.org/10.1145/1989323.1989348
Chapter Google Scholar
Xiao, X., Wang, G., Gehrke, J.: Differential privacy via wavelet transforms. IEEE Trans. Knowl. Data Eng. 23, 1200–1214 (2011). doi:10.1109/TKDE.2010.247
Article Google Scholar
Yao, A.C.: How to generate and exchange secrets. In: 27th IEEE Symposium on Foundations of Computer Science, pp. 162–167 (1986)
Google Scholar
Yu, H., Jiang, X., Vaidya, J.: Privacy-preserving svm using nonlinear kernels on horizontally partitioned data. In: Proceedings of the 2006 ACM Symposium on Applied Computing, SAC’06, pp. 603–610 (2006)
Chapter Google Scholar
Yu, H., Vaidya, J., Jiang, X.: Privacy-preserving svm classification on vertically partitioned data. In: PAKDD, pp. 647–656 (2006)
Google Scholar
Zhu, Y., Liu, L.: Optimal randomization for privacy preserving data mining. In: KDD, pp. 761–766 (2004)
Chapter Google Scholar

Download references

Author information

Authors and Affiliations

Department of Information Systems, University of Maryland Baltimore County, 1000 Hilltop Circle, Baltimore, MD, 21250, USA
Madhushri Banerjee, Zhiyuan Chen & Aryya Gangopadhyay

Authors

Madhushri Banerjee
View author publications
You can also search for this author in PubMed Google Scholar
Zhiyuan Chen
View author publications
You can also search for this author in PubMed Google Scholar
Aryya Gangopadhyay
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Madhushri Banerjee.

Additional information

Communicated by Elena Ferrari.

Appendix

In the appendix we give brief summary of the SMC protocols we use in this paper.

Secure sum protocol

The secure sum protocol [10] uses homomorphic encryption. More specifically, suppose we want to compute the sum of x _i (1≤i≤r) where x _i belongs to party i. Two parties are first randomly selected. Without loss of generality, suppose party 1 and 2 are chosen. Party 1 then generates a pair of public and private keys for a homomorphic encryption scheme. It then sends the public key to the other parties. Each party uses the public key to encrypt x _i and then sends the encrypted x _i to party 2. Party 2 then computes the encrypted sum without decrypting each individual x _i using the property of homomorphic encryption. Party 2 then sends the result back to party 1, which will decrypt the result and distribute the sum to each party.

The secure sum protocol is used in our algorithm for horizontally partitioned case to compute number of elements in class, class mean, global mean, global S _b, and global S _w, all of which are essentially sum of local shares. The secure sum protocol is also used by our algorithm for vertically partitioned case to compute the result of LDA transform (\(\hat{D}_{t}\)).

The secure sum protocol just needs r encryption operations (r as the number of parties) and 1 decryption operation. So both the communication overhead and computational overhead is O(r).

Secure scalar product

We use the secure scalar product protocol that was first proposed in [24] and later used in [26]. Suppose two parties A and B each has a vector x ^A and x ^B. We want to compute the scalar product of x ^A⋅x ^B. The protocol also uses homomorphic encryption. Party A generates a pair of public and private keys and sends the public key to B. Party A then computes encrypted value for x ^A and sends it to Party B. Party B selects a random value s _B and uses the property of homomorphic encryption to compute the encryption of x ^A⋅x ^B−s _B and sends it back to Party A. Party A decrypts the result and gets s _A=x ^A⋅x ^B−s _B. Now each party has a random share (s _A and s _B) for the scalar product.

The secure scalar product is used in the secure K-Means clustering [26] algorithm to compute distances. We also use it in our algorithm for the vertically partitioned case to compute S _b and S _w. Let us first consider S _b. Let vector x _i(1≤i≤m)be a 1 by c vector where the j-th value x _ij is the value of \(\mu_{j}-\bar{x}\) on column i. So the value at i-th row and j-th column of S _b is essentially scalar product of x _i and x _j. Similarly, for S _w, suppose y _lj is a 1 by |c _l| vector where the i-th element is the value of the i-th row in class c _l on column j minus the value of μ _l on column j. The value at i-th row and j-th column of S _w is the sum of a series of c (c as the number of classes) scalar products where each scalar product equals y _li⋅y _lj (i.e., the scalar product computed in class c _l but over the column i and j).

The protocol requires n encryption operations and one decryption operation for a length n vector. Since the encrypted value of a data element may be used in several scalar product computations (e.g., column 1 will be used to compute the scalar product with column 2, 3, …, ), we can reuse the encrypted value. Thus each data element needs to be encrypted at most once. Our algorithm requires O(mn) encryption operations to compute S _b and S _w in the vertically partitioned case.

Secure comparison protocol

We use the secure comparison protocol to find the closest cluster and the range of each column in our algorithms for horizontally and vertically partitioned cases. The protocol was proposed by Yao at [57].

Secure K-Means clustering

We use the secure K-Means clustering protocol proposed in [26]. This protocol also calls secure sum, secure scalar protocol, and Yao’s secure comparison protocol. The secure K-Means clustering protocol requires O(sngrr _k) encryption operations where n is number of rows in data, s is number of columns after LDA, r is number of parties, g is the number of clusters, and r _k is the number of iterations in K-Means clustering.

The overhead of these SMC protocols are dominated by the cost of encryption. In the horizontally partitioned case, our algorithm requires O(mcr+sgrr _k) encryption operations. Note that this is irrelevant to the number of rows in the data set because each party can do most computations locally. So the algorithm is efficient. In the vertically partitioned case, our algorithm requires O(mn+sngrr _k) encryption operations. Encryption can be quite expensive in practice, especially for large data sets under the vertically partitioned case. However, as stated in [24], the current hardware and some optimization tricks (such as reusing encrypted values) can make the computational overhead tolerable.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Banerjee, M., Chen, Z. & Gangopadhyay, A. A generic and distributed privacy preserving classification method with a worst-case privacy guarantee. Distrib Parallel Databases 32, 5–35 (2014). https://doi.org/10.1007/s10619-013-7126-6

Download citation

Published: 01 May 2013
Issue Date: March 2014
DOI: https://doi.org/10.1007/s10619-013-7126-6

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A generic and distributed privacy preserving classification method with a worst-case privacy guarantee

Abstract

Access this article

Similar content being viewed by others

A novel cryptographic protocol for privacy preserving classification over distributed encrypted databases

Distributed Privacy-Preserving Minimal Distance Classification

A New Method for Constructing Ensemble Classifier in Privacy-Preserving Distributed Environment

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Appendix

Secure sum protocol

Secure scalar product

Secure comparison protocol

Secure K-Means clustering

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A generic and distributed privacy preserving classification method with a worst-case privacy guarantee

Abstract

Access this article

Similar content being viewed by others

A novel cryptographic protocol for privacy preserving classification over distributed encrypted databases

Distributed Privacy-Preserving Minimal Distance Classification

A New Method for Constructing Ensemble Classifier in Privacy-Preserving Distributed Environment

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Appendix

Appendix

Secure sum protocol

Secure scalar product

Secure comparison protocol

Secure K-Means clustering

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation