Skip to main content
Log in

A generic and distributed privacy preserving classification method with a worst-case privacy guarantee

  • Published:
Distributed and Parallel Databases Aims and scope Submit manuscript

Abstract

It is often necessary for organizations to perform data mining tasks collaboratively without giving up their own data. This necessity has led to the development of privacy preserving distributed data mining. Several protocols exist which deal with data mining methods in a distributed scenario but most of these methods handle a single data mining task. Therefore, if the participating parties are interested in more than one classification methods they will have to go through a series of distributed protocols every time, thus increasing the overhead substantially. A second significant drawback with existing methods is that they are often quite expensive due to the use of encryption operations. In this paper a method has been proposed that addresses both these issues and provides a generic approach to efficient privacy preserving classification analysis in a distributed setting with a worst-case privacy guarantee. The experimental results demonstrate the effectiveness of this method.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Algorithm 1
Fig. 3
Algorithm 2
Algorithm 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

References

  1. Aggarwal, C.C., Yu, P.S.: A condensation approach to privacy preserving data mining. In: 9th International Conference on Extending Database Technology, Heraklion, Crete, Greece (2004)

    Google Scholar 

  2. Aggarwal, C.C., Yu, P.S.: Privacy-Preserving Data Mining: Models and Algorithms. Springer, Berlin (2008)

    Book  Google Scholar 

  3. Agrawal, D., Aggarwal, C.C.: On the design and quantification of privacy preserving data mining algorithms. In: 20th ACM PODS, Santa Barbara, CA, pp. 247–255 (2001)

    Google Scholar 

  4. Agrawal, S., Haritsa, J.R.: A framework for high-accuracy privacy-preserving mining. In: ICDE (2005)

    Google Scholar 

  5. Agrawal, R., Srikant, R.: Privacy preserving data mining. In: 2000 ACM SIGMOD, Dallas, TX, May 2000, pp. 439–450 (2000)

    Google Scholar 

  6. Blum, A., Dwork, C., McSherry, F., Nissim, K.: Practical privacy: the sulq framework. In: PODS (2005)

    Google Scholar 

  7. Blum, A., Ligett, K., Roth, A.: A learning theory approach to non-interactive database privacy. In: Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC’08, pp. 609–618. ACM, New York (2008). http://doi.acm.org/10.1145/1374376.1374464

    Google Scholar 

  8. Caragea, D., Silvescu, A., Honavar, V.: Decision tree induction from distributed, heterogeneous, autonomous data sources. In: Conference on Intelligent Systems Design and Applications (2003)

    Google Scholar 

  9. Chen, K., Liu, L.: A random rotation perturbation approach to privacy-preserving data classification. In: ICDM 2005, Houston, TX, November 2005

    Google Scholar 

  10. Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X., Zhu, M.: Tools for privacy preserving distributed data mining. ACM SIGKDD Explor. 4, 28–34 (2002)

    Article  Google Scholar 

  11. Dalenius, T., Reiss, S.P.: Data-swapping: a technique for disclosure control. J. Stat. Plan. Inference 6, 73–85 (1982)

    Article  MATH  MathSciNet  Google Scholar 

  12. Du, W., Zhan, Z.: Building decision tree classifier on private data. In: IEEE International Conference on Privacy, Security and Data Mining, Maebashi City, Japan, December 2002, pp. 1–8 (2002)

    Google Scholar 

  13. Du, W., Zhan, Z.: Using randomized response techniques for privacy preserving data mining. In: 9th ACM SIGKDD, Washington, DC, August 2003, pp. 505–510 (2003)

    Google Scholar 

  14. Dwork, C.: Differential privacy. In: ICALP, pp. 1–12 (2006)

    Google Scholar 

  15. Dwork, C.: Differential privacy: a survey of results. In: Proceedings of the 5th International Conference on Theory and Applications of Models of Computation, TAMC’08, pp. 1–19. Springer, Berlin (2008). http://dl.acm.org/citation.cfm?id=1791834.1791836

    Chapter  Google Scholar 

  16. Dwork, C., Mcsherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Proceedings of the 3rd Theory of Cryptography Conference, pp. 265–284. Springer, Berlin (2006)

    Chapter  Google Scholar 

  17. Dwork, C., Naor, M., Reingold, O., Rothblum, G.N., Vadhan, S.: On the complexity of differentially private data release: efficient algorithms and hardness results. In: Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC’09, pp. 381–390. ACM, New York (2009). http://doi.acm.org/10.1145/1536414.1536467

    Chapter  Google Scholar 

  18. Evfimevski, A., Gehrke, J., Srikant, R.: Limiting privacy breaches in privacy preserving data mining. In: 22nd ACM PODS, San Diego, CA, June 2003, pp. 211–222 (2003)

    Google Scholar 

  19. Feldman, D., Fiat, A., Kaplan, H., Nissim, K.: Private coresets. In: Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC’09, pp. 361–370. ACM, New York (2009). http://doi.acm.org/10.1145/1536414.1536465

    Chapter  Google Scholar 

  20. Fienberg, S.E., McIntyre, J.: Data-swapping: variations on a theme by Dalenius and Reiss. Tech. rep., National Institute of Statistical Sciences (2003)

  21. Frank, A., Asuncion, A.: UCI machine learning repository (2010). http://archive.ics.uci.edu/ml

  22. Gal, T., Chen, Z., Gangopadhyay, A.: A privacy protection model for patient data with multiple sensitive attributes. Int. J. Inf. Secur. Priv. 2(3), 28–44 (2008)

    Article  Google Scholar 

  23. Giannella, C., Liu, K., Olsen, T., Kargupta, H.: Communication efficient construction of decision trees over heterogeneously distributed data. In: Fourth IEEE International Conference on Data Mining (2004)

    Google Scholar 

  24. Goethals, B., Laur, S., Lipmaa, H., Mielikainen, T.: On secure scalar product computation for privacy-preserving data mining. In: The 7th Annual International Conf. in Information Security and Cryptology (2004)

    Google Scholar 

  25. Huang, Z., Du, W., Chen, B.: Deriving private information from randomized data. In: SIGMOD 2005, Baltimore, MD, June 2005, pp. 37–48 (2005)

    Google Scholar 

  26. Jagannathan, G., Wright, R.N.: Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. In: SIGKDD’05, Chicago, IL, pp. 593–599 (2005)

    Google Scholar 

  27. Kantarcioglu, M., Clifton, C.: Privacy-preserving distributed mining of association rules on horizontally partitioned data. IEEE Trans. Knowl. Data Eng. 16(9), 1026–1037 (2004)

    Article  Google Scholar 

  28. Kantarcioglu, M., Vaidya, J.: Privacy preserving naïve Bayes classifier for horizontally partitioned data. In: IEEE ICDM Workshop on Privacy Preserving Data Mining, Melbourne, FL, November 2003, pp. 3–9 (2003)

    Google Scholar 

  29. Kargupta, H., Park, B.H.: A Fourier spectrum-based approach to represent decision trees for mining data streams in mobile environments. IEEE Trans. Knowl. Data Eng. 16(2), 216–229 (2004)

    Article  Google Scholar 

  30. Kargupta, H., Datta, S., Wang, Q., Sivakumar, K.: On the privacy preserving properties of random data perturbation techniques. In: ICDM, pp. 99–106 (2003)

    Google Scholar 

  31. Kim, J.J., Winkler, W.E.: Multiplicative noise for masking continuous data. Tech. rep. 2003-01, Statistical Research Division, U.S. Bureau of the Census, April 2003

  32. Kim, D., Chen, Z., Gangopadhyay, A.: Optimizing privacy-accuracy tradeoff for privacy preserving distance-based classification. Int. J. Inf. Secur. Priv. 6(2), 16–33 (2012)

    Article  Google Scholar 

  33. Li, C., Hay, M., Rastogi, V., Miklau, G., McGregor, A.: Optimizing linear counting queries under differential privacy. In: Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS’10, pp. 123–134. ACM, New York (2010). http://doi.acm.org/10.1145/1807085.1807104

    Chapter  Google Scholar 

  34. Lin, X., Clifton, C., Zhu, Y.: Privacy preserving clustering with distributed em mixture modeling. Int. J. Knowl. Inf. Syst. 8(1), 68–81 (2005)

    Article  Google Scholar 

  35. Lindell, Y., Pinkas, B.: Privacy preserving data mining. In: Advances in Cryptology (CRYPTO’00). Lecture Notes in Computer Science, vol. 180, pp. 36–53 (2000)

    Google Scholar 

  36. Liu, K., Giannella, C., Kargupta, H.: An attacker’s view of distance preserving maps for privacy preserving data mining. In: The 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD’06) (2006)

    Google Scholar 

  37. Liu, K., Kargupta, H., Ryan, J.: Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Trans. Knowl. Data Eng. 18(1), 92–106 (2006)

    Article  Google Scholar 

  38. Ma, D., Sivakumar, K., Kargupta, H.: Privacy sensitive Bayesian network parameter learning. In: 4th IEEE International Conference on Data Mining (ICDM’04), Brighton, UK, November 2004, pp. 487–490 (2004)

    Google Scholar 

  39. Mcsherry, F.: Mechanism design via differential privacy. In: Proceedings of the 48th Annual Symposium on Foundations of Computer Science (2007)

    Google Scholar 

  40. Mukherjee, S., Chen, Z., Gangopadhyay, A.: A privacy preserving technique for Euclidean distance-based mining algorithms using Fourier-related transforms. VLDB J. 15(4), 292–315 (2006)

    Article  Google Scholar 

  41. Mukherjee, S., Banerjee, M., Chen, Z., Gangopadhyay, A.: A privacy preserving technique for distance-based classification with worst case privacy guarantees. Data Knowl. Eng. 66(2), 264–288 (2008)

    Article  Google Scholar 

  42. Mukherjee, S., Chen, Z., Gangopadhyay, A.: A fuzzy programming approach for data reduction and privacy in distance based mining. Int. J. Inf. Comput. Secur. (in press)

  43. Oliveira, S., Zaïane, O.R.: Privacy preserving clustering by data transformation. In: 18th Brazilian Symposium on Databases, pp. 304–318 (2003)

    Google Scholar 

  44. Paillier, P.: Public-key cryptosystems based on composite degree residuosity classes. In: EUROCRYPT, pp. 223–238. Springer, Berlin (1999)

    Google Scholar 

  45. Roth, A., Roughgarden, T.: Interactive privacy via the median mechanism. In: Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC’10, pp. 765–774. ACM, New York (2010). http://doi.acm.org/10.1145/1806689.1806794

    Chapter  Google Scholar 

  46. Sweeney, L.: K-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5), 557–570 (2002)

    Article  MATH  MathSciNet  Google Scholar 

  47. Vaidya, J.: Towards a holistic approach to privacy-preserving data analysis. In: Workshop on Secure Knowledge Management (2008)

    Google Scholar 

  48. Vaidya, J.S., Clifton, C.: Privacy preserving association rule mining in vertically partitioned data. In: 8th ACM SIGKDD, Edmonton, Canada, July 2002, pp. 639–644 (2002)

    Google Scholar 

  49. Vaidya, J.S., Clifton, C.: Privacy-preserving k-means clustering over vertically partitioned data. In: 9th ACM SIGKDD, Washington, DC, August 2003, pp. 206–215 (2003)

    Google Scholar 

  50. Vaidya, J., Clifton, C.: Privacy-preserving decision trees over vertically partitioned data. In: Proceedings of the IFIP WG 11.3 International Conference on Data and Applications Security, pp. 139–152. Springer, Berlin (2005)

    Chapter  Google Scholar 

  51. Vaidya, J., Clifton, C., Zhu, M.: Privacy Preserving Data Mining. Springer, Berlin (2005)

    Google Scholar 

  52. Vaidya, J., Kantarcioglu, M., Clifton, C.: Privacy-preserving naïve Bayes classification. VLDB J. 17(4), 879–898 (2008)

    Article  Google Scholar 

  53. Warner, S.: Randomized response: a survey technique for eliminating evasive answer bias. J. Am. Stat. Assoc. 60(309), 63–69 (1965)

    Article  Google Scholar 

  54. Wright, R., Yang, Z.: Privacy-preserving Bayesian network structure computation on distributed heterogeneous data. In: 10th ACM SIGKDD Conference (SIGKDD’04), Seattle, WA, August 2004, pp. 713–718 (2004)

    Google Scholar 

  55. Xiao, X., Bender, G., Hay, M., Gehrke, J.: Ireduct: differential privacy with reduced relative errors. In: Proceedings of the 2011 International Conference on Management of Data, SIGMOD’11, pp. 229–240. ACM, New York (2011). http://doi.acm.org/10.1145/1989323.1989348

    Chapter  Google Scholar 

  56. Xiao, X., Wang, G., Gehrke, J.: Differential privacy via wavelet transforms. IEEE Trans. Knowl. Data Eng. 23, 1200–1214 (2011). doi:10.1109/TKDE.2010.247

    Article  Google Scholar 

  57. Yao, A.C.: How to generate and exchange secrets. In: 27th IEEE Symposium on Foundations of Computer Science, pp. 162–167 (1986)

    Google Scholar 

  58. Yu, H., Jiang, X., Vaidya, J.: Privacy-preserving svm using nonlinear kernels on horizontally partitioned data. In: Proceedings of the 2006 ACM Symposium on Applied Computing, SAC’06, pp. 603–610 (2006)

    Chapter  Google Scholar 

  59. Yu, H., Vaidya, J., Jiang, X.: Privacy-preserving svm classification on vertically partitioned data. In: PAKDD, pp. 647–656 (2006)

    Google Scholar 

  60. Zhu, Y., Liu, L.: Optimal randomization for privacy preserving data mining. In: KDD, pp. 761–766 (2004)

    Chapter  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Madhushri Banerjee.

Additional information

Communicated by Elena Ferrari.

Appendix

Appendix

In the appendix we give brief summary of the SMC protocols we use in this paper.

Secure sum protocol

The secure sum protocol [10] uses homomorphic encryption. More specifically, suppose we want to compute the sum of x i (1≤ir) where x i belongs to party i. Two parties are first randomly selected. Without loss of generality, suppose party 1 and 2 are chosen. Party 1 then generates a pair of public and private keys for a homomorphic encryption scheme. It then sends the public key to the other parties. Each party uses the public key to encrypt x i and then sends the encrypted x i to party 2. Party 2 then computes the encrypted sum without decrypting each individual x i using the property of homomorphic encryption. Party 2 then sends the result back to party 1, which will decrypt the result and distribute the sum to each party.

The secure sum protocol is used in our algorithm for horizontally partitioned case to compute number of elements in class, class mean, global mean, global S b , and global S w , all of which are essentially sum of local shares. The secure sum protocol is also used by our algorithm for vertically partitioned case to compute the result of LDA transform (\(\hat{D}_{t}\)).

The secure sum protocol just needs r encryption operations (r as the number of parties) and 1 decryption operation. So both the communication overhead and computational overhead is O(r).

Secure scalar product

We use the secure scalar product protocol that was first proposed in [24] and later used in [26]. Suppose two parties A and B each has a vector x A and x B. We want to compute the scalar product of x Ax B. The protocol also uses homomorphic encryption. Party A generates a pair of public and private keys and sends the public key to B. Party A then computes encrypted value for x A and sends it to Party B. Party B selects a random value s B and uses the property of homomorphic encryption to compute the encryption of x Ax Bs B and sends it back to Party A. Party A decrypts the result and gets s A =x Ax Bs B . Now each party has a random share (s A and s B ) for the scalar product.

The secure scalar product is used in the secure K-Means clustering [26] algorithm to compute distances. We also use it in our algorithm for the vertically partitioned case to compute S b and S w . Let us first consider S b . Let vector x i (1≤im)be a 1 by c vector where the j-th value x ij is the value of \(\mu_{j}-\bar{x}\) on column i. So the value at i-th row and j-th column of S b is essentially scalar product of x i and x j . Similarly, for S w , suppose y lj is a 1 by |c l | vector where the i-th element is the value of the i-th row in class c l on column j minus the value of μ l on column j. The value at i-th row and j-th column of S w is the sum of a series of c (c as the number of classes) scalar products where each scalar product equals y li y lj (i.e., the scalar product computed in class c l but over the column i and j).

The protocol requires n encryption operations and one decryption operation for a length n vector. Since the encrypted value of a data element may be used in several scalar product computations (e.g., column 1 will be used to compute the scalar product with column 2, 3, …, ), we can reuse the encrypted value. Thus each data element needs to be encrypted at most once. Our algorithm requires O(mn) encryption operations to compute S b and S w in the vertically partitioned case.

Secure comparison protocol

We use the secure comparison protocol to find the closest cluster and the range of each column in our algorithms for horizontally and vertically partitioned cases. The protocol was proposed by Yao at [57].

Secure K-Means clustering

We use the secure K-Means clustering protocol proposed in [26]. This protocol also calls secure sum, secure scalar protocol, and Yao’s secure comparison protocol. The secure K-Means clustering protocol requires O(sngrr k ) encryption operations where n is number of rows in data, s is number of columns after LDA, r is number of parties, g is the number of clusters, and r k is the number of iterations in K-Means clustering.

The overhead of these SMC protocols are dominated by the cost of encryption. In the horizontally partitioned case, our algorithm requires O(mcr+sgrr k ) encryption operations. Note that this is irrelevant to the number of rows in the data set because each party can do most computations locally. So the algorithm is efficient. In the vertically partitioned case, our algorithm requires O(mn+sngrr k ) encryption operations. Encryption can be quite expensive in practice, especially for large data sets under the vertically partitioned case. However, as stated in [24], the current hardware and some optimization tricks (such as reusing encrypted values) can make the computational overhead tolerable.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Banerjee, M., Chen, Z. & Gangopadhyay, A. A generic and distributed privacy preserving classification method with a worst-case privacy guarantee. Distrib Parallel Databases 32, 5–35 (2014). https://doi.org/10.1007/s10619-013-7126-6

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10619-013-7126-6

Keywords

Navigation