A distributed approach to enabling privacy-preserving model-based classifier training

  • Regular Paper
Knowledge and Information Systems

Abstract

This paper proposes a novel approach to privacy-preserving, distributed, model-based classifier training, as a step towards customizable privacy modeling and protection. The approach consists of three major steps. First, each data site independently learns a weak concept model (i.e., a local classifier) for a given data pattern or concept from its own training samples. An adaptive EM algorithm is proposed to select the model structure and estimate the model parameters simultaneously. Second, a combined classifier is trained by integrating the weak concept models shared by multiple data sites. To reduce data transmission costs and the risk of privacy breaches, only the weak concept models are sent to the central site, where synthetic samples are generated directly from them. Both the shared weak concept models and the synthetic samples are then used to learn a reliable and complete global concept model. A computational approach is developed to automatically achieve a good trade-off between the privacy disclosure risk, the sharing benefit and the data utility. Third, the combined classifier is validated by distributing the global concept model to all data sites in the collaboration network while limiting potential privacy breaches. The approach has been validated through extensive experiments on four UCI machine learning data sets and two image data sets.
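The three steps in the abstract can be illustrated with a small sketch. It is not the authors' method and rests on several simplifying assumptions: features are one-dimensional, each local model is a two-component Gaussian mixture fitted by plain EM with a fixed number of components (the paper's adaptive EM also selects the model structure), the central site refits on synthetic samples only (the paper also incorporates the shared models themselves), and the trade-off between disclosure risk, sharing benefit and data utility is not modeled. All function names and the toy data are hypothetical.

```python
import math
import random

def gauss_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(xs, k=2, iters=50, seed=0):
    """Plain EM for a 1-D Gaussian mixture with a FIXED number of components k
    (an assumption; the paper's adaptive EM chooses the structure itself)."""
    rng = random.Random(seed)
    means = rng.sample(xs, k)  # initialize means at k distinct data points
    mu = sum(xs) / len(xs)
    var0 = sum((x - mu) ** 2 for x in xs) / len(xs) + 1e-6
    variances = [var0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each sample
        resp = []
        for x in xs:
            p = [w * gauss_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pi / total for pi in p])
        # M-step: re-estimate weights, means and variances
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2
                               for r, x in zip(resp, xs)) / nj + 1e-6
    return list(zip(weights, means, variances))

def sample_gmm(model, n, rng):
    """Draw n synthetic samples from a mixture given as (weight, mean, var) triples."""
    weights = [w for w, _, _ in model]
    out = []
    for _ in range(n):
        _, m, v = rng.choices(model, weights=weights)[0]
        out.append(rng.gauss(m, math.sqrt(v)))
    return out

def density(model, x):
    return sum(w * gauss_pdf(x, m, v) for w, m, v in model)

def train_site(samples_by_class):
    """Step 1: a site fits one mixture per class; only these parameters leave the site."""
    return {label: fit_gmm_1d(xs) for label, xs in samples_by_class.items()}

def train_global(site_models, n_synth=200, seed=1):
    """Step 2: the central site samples from the shared models and refits per class."""
    rng = random.Random(seed)
    pooled = {}
    for models in site_models:
        for label, model in models.items():
            pooled.setdefault(label, []).extend(sample_gmm(model, n_synth, rng))
    return {label: fit_gmm_1d(xs) for label, xs in pooled.items()}

def classify(global_model, x):
    """Assign x to the class whose global mixture gives it the highest density."""
    return max(global_model, key=lambda label: density(global_model[label], x))

def validate(global_model, samples_by_class):
    """Step 3: each site scores the returned global model on its own private data."""
    pairs = [(x, label) for label, xs in samples_by_class.items() for x in xs]
    return sum(classify(global_model, x) == y for x, y in pairs) / len(pairs)

# Toy data for two sites; raw samples never leave train_site or validate.
rng = random.Random(42)
site_a = {"neg": [rng.gauss(-3, 1) for _ in range(100)],
          "pos": [rng.gauss(3, 1) for _ in range(100)]}
site_b = {"neg": [rng.gauss(-4, 1) for _ in range(100)],
          "pos": [rng.gauss(4, 1) for _ in range(100)]}

global_model = train_global([train_site(site_a), train_site(site_b)])
accuracy = [validate(global_model, site) for site in (site_a, site_b)]
```

In this sketch the only objects crossing site boundaries are the mixture parameters (upwards) and the global model (downwards), which mirrors the data flow described in the abstract; whether shared parameters themselves leak private information is the question the paper's risk/utility trade-off addresses and is outside the scope of the example.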

Author information

Corresponding author

Correspondence to Jianping Fan.

Additional information

This project is supported by the National Science Foundation under 0208539-IIS and 0601542-IIS, by grants from the AO Foundation and CERIAS, by the Shanghai Pujiang Program under 08PJ1404600, by the National Natural Science Foundation of China under 60496325, and by the National Hi-tech R&D Program of China under 2006AA010111.

About this article

Cite this article

Luo, H., Fan, J., Lin, X. et al. A distributed approach to enabling privacy-preserving model-based classifier training. Knowl Inf Syst 20, 157–185 (2009). https://doi.org/10.1007/s10115-008-0167-x

  • DOI: https://doi.org/10.1007/s10115-008-0167-x
