
Using deep learning to preserve data confidentiality


Abstract

Preserving data confidentiality is crucial when releasing microdata for public use. Many proposed approaches are based on traditional probability theory and statistics, and they focus mainly on masking the original data. In practice, these masking techniques, despite obscuring part of the data, risk leaving sensitive data exposed. In this paper, we approach the problem with a deep learning-based generative model that produces simulated data to stand in for the original data. Generating simulated data that preserves the statistical characteristics of the raw data is the key idea, and also the main challenge, of this study. In particular, we examine the statistical similarity between the raw data and the generated data, given that the two are not obviously distinguishable. We evaluate our results with two statistical metrics: Absolute Relative Residual Values and Hellinger Distance. Extensive experiments on two real-world datasets, the Census Dataset and the Environmental Dataset, validate our approach.
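The two evaluation metrics named above admit compact definitions. The sketch below is one reading of the standard formulas (a normalized Hellinger distance between two discrete distributions, and the element-wise absolute relative residual between paired summary statistics); the function names are illustrative and not taken from the authors' code.

```python
import math

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions.

    Inputs are non-negative weight vectors of equal length; each is
    normalized to sum to 1 before comparison. The result lies in [0, 1]:
    0 for identical distributions, 1 for distributions with disjoint support.
    """
    sp, sq = sum(p), sum(q)
    return math.sqrt(0.5 * sum(
        (math.sqrt(pi / sp) - math.sqrt(qi / sq)) ** 2
        for pi, qi in zip(p, q)
    ))

def absolute_relative_residual(raw, generated):
    """Element-wise |raw - generated| / |raw| for paired statistics.

    Small values indicate that a statistic computed on the generated
    data closely matches the same statistic on the raw data.
    """
    return [abs(r - g) / abs(r) for r, g in zip(raw, generated)]

# Identical distributions give distance 0; disjoint ones give 1.
print(hellinger_distance([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(hellinger_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

In a masking pipeline, these would be applied to histograms (Hellinger) and to summary statistics such as column means (relative residuals) of the raw versus generated tables.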




Acknowledgments

The authors would like to acknowledge the support provided by the National Key R&D Program of China (No.2018YFC1604000).

Author information

Corresponding author

Correspondence to Xiaohui Cui.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Li, W., Meng, P., Hong, Y. et al. Using deep learning to preserve data confidentiality. Appl Intell 50, 341–353 (2020). https://doi.org/10.1007/s10489-019-01515-3

