Abstract
Preserving data confidentiality is crucial when releasing microdata for public use. Many of the proposed approaches are grounded in traditional probability theory and statistics, and they focus mainly on masking the original data. In practice, such masking techniques obscure only part of the data and risk exposing the remaining sensitive records. In this paper, we instead approach the problem with a deep learning-based generative model that produces simulated data to stand in for the original data. Generating simulated data that preserves the statistical characteristics of the raw data is both the key idea and the main challenge of this study. In particular, we examine how statistically similar the generated data are to the raw data, under the requirement that the two remain practically indistinguishable. We evaluate our results with two statistical metrics, Absolute Relative Residual Values and Hellinger Distance, and conduct extensive experiments on two real-world datasets, the Census Dataset and the Environmental Dataset, to validate our approach.
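The two evaluation metrics named in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration using the standard textbook definitions; the function names, and the choice of comparing paired summary statistics (e.g. per-column means), are assumptions for the example, not the paper's exact implementation:

```python
import numpy as np

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions.

    Assumes p and q are histograms over the same bins; both are
    normalized to sum to 1 before comparison. Result lies in [0, 1].
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def absolute_relative_residual(raw, generated):
    """Mean absolute relative residual between paired statistics,
    e.g. per-column means of the raw table vs. the simulated table.
    Assumes the raw statistics are nonzero."""
    raw = np.asarray(raw, dtype=float)
    generated = np.asarray(generated, dtype=float)
    return np.mean(np.abs((raw - generated) / raw))
```

A Hellinger distance near 0 (and a small residual) would indicate that the simulated data reproduces the marginal distributions of the raw data, which is the similarity the paper sets out to measure.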











References
Rubin DB (1993) Discussion: Statistical disclosure limitation. J Off Stat 9
Min CL, Mitra R, Lazaridis E, An CL, Yong KG, Yap WS (2016) Data privacy preserving scheme using generalized linear models. Computers & Security
Gurjar SPS, Pasupuleti SK (2017) A privacy-preserving multi-keyword ranked search scheme over encrypted cloud data using MIR-tree. In: International conference on computing, analytics and security trends, pp 533–538
Andruszkiewicz P (2007) Optimization for mask scheme in privacy preserving data mining for association rules. In: International conference on rough sets and intelligent systems paradigms, pp 465–474
Willenborg L, De Waal T (2001) Elements of statistical disclosure control. Springer
Fienberg SE, McIntyre J (2004) Data swapping: variations on a theme by Dalenius and Reiss. In: International workshop on privacy in statistical databases, pp 14–29
Fuller WA (1993) Masking procedures for microdata disclosure limitation. J Off Stat 9(2)
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
Bengio Y (2009) Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1):1–127
Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: International conference on neural information processing systems, pp 2672–2680
Arjovsky M, Chintala S, Bottou L (2017) Wasserstein GAN. arXiv preprint arXiv:1701.07875
Barrow NJ, Campbell NA (1972) Methods of measuring residual value of fertilizers. Australian Journal of Experimental Agriculture 12(58):502–510
Simpson DG (1987) Minimum hellinger distance estimation for the analysis of count data. J Am Stat Assoc 82(399):802–807
Rubin DB (2009) Statistical Disclosure Limitation. Springer US
Li N, Li T, Venkatasubramanian S (2007) T-closeness: privacy beyond k-anonymity and l-diversity. In: Data engineering, 2007. ICDE 2007. IEEE 23rd international conference on. IEEE, pp 106–115
Dwork C (2008) Differential privacy: a survey of results. In: International conference on theory and applications of models of computation. Springer, pp 1–19
Van Tilborg HCA, Jajodia S (2014) Encyclopedia of cryptography and security. Springer Science & Business Media
Yang W, Li T, Jia H (2004) Simulation and experiment of machine vision guidance of agriculture vehicles. Transactions of the Chinese Society of Agricultural Engineering
Dormand JR, Prince PJ (1978) New runge-kutta algorithms for numerical simulation in dynamical astronomy. Celest Mech 18(3):223–232
Stukowski A (2010) Visualization and analysis of atomistic simulation data with OVITO - the Open Visualization Tool. Modelling and Simulation in Materials Science and Engineering 18(1):015012
Devia N, Weber R (2013) Generating crime data using agent-based simulation. Comput Environ Urban Syst 42(7):26–41
Phillips A, Cardelli L (2007) Efficient, correct simulation of biological processes in the stochastic pi-calculus. In: International conference on computational methods in systems biology, pp 184–199
Roe C, Meliopoulos AP, Meisel J, Overbye T (2008) Power system level impacts of plug-in hybrid electric vehicles using simulation data. In: IEEE Energy 2030 conference, pp 1–6
Radford A, Metz L, Chintala S (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434
Mirza M, Osindero S (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784
Mao X, Li Q, Xie H, Lau RYK, Wang Z, Smolley SP (2016) Least squares generative adversarial networks. arXiv preprint arXiv:1611.04076
Dai F, Zhang D, Li J (2013) Encoder/decoder for privacy protection video with privacy region detection and scrambling. In: International conference on multimedia modeling. Springer, pp 525–527
Psychoula I, Merdivan E, Singh D, Chen L, Chen F, Hanke S, Kropf J, Holzinger A, Geist M (2018) A deep learning approach for privacy preservation in assisted living. arXiv preprint arXiv:1802.09359
Makhzani A, Shlens J, Jaitly N, Goodfellow I (2016) Adversarial autoencoders. In: ICLR
Daskalakis C, Goldberg PW, Papadimitriou CH (2009) The complexity of computing a Nash equilibrium. ACM
Xu B, Wang N, Chen T, Li M (2015) Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958
Kingma DP, Welling M (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114
Chen X, Kingma DP, Salimans T, Duan Y, Dhariwal P, Schulman J, Sutskever I, Abbeel P (2016) Variational lossy autoencoder. arXiv preprint arXiv:1611.02731
Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning, pp 448–456
Hardt M, Recht B, Singer Y (2015) Train faster, generalize better: stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240
Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2017) The marginal value of adaptive gradient methods in machine learning. In: Advances in neural information processing systems
Pizer SM, Amburn EP, Austin JD, Cromartie R, Geselowitz A, Greer T, ter Haar Romeny B, Zimmerman JB, Zuiderveld K (1987) Adaptive histogram equalization and its variations. Computer Vision, Graphics, and Image Processing 39(3):355–368
Nowozin S, Cseke B, Tomioka R (2016) f-GAN: training generative neural samplers using variational divergence minimization. In: Advances in neural information processing systems, pp 271–279
Bordes A, Bottou L, Gallinari P (2009) SGD-QN: careful quasi-Newton stochastic gradient descent. J Mach Learn Res 10:1737–1754
Tieleman T, Hinton G (2012) RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning
Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12:2121–2159
Botev A, Lever G, Barber D (2016) Nesterov's accelerated gradient and momentum as approximations to regularised update descent
Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC (2017) Improved training of wasserstein gans. In: Advances in neural information processing systems, pp 5767–5777
Acknowledgments
The authors would like to acknowledge the support provided by the National Key R&D Program of China (No.2018YFC1604000).
Cite this article
Li, W., Meng, P., Hong, Y. et al. Using deep learning to preserve data confidentiality. Appl Intell 50, 341–353 (2020). https://doi.org/10.1007/s10489-019-01515-3