Data Anonymization for Privacy Aware Machine Learning

Jaidan, David Nizar; Carrere, Maxime; Chemli, Zakaria; Poisvert, Rémi

doi:10.1007/978-3-030-37599-7_60

Conference paper
First Online: 03 January 2020

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11943)

Abstract

The rise in data leaks, attacks, and ransomware over the last few years has raised concerns about data security and privacy. This has negatively affected the sharing and publication of data. To address these limitations, innovative techniques are needed to protect data, especially when it is used in machine-learning-based data models. In this context, differential privacy is one of the most effective approaches to preserving privacy. However, the scope of differential privacy applications is very limited (e.g., to numerical and structured data). Therefore, in this study, we investigate the behavior of differential privacy applied to textual data and time series. The proposed approach was evaluated by comparing two differential privacy algorithms based on Principal Component Analysis. Its effectiveness was demonstrated by applying three machine learning models to both anonymized and original data. Their performance was thoroughly evaluated in terms of confidentiality, utility, scalability, and computational efficiency. The PPCA method provides high anonymization quality at the expense of long computation time, while the DPCA method preserves more utility and computes faster. We show that a neural network text representation approach can be combined with differential privacy methods. We also show that it is well within reach to anonymize real-world measurement data from satellite sensors for an anomaly detection task. We believe that our study will significantly motivate the use of differential privacy techniques, which can lead to more data sharing and better privacy preservation.
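
To make the idea of a PCA-based differential privacy algorithm concrete, the following is a minimal sketch of one well-known variant: adding symmetric Gaussian noise to the second-moment matrix before extracting principal components. It is not the PPCA or DPCA method evaluated in the paper, whose details are not given in this abstract. The function name, the row-clipping step, and the noise scale (the textbook Gaussian-mechanism calibration for rows with L2 norm at most 1) are assumptions made for illustration only.

```python
import numpy as np


def noisy_principal_components(X, k, epsilon, delta, rng):
    """Illustrative input-perturbation PCA (hypothetical helper, not the paper's method).

    Assumptions: neighbouring datasets differ in one row, every row is
    clipped to L2 norm <= 1, and the noise scale uses the standard
    Gaussian-mechanism calibration (usually stated for epsilon < 1).
    """
    X = np.asarray(X, dtype=float)

    # Clip rows so that a single record changes X^T X by at most 1 in Frobenius norm.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.maximum(norms, 1.0)

    d = X.shape[1]
    second_moment = X.T @ X  # no mean-centering in this sketch

    # Symmetric Gaussian noise: draw the upper triangle, mirror it below the diagonal.
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    upper = np.triu(rng.normal(0.0, sigma, size=(d, d)))
    noise = upper + np.triu(upper, 1).T

    # Eigen-decomposition of the noisy matrix; keep the k leading eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(second_moment + noise)
    order = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, order]


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
components = noisy_principal_components(X, k=2, epsilon=0.5, delta=1e-5, rng=rng)
```

In this sketch only the returned components are protected by the added noise. Projecting the raw records onto them and releasing the projections would need its own privacy analysis.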

Acknowledgments

This work is supported by Scalian.

Author information

Authors and Affiliations

Innovation L@B Scalian France, Labège, France
David Nizar Jaidan
Centre d’Excellence Datascale Scalian France, Le Haillan, France
Maxime Carrere
Innovation L@B Scalian France, Paris, France
Zakaria Chemli
Innovation L@B Scalian France, Rennes, France
Rémi Poisvert

Corresponding author

Correspondence to David Nizar Jaidan.

Editor information

Editors and Affiliations

University of Cambridge, Cambridge, UK
Giuseppe Nicosia
University of Florida, Gainesville, FL, USA
Panos Pardalos
Harvard University, Cambridge, MA, USA
Renato Umeton
Università di Catania, Catania, Italy
Giovanni Giuffrida
Almawave, Rome, Italy
Vincenzo Sciacca

Cite this paper

Jaidan, D.N., Carrere, M., Chemli, Z., Poisvert, R. (2019). Data Anonymization for Privacy Aware Machine Learning. In: Nicosia, G., Pardalos, P., Umeton, R., Giuffrida, G., Sciacca, V. (eds) Machine Learning, Optimization, and Data Science. LOD 2019. Lecture Notes in Computer Science, vol 11943. Springer, Cham. https://doi.org/10.1007/978-3-030-37599-7_60

DOI: https://doi.org/10.1007/978-3-030-37599-7_60
Published: 03 January 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-37598-0
Online ISBN: 978-3-030-37599-7
eBook Packages: Computer Science, Computer Science (R0)