From One-hot Encoding to Privacy-preserving Synthetic Electronic Health Records Embedding

Authors:

Chuanyi Liu

CIAT 2020: Proceedings of the 2020 International Conference on Cyberspace Innovation of Advanced Technologies

Pages 407 - 413

https://doi.org/10.1145/3444370.3444605

Published: 04 January 2021

Abstract

Categorical encoding, typically one-hot encoding, plays a central role in training machine learning models. This classic approach is the most prevalent strategy because of its simplicity. However, as the number of categories grows, one-hot encoding produces high-dimensional, sparse vectors that make training infeasible, risk revealing private information, and break the underlying structure of the data. We propose learning an intermediate representation of the data (an embedding) to overcome these limitations. Instead of representing data with a high-cardinality one-hot vector, an embedding is a lower-dimensional dense vector in which each cell can contain any number, while also capturing the latent hierarchical structure of the features. Sharing embeddings can also be assumed to be safer than releasing raw one-hot encoded data, in which the presence of a particular feature is marked explicitly by the value 1 and its absence by 0. A Generative Adversarial Network (GAN) further alleviates the leakage of sensitive information by creating synthetic data for modeling. Our results suggest that, although embedded features may still pose some privacy risk, deploying a GAN can make a wider variety of medical datasets available, retaining their relative utility while preserving data privacy, which makes it a promising method for medical machine learning and prediction.
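The contrast the abstract draws between a one-hot vector and a dense embedding can be illustrated with a minimal sketch. This is not the paper's implementation: the category labels, the embedding dimension of 2, and the randomly initialized table (which would be learned in practice) are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical categorical feature with five made-up category labels.
categories = ["A", "B", "C", "D", "E"]
index = {c: i for i, c in enumerate(categories)}


def one_hot(category):
    """Sparse indicator vector: 1 at the category's position, 0 elsewhere."""
    v = np.zeros(len(categories), dtype=np.float32)
    v[index[category]] = 1.0
    return v


# Assumed embedding table: one dense row per category.
# Random values stand in for weights that would normally be learned.
rng = np.random.default_rng(0)
embedding_dim = 2
table = rng.normal(size=(len(categories), embedding_dim)).astype(np.float32)


def embed(category):
    """Dense, lower-dimensional vector for the category (a row lookup)."""
    return table[index[category]]


x = one_hot("C")   # length 5, exactly one non-zero entry
e = embed("C")     # length 2, arbitrary real values

# An embedding lookup is equivalent to multiplying the one-hot vector
# by the embedding table.
same = np.allclose(x @ table, e)
```

In the one-hot vector, the position of the 1 directly names the category, whereas the dense vector only identifies it relative to the table. This sketch shows the encoding difference only and makes no claim about the privacy guarantees of either form.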

Cited By

Tan L, Mengshan L, Yu F, Yelin L, Jihong Z, Lixin G (2024). Predicting lncRNA-protein interactions using a hybrid deep learning model with dinucleotide-codon fusion feature encoding. BMC Genomics 25:1. Online publication date: 28-Dec-2024. https://doi.org/10.1186/s12864-024-11168-3

Huang L, Jiao S, Yang S, Zhang S, Zhu X, Guo R, Wang Y (2021). LGFC-CNN: Prediction of lncRNA-Protein Interactions by Using Multiple Types of Features through Deep Learning. Genes 12:11 (1689). Online publication date: 24-Oct-2021. https://doi.org/10.3390/genes12111689

Wilchek M, Wang Y (2021). Synthetic Differential Privacy Data Generation for Revealing Bias Modelling Risks. 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), pp. 1574-1580. Online publication date: Sep-2021. https://doi.org/10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00211

Index Terms

From One-hot Encoding to Privacy-preserving Synthetic Electronic Health Records Embedding
1. Applied computing
  1. Life and medical sciences
    1. Health informatics
2. Security and privacy
  1. Human and societal aspects of security and privacy
    1. Usability in security and privacy

Published In

CIAT 2020: Proceedings of the 2020 International Conference on Cyberspace Innovation of Advanced Technologies

December 2020

597 pages

ISBN:9781450387828

DOI:10.1145/3444370

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

Sun Yat-Sen University
CARLETON UNIVERSITY: INSTITUTE FOR INTERDISCIPLINARY STUDIES
Beijing University of Posts and Telecommunications
Guangdong University of Technology
Deakin University

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

CIAT 2020

CIAT 2020: 2020 International Conference on Cyberspace Innovation of Advanced Technologies

December 4 - 6, 2020

Guangzhou, China

Acceptance Rates

CIAT 2020 Paper Acceptance Rate 94 of 232 submissions, 41%;

Overall Acceptance Rate 94 of 232 submissions, 41%
