DOI: 10.1145/3444370.3444605
research-article

From One-hot Encoding to Privacy-preserving Synthetic Electronic Health Records Embedding

Published: 04 January 2021

Abstract

Categorical encoding, typically one-hot encoding, plays a central role in training machine learning models. This classic approach is the most prevalent strategy because of its simplicity. However, as the number of categories grows, the resulting vectors become high-dimensional and sparse, which makes training infeasible, risks revealing private information, and discards the underlying structure of the data. We propose using intermediate representation learning (embedding) to overcome these limitations. Instead of representing data with a one-hot vector of high cardinality, an embedding is a lower-dimensional dense vector in which each cell can hold any real number, while also capturing the latent hierarchical structure of the features. Sharing embeddings can also be assumed to be safer than releasing raw one-hot encoded data, in which the presence of a particular feature is explicitly marked by 1 and its absence by 0. A Generative Adversarial Network (GAN) further alleviates the leakage of sensitive information by creating synthetic data for modeling. Our results suggest that although embedded features may still pose some privacy risk, deploying a GAN can make a wider variety of medical datasets available, retaining their relative utility while preserving data privacy, an approach that has been identified as promising for medical machine learning and prediction.
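The contrast between a one-hot vector and a dense embedding can be illustrated with a minimal sketch. This is not the paper's implementation: the category count, the embedding width, and the randomly initialised lookup table below are all hypothetical, and in practice the table would be learned rather than sampled.

```python
import numpy as np


def one_hot(indices, num_categories):
    """Return one row per index, with a single 1 marking the category."""
    out = np.zeros((len(indices), num_categories), dtype=np.float32)
    out[np.arange(len(indices)), indices] = 1.0
    return out


def embed(indices, table):
    """Look up one dense row of `table` per category index."""
    return table[np.asarray(indices)]


# Hypothetical sizes: 1000 distinct categorical codes, 16-dimensional embedding.
num_categories = 1000
embedding_dim = 16

# Assumption: a random table stands in for an embedding that would be learned.
rng = np.random.default_rng(0)
table = rng.normal(size=(num_categories, embedding_dim)).astype(np.float32)

codes = [3, 42, 999]                     # illustrative category indices
sparse = one_hot(codes, num_categories)  # shape (3, 1000), mostly zeros
dense = embed(codes, table)              # shape (3, 16), arbitrary real values

print(sparse.shape, dense.shape)
```

An embedding lookup is equivalent to multiplying the one-hot matrix by the table, so the dense form carries the same category identity in far fewer dimensions, and no single cell announces the presence of a feature the way a 1 in a one-hot vector does.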

Cited By

  • (2024) Predicting lncRNA-protein interactions using a hybrid deep learning model with dinucleotide-codon fusion feature encoding. BMC Genomics 25:1. DOI: 10.1186/s12864-024-11168-3. Online publication date: 28-Dec-2024
  • (2021) LGFC-CNN: Prediction of lncRNA-Protein Interactions by Using Multiple Types of Features through Deep Learning. Genes 12:11 (1689). DOI: 10.3390/genes12111689. Online publication date: 24-Oct-2021
  • (2021) Synthetic Differential Privacy Data Generation for Revealing Bias Modelling Risks. 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom) (1574-1580). DOI: 10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00211. Online publication date: Sep-2021

Published In

CIAT 2020: Proceedings of the 2020 International Conference on Cyberspace Innovation of Advanced Technologies
December 2020
597 pages
ISBN:9781450387828
DOI:10.1145/3444370

In-Cooperation

  • Sun Yat-Sen University
  • Carleton University: Institute for Interdisciplinary Studies
  • Beijing University of Posts and Telecommunications
  • Guangdong University of Technology
  • Deakin University

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. electronic health record
  2. embedding
  3. encoding
  4. generative adversarial network
  5. privacy

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

CIAT 2020

Acceptance Rates

CIAT 2020 Paper Acceptance Rate 94 of 232 submissions, 41%;
Overall Acceptance Rate 94 of 232 submissions, 41%
