Authors:
Takayuki Miura 1; Eizen Kimura 2; Atsunori Ichikawa 1; Masanobu Kii 1 and Juko Yamamoto 1
Affiliations:
1 NTT Social Informatics Laboratories, Tokyo, Japan; 2 Dept. Medical Informatics, Medical School of Ehime Univ., Ehime, Japan
Keyword(s):
Synthetic Data Generation, Differential Privacy, Real-World Data.
Abstract:
Expectations are high for the use of real-world data in medical and healthcare analysis, yet handling sensitive data demands ethical review and safety management, which creates bottlenecks that slow research. Consequently, numerous techniques have emerged for generating synthetic data that preserves the features of the original data. Nonetheless, the quality of such synthetic data, particularly in the context of real-world data, has yet to be sufficiently examined. In this paper, we conduct experiments with a Diagnosis Procedure Combination (DPC) dataset to evaluate the quality of synthetic data generated by statistics-based, graphical model-based, and deep neural network-based methods. Further, we implement differential privacy for theoretical privacy protection and assess the resultant degradation of data quality. The findings indicate that a statistics-based method called Gaussian Copula and a graphical model-based method called AIM yield high-quality synthetic data in terms of statistical similarity and machine learning model performance. The paper also summarizes issues, derived from the experimental results, that are pertinent to the practical application of synthetic data.