Abstract:
The use of synthetic data is a widely acknowledged privacy-preserving measure that reduces identity and attribute disclosure risks in micro-data. The idea is to learn the statistical properties of an original dataset, store this information in a model, and then use this model to generate artificial samples and build a synthetic dataset that resembles the original. One of the many approaches to data synthesis describes the original dataset with a Bayesian network. This method is implemented in the open-source tool DataSynthesizer and has proven particularly suitable for datasets with a small to moderate number of attributes. In this paper, we replace the greedy algorithm used for learning the Bayesian network with a substantially faster genetic algorithm. In addition, we aim to protect particularly sensitive attributes by decreasing specific correlations in the synthetic data that may reveal personal information, and we show how to customize the network structures for specific machine learning tasks. Our experiments demonstrate that this technique further decreases disclosure risks and hence adds to the applicability of synthetic data as a technique for privacy preservation.
Date of Conference: 17-20 December 2022
Date Added to IEEE Xplore: 26 January 2023
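To make the approach described in the abstract concrete, here is a minimal sketch of how a genetic algorithm could search for a Bayesian network structure over a dataset's attributes. Everything in it is an assumption for illustration, not the authors' algorithm: the permutation genome (each attribute's parents are the up-to-k attributes preceding it in the order, loosely mirroring the k-degree networks DataSynthesizer builds), the mutual-information fitness, the order crossover, and all identifiers such as genetic_bayes and fitness are hypothetical.

```python
import random
from collections import Counter
from math import log2

def mutual_information(data, x, y):
    """Empirical mutual information between attributes x and y,
    where data is a list of dicts (one dict per record)."""
    n = len(data)
    px = Counter(row[x] for row in data)
    py = Counter(row[y] for row in data)
    pxy = Counter((row[x], row[y]) for row in data)
    return sum((c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def fitness(order, data, k):
    """Score a topological order: each attribute's parents are the
    (at most k) attributes that immediately precede it in the order."""
    return sum(mutual_information(data, child, parent)
               for i, child in enumerate(order)
               for parent in order[max(0, i - k):i])

def crossover(a, b):
    """Order crossover (OX): keep a slice of parent a, then fill the
    remaining positions in the relative order of parent b."""
    i, j = sorted(random.sample(range(len(a)), 2))
    middle = a[i:j]
    rest = [g for g in b if g not in middle]
    return rest[:i] + middle + rest[i:]

def mutate(order, rate=0.2):
    """Occasionally swap two attributes in the order."""
    order = list(order)
    if random.random() < rate:
        i, j = random.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
    return order

def genetic_bayes(data, attributes, k=2, pop_size=30, generations=50):
    """Evolve a population of attribute orderings and return the best
    network as a list of (child, parents) pairs."""
    pop = [random.sample(attributes, len(attributes)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda o: fitness(o, data, k), reverse=True)
        elite = pop[:pop_size // 2]
        children = [mutate(crossover(*random.sample(elite, 2)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    best = max(pop, key=lambda o: fitness(o, data, k))
    return [(a, tuple(best[max(0, i - k):i])) for i, a in enumerate(best)]
```

One design point worth noting: encoding a candidate as an attribute ordering and drawing each node's parents only from its predecessors guarantees that every candidate network is acyclic, so the genetic operators never need an explicit cycle check.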