Synthetic Data Generation for Differential Privacy Using Maximum Weight Matching

Zhang, Miao; Ye, Xinxin; Deng, Hai

doi:10.1007/978-981-97-0798-0_8

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14489)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

Differentially private synthetic data is one of the most effective approaches to privacy-preserving data release. However, existing schemes still suffer from high computational complexity and cannot directly handle attributes with large domain sizes when synthesizing high-dimensional data. To close this gap, we propose synthetic data generation for differential privacy using maximum weight matching (DPMWM), a method for automatically synthesizing high-dimensional tabular data with large domain sizes under differential privacy. Specifically, DPMWM uses differentially private maximum weight matching to select low-dimensional marginals and then automatically synthesizes records based on the selected marginals. Experimental results show that DPMWM outperforms state-of-the-art methods in accuracy for counting queries and classification tasks on datasets with larger domain sizes.
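The preview does not include the selection algorithm itself, so the following is only a generic illustration of what a maximum weight matching over attribute pairs looks like, not the DPMWM procedure. The attribute names, the integer weights, and the function name `max_weight_matching` are all hypothetical, the weights are taken as given (no privacy noise is applied here), and the exhaustive search is an assumption chosen for readability on tiny inputs.

```python
from functools import lru_cache


def max_weight_matching(nodes, weights):
    """Exhaustive maximum weight matching for a small attribute graph.

    nodes   : iterable of attribute names (hypothetical)
    weights : dict mapping frozenset({u, v}) -> weight of the pair (u, v)
    Returns (total_weight, tuple_of_pairs).
    """
    nodes = tuple(nodes)

    @lru_cache(maxsize=None)
    def best(remaining):
        if len(remaining) < 2:
            return 0, ()
        first, rest = remaining[0], remaining[1:]
        # Option 1: leave `first` unmatched.
        best_val, best_pairs = best(rest)
        # Option 2: match `first` with any other remaining attribute.
        for k, other in enumerate(rest):
            w = weights.get(frozenset((first, other)))
            if w is None:
                continue
            val, pairs = best(rest[:k] + rest[k + 1:])
            if w + val > best_val:
                best_val, best_pairs = w + val, ((first, other),) + pairs
        return best_val, best_pairs

    return best(nodes)


# Hypothetical pairwise scores between four attributes.
example_weights = {
    frozenset(("age", "income")): 9,
    frozenset(("age", "zip")): 8,
    frozenset(("age", "job")): 2,
    frozenset(("income", "zip")): 3,
    frozenset(("income", "job")): 8,
    frozenset(("zip", "job")): 1,
}
total, pairs = max_weight_matching(["age", "income", "zip", "job"], example_weights)
```

With these weights, greedily taking the heaviest pair first (`age`, `income`) forces the weak pair (`zip`, `job`), whereas the matching considers all pairs jointly. Each attribute appears in at most one selected pair, which is the defining property of a matching.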

Notes

1.
When $m_i$ is a 1-way marginal, ${\text {dom}}\left( m_i\right) $ denotes the domain size of a single attribute. When $m_i$ is a 2-way marginal, such as $m_i=(V_1,V_2)$, then ${\text {dom}}\left( m_i\right) ={\text {dom}}\left( V_1\right) \cdot {\text {dom}}\left( V_2\right) $.
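As a small illustration of this note, the domain size of a marginal can be computed as the product of the domain sizes of its attributes. The helper name `marginal_domain_size` and the sizes 5 and 12 are assumptions made for the example only.

```python
from math import prod


def marginal_domain_size(marginal, domain_sizes):
    """Return dom(m) as the product of the attribute domain sizes in m."""
    return prod(domain_sizes[attr] for attr in marginal)


# Hypothetical attribute domain sizes.
domain_sizes = {"V1": 5, "V2": 12}

one_way = marginal_domain_size(("V1",), domain_sizes)       # dom(V1)
two_way = marginal_domain_size(("V1", "V2"), domain_sizes)  # dom(V1) * dom(V2)
```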

Author information

Authors and Affiliations

College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, 211106, China
Miao Zhang & Xinxin Ye
Department of Electrical and Computer Engineering, Florida International University, Miami, FL, 33174, USA
Hai Deng

Corresponding author

Correspondence to Xinxin Ye.

Editor information

Editors and Affiliations

Royal Melbourne Institute of Technology, Melbourne, VIC, Australia
Zahir Tari
Tianjin University, Tianjin, China
Keqiu Li
University of Arizona, Tucson, AZ, USA
Hongyi Wu

A Proof of Lemma

Proof

Assume a dataset $D$ contains $n$ records, and consider two attributes $a$ and $b$. Denote the frequencies of the values of attribute $a$ (as counts, so that $\sum _i a_i=n$) by $\left\{ a_1,a_2,\ldots \right\} $ and those of $b$ by $\left\{ b_1,b_2,\ldots \right\} $. For the 2-way marginal $(a, b)$, denote the frequencies of the joint distribution by $\left\{ c_{11},c_{12},\ldots \right\} $.

The metric w(a, b) is

$$\begin{aligned} w(a, b)=\frac{1}{2} \sum _{i j}\left| \frac{a_i b_j}{n}-c_{i j}\right| \end{aligned}$$
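To make the metric concrete, here is a minimal sketch that evaluates $w(a, b)$ directly from a list of records, treating $a_i$, $b_j$ and $c_{ij}$ as counts as in the definition above. The function name `independence_gap` and the two toy datasets are assumptions for illustration.

```python
from collections import Counter
from itertools import product


def independence_gap(records):
    """Compute w(a, b) = 1/2 * sum_ij |a_i * b_j / n - c_ij| from (a, b) records."""
    n = len(records)
    a = Counter(x for x, _ in records)  # counts a_i of each value of attribute a
    b = Counter(y for _, y in records)  # counts b_j of each value of attribute b
    c = Counter(records)                # joint counts c_ij (0 for unseen pairs)
    return 0.5 * sum(abs(a[i] * b[j] / n - c[(i, j)]) for i, j in product(a, b))


# The attributes are independent here: every joint cell equals a_i * b_j / n.
w_independent = independence_gap([(0, 0), (0, 1), (1, 0), (1, 1)])

# The attributes are perfectly correlated here.
w_dependent = independence_gap([(0, 0), (0, 0), (1, 1), (1, 1)])
```

The metric is zero when the joint counts match the product of the marginals and grows as the two attributes become more dependent.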

If we add one user whose value for $a$ is $u$ and whose value for $b$ is $v$, the dataset contains $n+1$ records and the metric becomes

$$ \begin{aligned} w^{\prime }(a, b) & =\frac{1}{2} \sum _{i \ne u, j \ne v}\left| \frac{a_{i} b_{j}}{n+1}-c_{i j}\right| \\ & +\frac{1}{2} \sum _{i \ne u}\left| \frac{a_{i}\left( b_{v}+1\right) }{n+1}-c_{i v}\right| \\ & +\frac{1}{2} \sum _{j \ne v}\left| \frac{\left( a_{u}+1\right) b_{j}}{n+1}-c_{u j}\right| \\ & +\frac{1}{2}\left| \frac{\left( a_{u}+1\right) \left( b_{v}+1\right) }{n+1}-\left( c_{u v}+1\right) \right| \end{aligned} $$

The sensitivity is given by

$$\begin{aligned} \varDelta _{w} & =\left| w(a, b)-w^{\prime }(a, b)\right| \\ & \le \frac{1}{2} \sum _{i \ne u, j \ne v}\left| \frac{a_{i} b_{j}}{n(n+1)}\right| +\frac{1}{2} \sum _{i \ne u}\left| \frac{a_{i} b_{v}}{n(n+1)}-\frac{a_{i}}{n+1}\right| \\ & +\frac{1}{2} \sum _{j \ne v}\left| \frac{a_{u}b_j}{n(n+1)}-\frac{b_{j}}{n+1}\right| +\frac{1}{2}\left| \frac{(n+1)a_{u} b_{v}-n\left( a_{u}+1\right) (b_{v}+1)+n(n+1)}{n(n+1)}\right| \\ & =\frac{\frac{1}{2} \sum _{i \ne u,j \ne v} a_{i} b_{j}-\frac{1}{2} \sum _{i \ne u}\left( a_{i} b_{v}-na_{i}\right) -\frac{1}{2} \sum _{j \ne v}\left( a_{u} b_{j}-n b_{j}\right) }{n(n+1)} \\ & + \frac{(n+1) a_{u} b_{v}-n\left( a_{u}+1\right) \left( b_{v}+1\right) +n(n+1)}{2 n(n+1)} \\ & =\frac{(n-a_u)(n-b_v)-(n-a_u)(b_v-n)-(a_u-n)(n-b_v)+(n-a_u)(n-b_v)}{2 n(n+1)} \\ & =\frac{4(n-a_u) \cdot (n-b_v)}{2 n(n+1)} \\ & =\frac{2\left( n-a_{u}\right) \left( n-b_{v}\right) }{n(n+1)} \le 2 \end{aligned} \tag{8}$$

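The derivation above uses $\sum _{i \ne u} a_i=n-a_u$, $\sum _{j \ne v}b_j=n-b_v$ and $\sum _{i \ne u,j \ne v} a_{i} b_{j}=(n-a_u)(n-b_v)$. The absolute values can be dropped because $0 \le a_u \le n$ and $0 \le b_v \le n$, and the final bound follows from $(n-a_u)(n-b_v) \le n^2 < n(n+1)$.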
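The bound can be sanity-checked numerically. The sketch below, which uses exact rational arithmetic, adds each possible record $(u, v)$ to a small toy dataset and compares the observed change in $w$ with $\frac{2(n-a_u)(n-b_v)}{n(n+1)}$. The dataset, the fixed domain `dom`, and the helper names `w_metric` and `sensitivity_bound` are assumptions made for this check. It illustrates the lemma on one example and is not a substitute for the proof.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def w_metric(records, dom_a, dom_b):
    """Exact w(a, b) = 1/2 * sum_ij |a_i * b_j / n - c_ij| over a fixed domain."""
    n = len(records)
    a = Counter(r[0] for r in records)
    b = Counter(r[1] for r in records)
    c = Counter(records)
    total = sum(
        abs(Fraction(a[i] * b[j], n) - c[(i, j)])
        for i, j in product(dom_a, dom_b)
    )
    return total / 2


def sensitivity_bound(records, u, v):
    """Right-hand side of the bound: 2 (n - a_u)(n - b_v) / (n (n + 1))."""
    n = len(records)
    a_u = sum(1 for r in records if r[0] == u)
    b_v = sum(1 for r in records if r[1] == v)
    return Fraction(2 * (n - a_u) * (n - b_v), n * (n + 1))


# Toy dataset of (a, b) records over the domain {0, 1, 2} for both attributes.
records = [(0, 0), (0, 1), (1, 1), (2, 0), (2, 2)]
dom = (0, 1, 2)

checks = []
for u, v in product(dom, dom):
    before = w_metric(records, dom, dom)
    after = w_metric(records + [(u, v)], dom, dom)
    checks.append((abs(after - before), sensitivity_bound(records, u, v)))

# The observed change never exceeds the bound, and the bound never exceeds 2.
all_hold = all(change <= limit <= 2 for change, limit in checks)
```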

About this paper

Cite this paper

Zhang, M., Ye, X., Deng, H. (2024). Synthetic Data Generation for Differential Privacy Using Maximum Weight Matching. In: Tari, Z., Li, K., Wu, H. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2023. Lecture Notes in Computer Science, vol 14489. Springer, Singapore. https://doi.org/10.1007/978-981-97-0798-0_8

DOI: https://doi.org/10.1007/978-981-97-0798-0_8
Published: 01 March 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-0797-3
Online ISBN: 978-981-97-0798-0
eBook Packages: Computer Science, Computer Science (R0)
