Synthetic Data Generation for Differential Privacy Using Maximum Weight Matching

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2023)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14489))

Abstract

Differentially private synthetic data is one of the most effective approaches to privacy-preserving data release. However, existing schemes still suffer from high computational complexity and cannot directly handle attributes with large domain sizes when synthesizing high-dimensional data. To address this gap, we propose synthetic data generation for differential privacy using maximum weight matching (DPMWM), a method for automatically synthesizing high-dimensional tabular data with large domain sizes under differential privacy. Specifically, DPMWM uses differentially private maximum weight matching to select low-dimensional marginals and then automatically synthesizes multiple records from the selected marginals. Experimental results show that DPMWM outperforms state-of-the-art methods in accuracy on counting queries and classification tasks on datasets with larger domain sizes.

Notes

  1. When \(m_i\) is a 1-way marginal, \({\text {dom}}\left( m_i\right) \) is the domain size of the single attribute. When \(m_i\) is a 2-way marginal, such as \(m_i=(V_1,V_2)\), \({\text {dom}}\left( m_i\right) ={\text {dom}}\left( V_1\right) \cdot {\text {dom}}\left( V_2\right) \).
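The rule in this note can be sketched in a few lines of Python. This is only an illustration: the function name `marginal_domain_size`, the attribute names, and the domain sizes below are hypothetical and do not come from the paper.

```python
from math import prod


def marginal_domain_size(marginal, attribute_domain_sizes):
    """Return dom(m): the product of the domain sizes of the attributes in m.

    For a 1-way marginal this is the attribute's own domain size; for a
    2-way marginal (V1, V2) it is dom(V1) * dom(V2).
    """
    return prod(attribute_domain_sizes[attr] for attr in marginal)


# Hypothetical attribute domain sizes, chosen only for illustration.
domain_sizes = {"V1": 4, "V2": 10}

one_way = marginal_domain_size(("V1",), domain_sizes)        # 4
two_way = marginal_domain_size(("V1", "V2"), domain_sizes)   # 4 * 10 = 40
```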

Author information

Corresponding author

Correspondence to Xinxin Ye.

A Proof of Lemma

Proof

Assume a dataset D contains n records, and consider two attributes a and b. Let \(\left\{ a_1,a_2,\ldots \right\} \) denote the frequencies (counts) of the distinct values of attribute a, and \(\left\{ b_1,b_2,\ldots \right\} \) those of attribute b. For the 2-way marginal of (a, b), let \(\left\{ c_{11},c_{12},\ldots \right\} \) denote the frequencies of the joint distribution.

The metric w(a, b) is

$$\begin{aligned} w(a, b)=\frac{1}{2} \sum _{i j}\left| \frac{a_i b_j}{n}-c_{i j}\right| \end{aligned}$$
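As a minimal sketch of how this metric could be computed from raw records, the Python below evaluates \(w(a, b)\) directly from the definition. The function name `pair_weight` and the toy datasets are assumptions made for illustration; the records are taken to be `(a_value, b_value)` tuples.

```python
from collections import Counter


def pair_weight(records):
    """Compute w(a, b) = 1/2 * sum_ij |a_i * b_j / n - c_ij|.

    `records` is a non-empty list of (a_value, b_value) tuples. Only observed
    values are enumerated; an unobserved value has count 0 and would
    contribute |0 - 0| = 0 to the sum.
    """
    n = len(records)
    a_counts = Counter(a for a, _ in records)   # a_i
    b_counts = Counter(b for _, b in records)   # b_j
    joint = Counter(records)                    # c_ij (0 for unseen pairs)
    return 0.5 * sum(
        abs(a_counts[i] * b_counts[j] / n - joint[(i, j)])
        for i in a_counts
        for j in b_counts
    )


# Toy data: a and b independent, so every expected count equals the joint count.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Toy data: a and b perfectly correlated.
correlated = [(0, 0), (0, 0), (1, 1), (1, 1)]

w_independent = pair_weight(independent)   # 0.0
w_correlated = pair_weight(correlated)     # 2.0
```

On these toy inputs the metric is 0 when the joint counts match the product of the marginals and grows as the two attributes become more dependent.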

If we add a user with value \(u\) for \(a\) and \(v\) for \(b\), then

$$ \begin{aligned} w^{\prime }(a, b) & =\frac{1}{2} \sum _{i \ne u, j \ne v}\left| \frac{a_{i} b_{j}}{n+1}-c_{i j}\right| \\ & +\frac{1}{2} \sum _{i \ne u}\left| \frac{a_{i}\left( b_{v}+1\right) }{n+1}-c_{i v}\right| \\ & +\frac{1}{2} \sum _{j \ne v}\left| \frac{\left( a_{u}+1\right) b_{j}}{n+1}-c_{u j}\right| \\ & +\frac{1}{2}\left| \frac{\left( a_{u}+1\right) \left( b_{v}+1\right) }{n+1}-\left( c_{u v}+1\right) \right| \end{aligned} $$

The sensitivity is given by

$$\begin{aligned} \begin{aligned} \varDelta _{w} & =\left| w(a, b)-w^{\prime }(a, b)\right| \\ & \le \frac{1}{2} \sum _{i \ne u, j \ne v}\left| \frac{a_{i} b_{j}}{n(n+1)}\right| +\frac{1}{2} \sum _{i \ne u}\left| \frac{a_{i} b_{v}}{n(n+1)}-\frac{a_{i}}{n+1}\right| \\ & +\frac{1}{2} \sum _{j \ne v}\left| \frac{a_{u}b_j}{n(n+1)}-\frac{b_{j}}{n+1}\right| +\frac{1}{2}\left| \frac{(n+1)a_{u} b_{v}-n\left( a_{u}+1\right) (b_{v}+1)+n(n+1)}{n(n+1)}\right| \\ & =\frac{\frac{1}{2} \sum _{i \ne u,j \ne v} a_{i} b_{j}-\frac{1}{2} \sum _{i \ne u}\left( a_{i} b_{v}-na_{i}\right) -\frac{1}{2} \sum _{j \ne v}\left( a_{u} b_{j}-n b_{j}\right) }{n(n+1)} \\ & + \frac{(n+1) a_{u} b_{v}-n\left( a_{u}+1\right) \left( b_{v}+1\right) +n(n+1)}{2 n(n+1)} \\ & =\frac{(n-a_u)(n-b_v)-(n-a_u)(b_v-n)-(a_u-n)(n-b_v)+(n-a_u)(n-b_v)}{2 n(n+1)} \\ & =\frac{4(n-a_u) \cdot (n-b_v)}{2 n(n+1)} \\ & =\frac{2\left( n-a_{u}\right) \left( n-b_{v}\right) }{n(n+1)} \le 2 \\ \end{aligned} \end{aligned}$$
(8)

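The derivation above uses the identities \(\sum _{i \ne u} a_i=n-a_u\), \(\sum _{j \ne v}b_j=n-b_v\) and \(\sum _{i \ne u,j \ne v} a_{i} b_{j}=(n-a_u)(n-b_v)\). The final bound holds because \(0 \le n-a_u \le n\) and \(0 \le n-b_v \le n\), so \((n-a_u)(n-b_v) \le n^2 < n(n+1)\).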
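The bound in (8) can be checked numerically. The sketch below is an illustrative sanity check, not part of the proof: it draws random toy datasets, adds one record \((u, v)\), and compares the observed change in \(w\) with \(2(n-a_u)(n-b_v)/(n(n+1))\). The function names, value ranges, and trial count are assumptions chosen for the example.

```python
import random
from collections import Counter


def pair_weight(records):
    """w(a, b) = 1/2 * sum_ij |a_i * b_j / n - c_ij| over observed values."""
    n = len(records)
    a_counts = Counter(a for a, _ in records)
    b_counts = Counter(b for _, b in records)
    joint = Counter(records)
    return 0.5 * sum(
        abs(a_counts[i] * b_counts[j] / n - joint[(i, j)])
        for i in a_counts
        for j in b_counts
    )


def sensitivity_bound(records, u, v):
    """Right-hand side of (8): 2 (n - a_u)(n - b_v) / (n (n + 1))."""
    n = len(records)
    a_u = sum(1 for a, _ in records if a == u)   # count of value u before adding
    b_v = sum(1 for _, b in records if b == v)   # count of value v before adding
    return 2 * (n - a_u) * (n - b_v) / (n * (n + 1))


def max_bound_violation(trials=200, seed=0):
    """Largest observed (|w - w'| - bound); non-positive up to rounding if (8) holds."""
    rng = random.Random(seed)
    worst = float("-inf")
    for _ in range(trials):
        n = rng.randint(1, 30)
        records = [(rng.randrange(4), rng.randrange(5)) for _ in range(n)]
        u, v = rng.randrange(4), rng.randrange(5)
        change = abs(pair_weight(records + [(u, v)]) - pair_weight(records))
        bound = sensitivity_bound(records, u, v)
        assert bound <= 2          # the final inequality in (8)
        worst = max(worst, change - bound)
    return worst


violation = max_bound_violation()
```

A value of `violation` that is zero or negative, allowing for floating-point rounding, is consistent with the derivation; a clearly positive value would indicate a mistake in either the code or the algebra.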

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Zhang, M., Ye, X., Deng, H. (2024). Synthetic Data Generation for Differential Privacy Using Maximum Weight Matching. In: Tari, Z., Li, K., Wu, H. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2023. Lecture Notes in Computer Science, vol 14489. Springer, Singapore. https://doi.org/10.1007/978-981-97-0798-0_8

  • DOI: https://doi.org/10.1007/978-981-97-0798-0_8

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-0797-3

  • Online ISBN: 978-981-97-0798-0

  • eBook Packages: Computer Science (R0)
