A Bayesian Nonparametric Approach to Differentially Private Data

  • Conference paper
Privacy in Statistical Databases (PSD 2020)

Part of the book series: Lecture Notes in Computer Science, vol. 12276

Abstract

The protection of private and sensitive data is a problem of increasing importance, due to the vast amount of personal data collected. Differential privacy is arguably the dominant approach to privacy protection, and is currently implemented in both industry and government. In the decentralized paradigm, the sensitive information belonging to each individual is locally transformed by a known privacy-maintaining mechanism Q. The goal for the analyst is to recover the distribution of the raw data, or functionals of it, while having access only to the transformed data. In this work, we propose a Bayesian nonparametric methodology to perform inference on the distribution of the sensitive data, reformulating the differentially private estimation problem as a latent variable Dirichlet process mixture model. This methodology has the advantage that it can be applied to any mechanism Q and works as a “black box” procedure: it estimates the distribution and functionals thereof from the same MCMC draws and with very little tuning. Moreover, being fully nonparametric, it requires very few assumptions on the distribution of the raw data. For the most popular mechanisms Q, such as Laplace and Gaussian, we describe efficient specialized MCMC algorithms and provide theoretical guarantees. Experiments on both synthetic and real datasets show good performance of the proposed method.
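As a concrete illustration of the local privatization step described above, the following is a minimal sketch (our own, not the authors' code) of the clipped Laplace mechanism Q analysed in Appendix A; the function name and defaults are ours, and the noise scale \(2r/\alpha \) is the standard calibration making the release \(\alpha \)-locally differentially private, since the clipped value has sensitivity 2r.

import numpy as np

def privatize(x, r=1.0, alpha=1.0, rng=None):
    """Clipped Laplace mechanism: project each record onto [-r, r], then
    add Laplace noise of scale 2r/alpha (the clipped value has sensitivity
    2r, so the release is alpha-locally differentially private)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.clip(np.asarray(x, dtype=float), -r, r)
    return x + rng.laplace(loc=0.0, scale=2.0 * r / alpha, size=x.shape)

The analyst then observes only the output of privatize, and the methodology of the paper aims to recover the distribution of x from these noisy releases.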


References

  1. Antoniak, C.E.: Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Stat. 2(6), 1152–1174 (1974)

  2. Barak, B., Chaudhuri, K., Dwork, C., Kale, S., McSherry, F., Talwar, K.: Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In: Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 273–282. ACM (2007)

  3. Borgs, C., Chayes, J., Smith, A.: Private graphon estimation for sparse graphs. In: Advances in Neural Information Processing Systems, pp. 1369–1377 (2015)

  4. Borgs, C., Chayes, J., Smith, A., Zadik, I.: Revealing network structure, confidentially: improved rates for node-private graphon estimation. In: 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pp. 533–543. IEEE (2018)

  5. Duchi, J.C., Jordan, M.I., Wainwright, M.J.: Minimax optimal procedures for locally private estimation. J. Am. Stat. Assoc. 113(521), 182–201 (2018)

  6. Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 265–284. Springer, Heidelberg (2006). https://doi.org/10.1007/11681878_14

  7. Eland, A.: Tackling urban mobility with technology. Google Europe Blog, 18 November 2015

  8. Erlingsson, Ú., Pihur, V., Korolova, A.: RAPPOR: randomized aggregatable privacy-preserving ordinal response. In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pp. 1054–1067. ACM (2014)

  9. Fienberg, S.E., Rinaldo, A., Yang, X.: Differential privacy and the risk-utility tradeoff for multi-dimensional contingency tables. In: Domingo-Ferrer, J., Magkos, E. (eds.) PSD 2010. LNCS, vol. 6344, pp. 187–199. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15838-4_17

  10. Gaboardi, M., Lim, H.W., Rogers, R.M., Vadhan, S.P.: Differentially private chi-squared hypothesis testing: goodness of fit and independence testing. In: Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), vol. 48. JMLR (2016)

  11. Gaboardi, M., Rogers, R.: Local private hypothesis testing: chi-square tests. arXiv preprint arXiv:1709.07155 (2017)

  12. Gao, F., van der Vaart, A.: Posterior contraction rates for deconvolution of Dirichlet-Laplace mixtures. Electron. J. Stat. 10(1), 608–627 (2016)

  13. Ghosal, S., van der Vaart, A.: Fundamentals of Nonparametric Bayesian Inference, vol. 44. Cambridge University Press, Cambridge (2017)

  14. Karwa, V., Slavković, A.: Inference using noisy degrees: differentially private \(\beta \)-model and synthetic graphs. Ann. Stat. 44(1), 87–112 (2016)

  15. Kasiviswanathan, S.P., Nissim, K., Raskhodnikova, S., Smith, A.: Analyzing graphs with node differential privacy. In: Sahai, A. (ed.) TCC 2013. LNCS, vol. 7785, pp. 457–476. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-36594-2_26

  16. Lo, A.Y.: On a class of Bayesian nonparametric estimates: I. Density estimates. Ann. Stat. 12, 351–357 (1984)

  17. Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J., Vilhuber, L.: Privacy: theory meets practice on the map. In: Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pp. 277–286. IEEE Computer Society (2008)

  18. McSherry, F., Talwar, K.: Mechanism design via differential privacy. In: FOCS 2007, pp. 94–103 (2007)

  19. Neal, R.M.: Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Stat. 9(2), 249–265 (2000)

  20. Nguyen, X.: Convergence of latent mixing measures in finite and infinite mixture models. Ann. Stat. 41(1), 370–400 (2013)

  21. Rinott, Y., O’Keefe, C.M., Shlomo, N., Skinner, C.: Confidentiality and differential privacy in the dissemination of frequency tables. Stat. Sci. 33(3), 358–385 (2018)

  22. Villani, C.: Optimal Transport: Old and New, vol. 338. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-71050-9

  23. Wang, Y., Lee, J., Kifer, D.: Revisiting differentially private hypothesis tests for categorical data. arXiv preprint arXiv:1511.03376 (2015)

  24. Wasserman, L., Zhou, S.: A statistical framework for differential privacy. J. Am. Stat. Assoc. 105(489), 375–389 (2010)

Author information

Correspondence to Marco Battiston.

Appendices

Appendix A: Algorithm for Laplace Mechanism

In this section, we derive the posterior \(\mathbb {P}(dX^{*}_{k}|Z_{j_{1}:j_{n_k}})\) for the Laplace mechanism. Together with Algorithm 2 in the main text, this posterior yields an efficient MCMC algorithm for posterior estimation when the Laplace mechanism has been applied to the original data. We remark that, even though the posterior (4) might look complicated at first glance, it is just a mixture distribution; for most choices of \(P_0\), its weights are easy to compute and sampling from it is straightforward. After the proof of Proposition A, we detail a specific instance of (4) with Gaussian \(P_0\), which is used in the experiments. The parameters r and \(\lambda _\alpha \) are chosen as in [5] so that the Laplace mechanism satisfies differential privacy.

Proposition A

(Posterior with Laplace Mech.). Let \(r > 0\) and let \(\varPi _{[-r,r]}\) denote the projection operator onto \([-r,r]\), defined as \(\varPi _{[-r,r]}(x) = \min \{\max \{x, -r\}, r\}\). Let \(Z_{i} | X_{i} \sim \text {Laplace} (\varPi _{[-r,r]}(X_{i}), \lambda _\alpha )\) for \(i=1,\ldots ,n\), and let \(Z_{j_1},\ldots ,Z_{j_{n_k}}\) denote the \(n_k\) observations currently assigned to cluster k, i.e. those with \(c_{j_i}=k\), assumed w.l.o.g. to be ordered increasingly. Let also \(i_- := \min \{ i\ |\ Z_{j_i} \ge -r\}\) (with \(i_- = n_k+1\) if the set is empty) and \(i_+ := \max \{ i\ |\ Z_{j_i} \le r\}\) (with \(i_+ = 0\) if the set is empty), and set \(\widetilde{Z}_{i_- - 1} = -r\), \(\widetilde{Z}_{i_+ + 1} = r\) and \(\widetilde{Z}_i = Z_{j_i}\) for \(i \in [i_-, i_+]\). Then, the posterior distribution \(\mathbb {P}(dX^{*}_{k}|Z_{j_{1}:j_{n_k}})\) is proportional to

$$\begin{aligned} \Big [ \mathbb {I}_{X^{*}_{k} < -r}\ C_{i_- - 1}\ e^{\frac{2i_- - n_k - 2}{\lambda _\alpha } r} + \sum \limits _{j=i_- - 1}^{i_+} \mathbb {I}_{X^{*}_{k} \in [\widetilde{Z}_j, \widetilde{Z}_{j+1})}\ C_j\ e^{\frac{(n_k - 2j) X^{*}_{k}}{\lambda _\alpha }} + \mathbb {I}_{X^{*}_{k} \ge r}\ C_{i_+}\ e^{-\frac{2i_+ - n_k}{\lambda _\alpha } r} \Big ]\ P_0(dX^{*}_{k}) \end{aligned}$$
(4)

where \(C_j = e^{\frac{1}{\lambda _\alpha } \left( \sum \limits _{i=1}^{j} \widetilde{Z}_{i} - \sum \limits _{i = j+1}^{n_k} \widetilde{Z}_{i}\right) }\) for \(j \in \{i_- -1,\ldots , i_+ \}\), with \(\widetilde{Z}_i = Z_{j_i}\) for indices outside \([i_- - 1, i_+ + 1]\).
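As a concrete illustration of this construction (our own example, not in the original): take \(r = 1\) and cluster observations \((Z_{j_1},\ldots ,Z_{j_4}) = (-2.1, -0.3, 0.4, 1.7)\). Then \(i_- = 2\), \(i_+ = 3\), and the breakpoints are \(\widetilde{Z}_1 = -1\), \(\widetilde{Z}_2 = -0.3\), \(\widetilde{Z}_3 = 0.4\), \(\widetilde{Z}_4 = 1\), so (4) is a five-component mixture supported on \((-\infty ,-1)\), \([-1,-0.3)\), \([-0.3,0.4)\), \([0.4,1)\) and \([1,\infty )\).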

Normal Base Measure: Let \(P_0(dX) = \frac{1}{\sqrt{2 \pi } \sigma } e^{-\frac{(X-\mu )^2}{2\sigma ^2}} dX\) be a Normal distribution and denote \(\tilde{\mu }_j = \frac{(n_k-2j)\sigma ^2}{\lambda _\alpha } + \mu \). Then the posterior (4) specializes to

$$\begin{aligned} \mathbb {P}&(X^{*}_{k} | Z_{j_{1}:j_{n_k}}) \propto \mathbb {I}_{X^{*}_{k} < -r}\ C_{i_- - 1}\ e^{\frac{2i_- - n_k - 2}{\lambda _\alpha } r} \frac{1}{\sqrt{2 \pi } \sigma } e^{-\frac{(X^{*}_{k}-\mu )^2}{2\sigma ^2}}\\&+ \sum \limits _{j=i_- - 1}^{i_+} \mathbb {I}_{X^{*}_{k} \in [\widetilde{Z}_j, \widetilde{Z}_{j+1})} \ C_j\ e^{\frac{\tilde{\mu }_j^2 - \mu ^2}{2\sigma ^2}} \frac{1}{\sqrt{2 \pi } \sigma } e^{-\frac{(X^{*}_{k}-\tilde{\mu }_j)^2}{2\sigma ^2}} \\ {}&+\,\mathbb {I}_{X^{*}_{k} \ge r}\ C_{i_+}\ e^{-\frac{2i_+ - n_k}{\lambda _\alpha } r} \frac{1}{\sqrt{2 \pi } \sigma } e^{-\frac{(X^{*}_{k}-\mu )^2}{2\sigma ^2}}, \end{aligned}$$

where we have used the fact that

$$\begin{aligned} e^{\frac{(n_k - 2j) X^{*}_{k}}{\lambda _\alpha }}\ e^{-\frac{(X^{*}_{k}-\mu )^2}{2\sigma ^2}} = e^{\frac{\tilde{\mu }_j^2 - \mu ^2}{2\sigma ^2}}\ e^{-\frac{(X^{*}_{k}-\tilde{\mu }_j)^2}{2\sigma ^2}}. \end{aligned}$$

Let us denote, for \(j=i_- - 2,\ldots , i_+ + 1\),

$$\begin{aligned}&\varPi _{i_--2} = C_{i_- - 1}\ e^{\frac{2i_- - n_k - 2}{\lambda _\alpha } r} \left[ 1 + \text {erf}\left( \frac{-r-\mu }{\sqrt{2}\sigma } \right) \right] \\&\varPi _j = C_j\ e^{\frac{\tilde{\mu }_j^2 - \mu ^2}{2\sigma ^2}} \left[ \text {erf}\left( \frac{\widetilde{Z}_{j+1}-\tilde{\mu }_j}{\sqrt{2}\sigma } \right) - \text {erf}\left( \frac{\widetilde{Z}_{j}-\tilde{\mu }_j}{\sqrt{2}\sigma } \right) \right] \ \ \ \text {for } j=i_--1,\ldots , i_+; \\&\varPi _{i_+ + 1} = C_{i_+}\ e^{-\frac{2i_+ - n_k}{\lambda _\alpha } r} \left[ 1 - \text {erf}\left( \frac{r-\mu }{\sqrt{2}\sigma } \right) \right] \end{aligned}$$

where \(\text {erf}\) denotes the Gauss error function. Let \((\pi _j)_j = (\varPi _j/\sum _k \varPi _k)_j\) denote the normalized weights. The posterior is then a mixture of truncated Normals with disjoint supports. To sample from it, we proceed in two steps. First, we sample a categorical variable J with \(\mathbb {P}(J=j) = \pi _j\). If \(J = i_- - 2\), we sample \(X^{*}_{k}\) from a truncated Normal with mean \(\mu \) and variance \(\sigma ^2\) restricted to \((-\infty ,-r)\). If \(J = i_+ + 1\), we sample \(X^{*}_{k}\) from a truncated Normal with the same mean and variance restricted to \((r,\infty )\). Otherwise, we sample \(X^{*}_{k}\) from a truncated Normal with mean \(\tilde{\mu }_J\) and variance \(\sigma ^2\) restricted to \((\widetilde{Z}_{J}, \widetilde{Z}_{J+1})\).
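The following is a minimal NumPy/SciPy sketch of this two-step sampler (our own illustration, not the authors' implementation). For numerical robustness it recomputes the x-free likelihood constants directly from cumulative sums of the sorted observations, which agrees with the \(C_j\) of Proposition A up to a global factor; the function name and the log-weight normalization are our choices.

import numpy as np
from scipy.special import erf
from scipy.stats import truncnorm

def sample_cluster_location(Z, r, lam, mu, sigma, rng=None):
    """One draw of X*_k | Z for Z_i | X ~ Laplace(Pi_[-r,r](X), lam)
    and a N(mu, sigma^2) base measure P0; Z holds the observations
    currently assigned to cluster k."""
    rng = np.random.default_rng() if rng is None else rng
    Z = np.sort(np.asarray(Z, dtype=float))
    n_k = len(Z)
    # Breakpoints: -r, the observations falling inside [-r, r], then r.
    b = np.concatenate(([-r], Z[(Z >= -r) & (Z <= r)], [r]))
    n_below = int(np.sum(Z < -r))              # observations below -r
    S = np.concatenate(([0.0], np.cumsum(Z)))  # S[j] = sum of j smallest Z's
    log_w, comps = [], []
    # Middle intervals [b[m], b[m+1]): exactly j observations lie at or
    # below any x there, so sum_i |Z_i - x| = (2j - n_k) x + S[n_k] - 2 S[j].
    for m in range(len(b) - 1):
        j = n_below + m
        mu_j = mu + (n_k - 2.0 * j) * sigma**2 / lam   # tilted Normal mean
        log_c = (2.0 * S[j] - S[n_k]) / lam + (mu_j**2 - mu**2) / (2 * sigma**2)
        mass = 0.5 * (erf((b[m + 1] - mu_j) / (np.sqrt(2.0) * sigma))
                      - erf((b[m] - mu_j) / (np.sqrt(2.0) * sigma)))
        log_w.append(log_c + np.log(max(mass, 1e-300)))
        comps.append((mu_j, b[m], b[m + 1]))
    # Tails: the projection clips X*_k, so the likelihood is constant there
    # and equals its value at the nearest edge; the prior mean is untilted.
    for edge, lo, hi in [(-r, -np.inf, -r), (r, r, np.inf)]:
        j = n_below if edge < 0 else n_below + len(b) - 2
        log_c = (2.0 * S[j] - S[n_k]) / lam + (n_k - 2.0 * j) * edge / lam
        mass = 0.5 * (erf((hi - mu) / (np.sqrt(2.0) * sigma))
                      - erf((lo - mu) / (np.sqrt(2.0) * sigma)))
        log_w.append(log_c + np.log(max(mass, 1e-300)))
        comps.append((mu, lo, hi))
    log_w = np.asarray(log_w)
    w = np.exp(log_w - log_w.max())
    J = rng.choice(len(w), p=w / w.sum())       # step 1: pick a component
    m_J, lo, hi = comps[J]                      # step 2: truncated Normal draw
    return truncnorm.rvs((lo - m_J) / sigma, (hi - m_J) / sigma,
                         loc=m_J, scale=sigma, random_state=rng)

Each sweep of Algorithm 2 would call such a routine once per cluster, after the cluster assignments have been resampled.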

Appendix B: Proof of Proposition 1

Denote first by \(M_P (Z_i) = \int Q(Z_i|X_i) P(dX_i)\) the marginal distribution of the observations when the sensitive data is distributed according to P. Denoting by \(P_*\) the true distribution of the sensitive data \(X_i\), the true marginal distribution of \(Z_i\) is then \(M_{P_*}\). We will prove Proposition 1 in the following steps:

  1. Step 1: We show that

    $$\begin{aligned} \forall \epsilon > 0,\ \varPi (h(M_P,M_{P_*}) > \epsilon \ |\ Z_{1:n}) \rightarrow 0 \ \ \text {a.s.} \end{aligned}$$
    (5)

    Here, \(\varPi \) denotes the Dirichlet process prior, \(\varPi ( \cdot |Z_{1:n})\) the posterior under the DPM model, and h the Hellinger distance.

  2. Step 2: We will show that for any \(\delta > 0\),

    $$\begin{aligned} W_2(P,P_*)^2 \le C_\delta h(M_P,M_{P_*}) ^{3/4}+ C \delta ^2 \end{aligned}$$
    (6)

    where \(W_2\) is the \(\mathbb {L}_2\) Wasserstein distance.

  3. Conclusion: Using Steps 1 and 2, we will show that for any \(\epsilon > 0\),

    $$\begin{aligned} \varPi (W_1(P,P_*) > \epsilon \ |\ Z_{1:n}) \rightarrow 0 \ \ \text {a.s.} \end{aligned}$$
    (7)

    Now, since \(W_1\) is convex and uniformly bounded on the space of probability measures on \(\mathcal {X} \subset [a,b]\), Theorem 6.8 of [13] gives that \(\mathbb {E}(P|Z_{1:n})\) converges almost surely to \(P_*\) in the \(W_1\) metric. Since \([a,b]\) is compact, this implies convergence in \(W_k\) for any \(k \ge 1\).

To simplify the reading of the proof, in what follows C will denote constant quantities (in particular, not depending on n) that may change from line to line.

Let us start with the easiest step, Eq. (7) of Step 3. Let \(\epsilon > 0\); from Eq. (6), we know that

$$\begin{aligned} \varPi (W_2(P,P_*)^2 \le \epsilon \ |\ Z_{1:n}) \ge \varPi (C_\delta h(M_P,M_{P_*})^{3/4} + C \delta ^2 \le \epsilon \ |\ Z_{1:n}). \end{aligned}$$

Take \(\delta \) such that \(C \delta ^2 \le \epsilon /2\). The right-hand side, and hence the left-hand side, of the previous inequality is then lower bounded by \( \varPi (C_\delta h(M_P,M_{P_*})^{3/4} \le \epsilon /2 \ |\ Z_{1:n})\), which by Eq. (5) converges almost surely to 1. This proves convergence in \(W_2\), which implies (7) since \(\mathcal {X} \subset [a,b]\).

Now let us consider Step 1. The Dirichlet process prior \(\varPi \) induces a prior (also denoted \(\varPi \)) on the marginals \(M_P\) of the \(Z_i\). Since \(Z_i \overset{iid}{\sim } M_{P_*}\), Schwartz's theorem guarantees that (5) holds as long as \(M_{P_*}\) lies in the Kullback-Leibler support of \(\varPi \). We will use Theorem 7.2 of [13] to prove this. Let

$$\underline{Q}(Z_i ; \mathcal {X}) = \inf \limits _{x \in \mathcal {X}} Q(Z_i | x).$$

Let \(Z_i \in \mathcal {Z}\). For any \(X_i \in \mathcal {X}\), the differential privacy condition gives

$$Q(Z_i | X_i) \le e^\alpha \underline{Q}(Z_i ; \mathcal {X}) < +\infty , $$

which corresponds to condition (A1) in Theorem 7.2 of [13]. We only need to prove that (A2) holds, i.e.

$$\begin{aligned} \int \log \left( \frac{M_{P_*}(Z_i)}{\underline{Q}(Z_i ; \mathcal {X})} \right) M_{P_*}(dZ_i) < +\infty . \end{aligned}$$

To see this, we rewrite the ratio inside the \(\log \) as follows:

$$\begin{aligned} \frac{M_{P_*}(Z_i)}{\underline{Q}(Z_i ; \mathcal {X})} = \int \frac{Q(Z_i|X_i)}{\underline{Q}(Z_i ; \mathcal {X})} P_*(dX_i) \le e^\alpha \int P_*(dX_i) = e^\alpha , \end{aligned}$$

where the inequality is due to the differential privacy property of Q. This proves Step 1.
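As a concrete instance (our own check, not part of the original proof): for the clipped Laplace mechanism of Appendix A with \(\lambda _\alpha = 2r/\alpha \), for any \(X_i, X_i' \in \mathcal {X}\),

$$\begin{aligned} \frac{Q(Z_i|X_i)}{Q(Z_i|X_i')} = e^{\frac{|Z_i - \varPi _{[-r,r]}(X_i')| - |Z_i - \varPi _{[-r,r]}(X_i)|}{\lambda _\alpha }} \le e^{\frac{2r}{\lambda _\alpha }} = e^{\alpha }, \end{aligned}$$

since the two projected values are at distance at most 2r; taking the infimum over \(X_i'\) gives exactly the bound \(Q(Z_i|X_i) \le e^{\alpha } \underline{Q}(Z_i; \mathcal {X})\) used above.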

It remains to prove Step 2. We first remark that, since the noise is additive in our setting, \(Q(Z_i | X_i) = C_Q e^{-\alpha \rho (X_i-Z_i)/\varDelta }\), where \(C_Q\) is a constant (independent of \(X_i\)). Denote by \(f: t \mapsto C_Q e^{-\alpha \rho (t)/\varDelta }\) the noise density, by \(\mathcal {L}(f)\) its Fourier transform, and by \(P*f\) the convolution of P and f. We also recall that

$$\begin{aligned} \mathcal {L}(M_P) = \mathcal {L}(P*f) = \mathcal {L}(P) \mathcal {L}(f). \end{aligned}$$

This part follows the same strategy as the proof of Theorem 2 in [20], the main difference being that here we are not interested in rates and hence need weaker conditions on f. As in that proof, we define a symmetric density K on \(\mathbb {R}\) whose Fourier transform \(\mathcal {L}(K)\) is continuous, bounded and with support included in \([-1,1]\). Let \(\delta \in (0,1)\) and \(K_\delta (x) = \frac{1}{\delta } K(x/\delta )\). Following the lines of the proof of Theorem 2 in [20], we find that

$$\begin{aligned} W_2^2( P, P_*) \le C ( || P * K_\delta - P_* * K_\delta ||_2^{3/4} + \delta ^2), \end{aligned}$$
(8)

where C is a constant (depending only on K), and that

$$\begin{aligned} || P * K_\delta - P_* * K_\delta ||_2 \le 2\, d_{TV} (M_P, M_{P_*})\, || g_\delta ||_2, \end{aligned}$$

where \(g_\delta \) is the inverse Fourier transform of \( \frac{\mathcal {L}(K_\delta )}{\mathcal {L}(f)}\) and \(d_{TV}\) the total variation distance. Now, using Plancherel’s identity it comes that

$$\begin{aligned} ||g_\delta ||_2^2 \le C \int \left| \frac{ \mathcal {L}(K_\delta )(t) }{\mathcal {L}(f)(t)} \right| ^2 dt = C \int _{[-1/\delta , 1/\delta ]} \left| \frac{ \mathcal {L}(K_\delta )(t) }{\mathcal {L}(f)(t)} \right| ^2 dt \le \frac{C}{\delta } \sup \limits _{[-1/\delta , 1/\delta ] } |\mathcal {L}(f)|^{-2}, \end{aligned}$$

where the equality comes from the fact that the support of \(\mathcal {L}(K_\delta )\) is contained in \([-1/\delta ,1/\delta ]\), and the last inequality from the fact that \(\mathcal {L}(K_\delta )\) is bounded. Since \(|\mathcal {L}(f)|\) is strictly positive (by assumption) and continuous, it follows that \(C_\delta ^2 = \frac{C}{\delta } \sup \limits _{[-1/ \delta , 1/ \delta ] } |\mathcal {L}(f)|^{-2} < +\infty \). Using the bound \(d_{TV} \le \sqrt{2}\ h\), we can write

$$\begin{aligned} || P * K_\delta - P_* * K_\delta ||_2 \le C_\delta h(M_P, M_{P_*}), \end{aligned}$$
(9)

which together with (8) gives

$$ W_2^2( P, P_*) \le C_\delta h(M_P,M_{P_*}) ^{3/4}+ C \delta ^2. $$

Convergence of moments follows directly from [22] (Theorems 6.7 and 6.8).
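As a concrete instance of the finiteness of \(C_\delta \) (our own check, not part of the original proof): for the Laplace mechanism, \(f(t) = \frac{1}{2\lambda _\alpha } e^{-|t|/\lambda _\alpha }\) and

$$\begin{aligned} \mathcal {L}(f)(t) = \frac{1}{1 + \lambda _\alpha ^2 t^2}, \end{aligned}$$

which is continuous and strictly positive, so that \(\sup _{[-1/\delta , 1/\delta ]} |\mathcal {L}(f)|^{-2} = (1 + \lambda _\alpha ^2/\delta ^2)^2 < +\infty \) for every fixed \(\delta \).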


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Ayed, F., Battiston, M., Di Benedetto, G. (2020). A Bayesian Nonparametric Approach to Differentially Private Data. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science, vol. 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_3

  • DOI: https://doi.org/10.1007/978-3-030-57521-2_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-57520-5

  • Online ISBN: 978-3-030-57521-2
