A Bayesian Nonparametric Approach to Differentially Private Data

  • Conference paper
Privacy in Statistical Databases (PSD 2020)

Part of the book series: Lecture Notes in Computer Science, vol. 12276

Abstract

The protection of private and sensitive data is a problem of increasing importance, due to the vast amount of personal data collected. Differential privacy is arguably the dominant approach to privacy protection, and is currently implemented in both industry and government. In the decentralized paradigm, the sensitive information belonging to each individual is locally transformed by a known privacy-maintaining mechanism Q. The goal for the analyst is to recover the distribution of the raw data, or functionals of it, while having access only to the transformed data. In this work, we propose a Bayesian nonparametric methodology to perform inference on the distribution of the sensitive data, reformulating the differentially private estimation problem as a latent variable Dirichlet process mixture model. This methodology has the advantage that it can be applied to any mechanism Q and works as a “black box” procedure: it estimates the distribution and functionals thereof from the same MCMC draws and with very little tuning. Moreover, being fully nonparametric, it requires very few assumptions on the distribution of the raw data. For the most popular mechanisms Q, such as Laplace and Gaussian, we describe efficient specialized MCMC algorithms and provide theoretical guarantees. Experiments on both synthetic and real datasets show good performance of the proposed method.
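As a concrete illustration of the local privatization step described above, the following is a minimal sketch (our own, not the authors' code) of the clipped Laplace mechanism Q analysed in Appendix A; the function name and defaults are ours, and the noise scale \(2r/\alpha \) is the standard calibration making the release \(\alpha \)-locally differentially private, since the clipped value has sensitivity 2r.

import numpy as np

def privatize(x, r=1.0, alpha=1.0, rng=None):
    """Clipped Laplace mechanism: project each record onto [-r, r], then
    add Laplace noise of scale 2r/alpha (the clipped value has sensitivity
    2r, so the release is alpha-locally differentially private)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.clip(np.asarray(x, dtype=float), -r, r)
    return x + rng.laplace(loc=0.0, scale=2.0 * r / alpha, size=x.shape)

The analyst then observes only the output of privatize, and the methodology of the paper aims to recover the distribution of x from these noisy releases.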


References

  1. Antoniak, C.E.: Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Stat. 2(6), 1152–1174 (1974)

  2. Barak, B., Chaudhuri, K., Dwork, C., Kale, S., McSherry, F., Talwar, K.: Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In: Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 273–282. ACM (2007)

  3. Borgs, C., Chayes, J., Smith, A.: Private graphon estimation for sparse graphs. In: Advances in Neural Information Processing Systems, pp. 1369–1377 (2015)

  4. Borgs, C., Chayes, J., Smith, A., Zadik, I.: Revealing network structure, confidentially: improved rates for node-private graphon estimation. In: 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pp. 533–543. IEEE (2018)

  5. Duchi, J.C., Jordan, M.I., Wainwright, M.J.: Minimax optimal procedures for locally private estimation. J. Am. Stat. Assoc. 113(521), 182–201 (2018)

  6. Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 265–284. Springer, Heidelberg (2006). https://doi.org/10.1007/11681878_14

  7. Eland, A.: Tackling urban mobility with technology. Google Europe Blog, 18 November 2015

  8. Erlingsson, Ú., Pihur, V., Korolova, A.: RAPPOR: randomized aggregatable privacy-preserving ordinal response. In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pp. 1054–1067. ACM (2014)

  9. Fienberg, S.E., Rinaldo, A., Yang, X.: Differential privacy and the risk-utility tradeoff for multi-dimensional contingency tables. In: Domingo-Ferrer, J., Magkos, E. (eds.) PSD 2010. LNCS, vol. 6344, pp. 187–199. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15838-4_17

  10. Gaboardi, M., Lim, H.W., Rogers, R.M., Vadhan, S.P.: Differentially private chi-squared hypothesis testing: goodness of fit and independence testing. In: Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), vol. 48. JMLR (2016)

  11. Gaboardi, M., Rogers, R.: Local private hypothesis testing: chi-square tests. arXiv preprint arXiv:1709.07155 (2017)

  12. Gao, F., van der Vaart, A.: Posterior contraction rates for deconvolution of Dirichlet-Laplace mixtures. Electron. J. Stat. 10(1), 608–627 (2016)

  13. Ghosal, S., van der Vaart, A.: Fundamentals of Nonparametric Bayesian Inference, vol. 44. Cambridge University Press, Cambridge (2017)

  14. Karwa, V., Slavković, A.: Inference using noisy degrees: differentially private \(\beta \)-model and synthetic graphs. Ann. Stat. 44(1), 87–112 (2016)

  15. Kasiviswanathan, S.P., Nissim, K., Raskhodnikova, S., Smith, A.: Analyzing graphs with node differential privacy. In: Sahai, A. (ed.) TCC 2013. LNCS, vol. 7785, pp. 457–476. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-36594-2_26

  16. Lo, A.Y.: On a class of Bayesian nonparametric estimates: I. Density estimates. Ann. Stat. 12, 351–357 (1984)

  17. Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J., Vilhuber, L.: Privacy: theory meets practice on the map. In: Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pp. 277–286. IEEE Computer Society (2008)

  18. McSherry, F., Talwar, K.: Mechanism design via differential privacy. In: FOCS 2007, pp. 94–103 (2007)

  19. Neal, R.M.: Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Stat. 9(2), 249–265 (2000)

  20. Nguyen, X.: Convergence of latent mixing measures in finite and infinite mixture models. Ann. Stat. 41(1), 370–400 (2013)

  21. Rinott, Y., O’Keefe, C.M., Shlomo, N., Skinner, C.: Confidentiality and differential privacy in the dissemination of frequency tables. Stat. Sci. 33(3), 358–385 (2018)

  22. Villani, C.: Optimal Transport: Old and New, vol. 338. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-71050-9

  23. Wang, Y., Lee, J., Kifer, D.: Revisiting differentially private hypothesis tests for categorical data. arXiv preprint arXiv:1511.03376 (2015)

  24. Wasserman, L., Zhou, S.: A statistical framework for differential privacy. J. Am. Stat. Assoc. 105(489), 375–389 (2010)

Author information

Correspondence to Marco Battiston.

Appendices

Appendix A: Algorithm for Laplace Mechanism

In this section, we derive the posterior \(\mathbb {P}(dX^{*}_{k}|Z_{j_{1}:j_{n_k}})\) for the Laplace mechanism. Together with Algorithm 2 in the main text, this posterior yields an efficient MCMC algorithm for posterior estimation when the Laplace mechanism has been applied to the original data. We remark that, even though the posterior (4) might look complicated at first glance, it is just a mixture distribution; for most choices of \(P_0\), its weights are easy to compute and sampling from it is straightforward. After the proof of Proposition A, we detail a specific instance of (4) with Gaussian \(P_0\), which is used in the experiments. The parameters r and \(\lambda _\alpha \) are chosen as in [5] so that the Laplace mechanism satisfies differential privacy.

Proposition A

(Posterior with Laplace Mech.). Let \(r > 0\) and let \(\varPi _{[-r,r]}\) denote the projection operator onto \([-r,r]\), defined as \(\varPi _{[-r,r]}(x) = \min \{\max \{x, -r\}, r\}\). Let \(Z_{i} | X_{i} \sim \text {Laplace} (\varPi _{[-r,r]}(X_{i}), \lambda _\alpha )\) for \(i=1,\ldots ,n\), and let \(Z_{j_1},\ldots ,Z_{j_{n_k}}\) denote the \(n_k\) observations currently assigned to cluster k, i.e. those with \(c_{j_i}=k\), assumed w.l.o.g. to be ordered increasingly. Let also \(i_- := \min \{ i\ |\ Z_{j_i} \ge -r\}\) (with \(i_- = n_k+1\) if the set is empty) and \(i_+ := \max \{ i\ |\ Z_{j_i} \le r\}\) (with \(i_+ = 0\) if the set is empty), and set \(\widetilde{Z}_{i_- - 1} = -r\), \(\widetilde{Z}_{i_+ + 1} = r\) and \(\widetilde{Z}_i = Z_{j_i}\) for \(i \in [i_-, i_+]\). Then, the posterior distribution \(\mathbb {P}(dX^{*}_{k}|Z_{j_{1}:j_{n_k}})\) is proportional to

$$\begin{aligned} \Big [ \mathbb {I}_{X^{*}_{k} < -r}\ C_{i_- - 1}\ e^{\frac{2i_- - n_k - 2}{\lambda _\alpha } r} + \sum \limits _{j=i_- - 1}^{i_+} \mathbb {I}_{X^{*}_{k} \in [\widetilde{Z}_j, \widetilde{Z}_{j+1})}\ C_j\ e^{\frac{(n_k - 2j) X^{*}_{k}}{\lambda _\alpha }} + \mathbb {I}_{X^{*}_{k} \ge r}\ C_{i_+}\ e^{-\frac{2i_+ - n_k}{\lambda _\alpha } r} \Big ]\ P_0(dX^{*}_{k}) \end{aligned}$$
(4)

where \(C_j = e^{\frac{1}{\lambda _\alpha } \left( \sum \limits _{i=1}^{j} \widetilde{Z}_{i} - \sum \limits _{i = j+1}^{n_k} \widetilde{Z}_{i}\right) }\) for \(j \in \{i_- -1,\ldots , i_+ \}\), with \(\widetilde{Z}_i = Z_{j_i}\) for indices outside \([i_- - 1, i_+ + 1]\).
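As a concrete illustration of this construction (our own example, not in the original): take \(r = 1\) and cluster observations \((Z_{j_1},\ldots ,Z_{j_4}) = (-2.1, -0.3, 0.4, 1.7)\). Then \(i_- = 2\), \(i_+ = 3\), and the breakpoints are \(\widetilde{Z}_1 = -1\), \(\widetilde{Z}_2 = -0.3\), \(\widetilde{Z}_3 = 0.4\), \(\widetilde{Z}_4 = 1\), so (4) is a five-component mixture supported on \((-\infty ,-1)\), \([-1,-0.3)\), \([-0.3,0.4)\), \([0.4,1)\) and \([1,\infty )\).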

Normal Base Measure: Let \(P_0(dX) = \frac{1}{\sqrt{2 \pi } \sigma } e^{-\frac{(X-\mu )^2}{2\sigma ^2}} dX\) be a Normal distribution and denote \(\tilde{\mu }_j = \frac{(n_k-2j)\sigma ^2}{\lambda _\alpha } + \mu \). Then the posterior (4) specializes to

$$\begin{aligned} \mathbb {P}&(X^{*}_{k} | Z_{j_{1}:j_{n_k}}) \propto \mathbb {I}_{X^{*}_{k} < -r}\ C_{i_- - 1}\ e^{\frac{2i_- - n_k - 2}{\lambda _\alpha } r} \frac{1}{\sqrt{2 \pi } \sigma } e^{-\frac{(X^{*}_{k}-\mu )^2}{2\sigma ^2}}\\&+ \sum \limits _{j=i_- - 1}^{i_+} \mathbb {I}_{X^{*}_{k} \in [\widetilde{Z}_j, \widetilde{Z}_{j+1})} \ C_j\ e^{\frac{\tilde{\mu }_j^2 - \mu ^2}{2\sigma ^2}} \frac{1}{\sqrt{2 \pi } \sigma } e^{-\frac{(X^{*}_{k}-\tilde{\mu }_j)^2}{2\sigma ^2}} \\ {}&+\,\mathbb {I}_{X^{*}_{k} \ge r}\ C_{i_+}\ e^{-\frac{2i_+ - n_k}{\lambda _\alpha } r} \frac{1}{\sqrt{2 \pi } \sigma } e^{-\frac{(X^{*}_{k}-\mu )^2}{2\sigma ^2}}, \end{aligned}$$

where we have used the fact that

$$\begin{aligned} e^{\frac{(n_k - 2j) X^{*}_{k}}{\lambda _\alpha }}\ e^{-\frac{(X^{*}_{k}-\mu )^2}{2\sigma ^2}} = e^{\frac{\tilde{\mu }_j^2 - \mu ^2}{2\sigma ^2}}\ e^{-\frac{(X^{*}_{k}-\tilde{\mu }_j)^2}{2\sigma ^2}}. \end{aligned}$$

Let us denote, for \(j=i_- - 2,\ldots , i_+ + 1\),

$$\begin{aligned}&\varPi _{i_--2} = C_{i_- - 1}\ e^{\frac{2i_- - n_k - 2}{\lambda _\alpha } r} \left[ 1 + \text {erf}\left( \frac{-r-\mu }{\sqrt{2}\sigma } \right) \right] \\&\varPi _j = C_j\ e^{\frac{\tilde{\mu }_j^2 - \mu ^2}{2\sigma ^2}} \left[ \text {erf}\left( \frac{\widetilde{Z}_{j+1}-\tilde{\mu }_j}{\sqrt{2}\sigma } \right) - \text {erf}\left( \frac{\widetilde{Z}_{j}-\tilde{\mu }_j}{\sqrt{2}\sigma } \right) \right] \ \ \ \text {for } j=i_--1,\ldots , i_+; \\&\varPi _{i_+ + 1} = C_{i_+}\ e^{-\frac{2i_+ - n_k}{\lambda _\alpha } r} \left[ 1 - \text {erf}\left( \frac{r-\mu }{\sqrt{2}\sigma } \right) \right] \end{aligned}$$

where \(\text {erf}\) denotes the Gauss error function. Let \((\pi _j)_j = (\varPi _j/\sum _k \varPi _k)_j\) denote the normalized weights. The posterior is then a mixture of truncated Normals with disjoint supports. To sample from it, we proceed in two steps. First, we sample a categorical variable J with \(\mathbb {P}(J=j) = \pi _j\). If \(J = i_- - 2\), we sample \(X^{*}_{k}\) from a truncated Normal with mean \(\mu \) and variance \(\sigma ^2\) restricted to \((-\infty ,-r)\). If \(J = i_+ + 1\), we sample \(X^{*}_{k}\) from a truncated Normal with the same mean and variance restricted to \((r,\infty )\). Otherwise, we sample \(X^{*}_{k}\) from a truncated Normal with mean \(\tilde{\mu }_J\) and variance \(\sigma ^2\) restricted to \((\widetilde{Z}_{J}, \widetilde{Z}_{J+1})\).
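The following is a minimal NumPy/SciPy sketch of this two-step sampler (our own illustration, not the authors' implementation). For numerical robustness it recomputes the x-free likelihood constants directly from cumulative sums of the sorted observations, which agrees with the \(C_j\) of Proposition A up to a global factor; the function name and the log-weight normalization are our choices.

import numpy as np
from scipy.special import erf
from scipy.stats import truncnorm

def sample_cluster_location(Z, r, lam, mu, sigma, rng=None):
    """One draw of X*_k | Z for Z_i | X ~ Laplace(Pi_[-r,r](X), lam)
    and a N(mu, sigma^2) base measure P0; Z holds the observations
    currently assigned to cluster k."""
    rng = np.random.default_rng() if rng is None else rng
    Z = np.sort(np.asarray(Z, dtype=float))
    n_k = len(Z)
    # Breakpoints: -r, the observations falling inside [-r, r], then r.
    b = np.concatenate(([-r], Z[(Z >= -r) & (Z <= r)], [r]))
    n_below = int(np.sum(Z < -r))              # observations below -r
    S = np.concatenate(([0.0], np.cumsum(Z)))  # S[j] = sum of j smallest Z's
    log_w, comps = [], []
    # Middle intervals [b[m], b[m+1]): exactly j observations lie at or
    # below any x there, so sum_i |Z_i - x| = (2j - n_k) x + S[n_k] - 2 S[j].
    for m in range(len(b) - 1):
        j = n_below + m
        mu_j = mu + (n_k - 2.0 * j) * sigma**2 / lam   # tilted Normal mean
        log_c = (2.0 * S[j] - S[n_k]) / lam + (mu_j**2 - mu**2) / (2 * sigma**2)
        mass = 0.5 * (erf((b[m + 1] - mu_j) / (np.sqrt(2.0) * sigma))
                      - erf((b[m] - mu_j) / (np.sqrt(2.0) * sigma)))
        log_w.append(log_c + np.log(max(mass, 1e-300)))
        comps.append((mu_j, b[m], b[m + 1]))
    # Tails: the projection clips X*_k, so the likelihood is constant there
    # and equals its value at the nearest edge; the prior mean is untilted.
    for edge, lo, hi in [(-r, -np.inf, -r), (r, r, np.inf)]:
        j = n_below if edge < 0 else n_below + len(b) - 2
        log_c = (2.0 * S[j] - S[n_k]) / lam + (n_k - 2.0 * j) * edge / lam
        mass = 0.5 * (erf((hi - mu) / (np.sqrt(2.0) * sigma))
                      - erf((lo - mu) / (np.sqrt(2.0) * sigma)))
        log_w.append(log_c + np.log(max(mass, 1e-300)))
        comps.append((mu, lo, hi))
    log_w = np.asarray(log_w)
    w = np.exp(log_w - log_w.max())
    J = rng.choice(len(w), p=w / w.sum())       # step 1: pick a component
    m_J, lo, hi = comps[J]                      # step 2: truncated Normal draw
    return truncnorm.rvs((lo - m_J) / sigma, (hi - m_J) / sigma,
                         loc=m_J, scale=sigma, random_state=rng)

Each sweep of Algorithm 2 would call such a routine once per cluster, after the cluster assignments have been resampled.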

Appendix B: Proof of Proposition 1

Denote first by \(M_P (Z_i) = \int Q(Z_i|X_i) P(dX_i)\) the marginal distribution of the observations when the sensitive data is distributed according to P. Denoting by \(P_*\) the true distribution of the sensitive data \(X_i\), the true marginal distribution of \(Z_i\) is then \(M_{P_*}\). We will prove Proposition 1 in the following steps:

  1. Step 1: We show that

    $$\begin{aligned} \forall \epsilon > 0,\ \varPi (h(M_P,M_{P_*}) > \epsilon \ |\ Z_{1:n}) \rightarrow 0 \ \ \text {a.s.} \end{aligned}$$
    (5)

    Here, \(\varPi \) denotes the Dirichlet process prior, \(\varPi ( \cdot |Z_{1:n})\) the posterior under the DPM model, and h the Hellinger distance.

  2. Step 2: We will show that for any \(\delta > 0\),

    $$\begin{aligned} W_2(P,P_*)^2 \le C_\delta h(M_P,M_{P_*}) ^{3/4}+ C \delta ^2 \end{aligned}$$
    (6)

    where \(W_2\) is the \(\mathbb {L}_2\) Wasserstein distance.

  3. Conclusion: Using Steps 1 and 2, we will show that for any \(\epsilon > 0\),

    $$\begin{aligned} \varPi (W_1(P,P_*) > \epsilon \ |\ Z_{1:n}) \rightarrow 0 \ \ \text {a.s.} \end{aligned}$$
    (7)

    Now, since \(W_1\) is convex and uniformly bounded on the space of probability measures on \(\mathcal {X} \subset [a,b]\), Theorem 6.8 of [13] gives that \(\mathbb {E}(P|Z_{1:n})\) converges almost surely to \(P_*\) in the \(W_1\) metric. Since \([a,b]\) is compact, this implies convergence in \(W_k\) for any \(k \ge 1\).

To simplify the reading of the proof, in what follows C will denote constant quantities (in particular, not depending on n) that may change from line to line.

Let us start with the easiest step, Eq. (7) of Step 3. Let \(\epsilon > 0\); from Eq. (6), we know that

$$\begin{aligned} \varPi (W_2(P,P_*)^2 \le \epsilon \ |\ Z_{1:n}) \ge \varPi (C_\delta h(M_P,M_{P_*})^{3/4} + C \delta ^2 \le \epsilon \ |\ Z_{1:n}). \end{aligned}$$

Take \(\delta \) such that \(C \delta ^2 \le \epsilon /2\). The right-hand side, and hence the left-hand side, of the previous inequality is then lower bounded by \( \varPi (C_\delta h(M_P,M_{P_*})^{3/4} \le \epsilon /2 \ |\ Z_{1:n})\), which by Eq. (5) converges almost surely to 1. This proves convergence in \(W_2\), which implies (7) since \(\mathcal {X} \subset [a,b]\).

Now let us consider Step 1. The Dirichlet process prior \(\varPi \) induces a prior (also denoted \(\varPi \)) on the marginals \(M_P\) of the \(Z_i\). Since \(Z_i \overset{iid}{\sim } M_{P_*}\), Schwartz's theorem guarantees that (5) holds as long as \(M_{P_*}\) lies in the Kullback-Leibler support of \(\varPi \). We will use Theorem 7.2 of [13] to prove this. Let

$$\underline{Q}(Z_i ; \mathcal {X}) = \inf \limits _{x \in \mathcal {X}} Q(Z_i | x).$$

Let \(Z_i \in \mathcal {Z}\). For any \(X_i \in \mathcal {X}\), the differential privacy condition gives

$$Q(Z_i | X_i) \le e^\alpha \underline{Q}(Z_i ; \mathcal {X}) < +\infty , $$

which corresponds to condition (A1) in Theorem 7.2 of [13]. We only need to prove that (A2) holds, i.e.

$$\begin{aligned} \int \log \left( \frac{M_{P_*}(Z_i)}{\underline{Q}(Z_i ; \mathcal {X})} \right) M_{P_*}(dZ_i) < +\infty . \end{aligned}$$

To see this, we rewrite the ratio inside the \(\log \) as follows:

$$\begin{aligned} \frac{M_{P_*}(Z_i)}{\underline{Q}(Z_i ; \mathcal {X})} = \int \frac{Q(Z_i|X_i)}{\underline{Q}(Z_i ; \mathcal {X})} P_*(dX_i) \le e^\alpha \int P_*(dX_i) = e^\alpha , \end{aligned}$$

where the inequality is due to the differential privacy property of Q. This proves Step 1.
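As a concrete instance (our own check, not part of the original proof): for the clipped Laplace mechanism of Appendix A with \(\lambda _\alpha = 2r/\alpha \), for any \(X_i, X_i' \in \mathcal {X}\),

$$\begin{aligned} \frac{Q(Z_i|X_i)}{Q(Z_i|X_i')} = e^{\frac{|Z_i - \varPi _{[-r,r]}(X_i')| - |Z_i - \varPi _{[-r,r]}(X_i)|}{\lambda _\alpha }} \le e^{\frac{2r}{\lambda _\alpha }} = e^{\alpha }, \end{aligned}$$

since the two projected values are at distance at most 2r; taking the infimum over \(X_i'\) gives exactly the bound \(Q(Z_i|X_i) \le e^{\alpha } \underline{Q}(Z_i; \mathcal {X})\) used above.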

It remains to prove Step 2. We first remark that, since the noise is additive in our setting, \(Q(Z_i | X_i) = C_Q e^{-\alpha \rho (X_i-Z_i)/\varDelta }\), where \(C_Q\) is a constant (independent of \(X_i\)). Denote by \(f: t \mapsto C_Q e^{-\alpha \rho (t)/\varDelta }\) the noise density, by \(\mathcal {L}(f)\) its Fourier transform, and by \(P*f\) the convolution of P and f. We also recall that

$$\begin{aligned} \mathcal {L}(M_P) = \mathcal {L}(P*f) = \mathcal {L}(P) \mathcal {L}(f). \end{aligned}$$

This part follows the same strategy as the proof of Theorem 2 in [20], the main difference being that here we are not interested in rates and hence need weaker conditions on f. As in that proof, we define a symmetric density K on \(\mathbb {R}\) whose Fourier transform \(\mathcal {L}(K)\) is continuous, bounded and with support included in \([-1,1]\). Let \(\delta \in (0,1)\) and \(K_\delta (x) = \frac{1}{\delta } K(x/\delta )\). Following the lines of the proof of Theorem 2 in [20], we find that

$$\begin{aligned} W_2^2( P, P_*) \le C ( || P * K_\delta - P_* * K_\delta ||_2^{3/4} + \delta ^2), \end{aligned}$$
(8)

where C is a constant (depending only on K), and that

$$\begin{aligned} || P * K_\delta - P_* * K_\delta ||_2 \le 2\, d_{TV} (M_P, M_{P_*})\, || g_\delta ||_2, \end{aligned}$$

where \(g_\delta \) is the inverse Fourier transform of \( \frac{\mathcal {L}(K_\delta )}{\mathcal {L}(f)}\) and \(d_{TV}\) the total variation distance. Now, using Plancherel’s identity it comes that

$$\begin{aligned} ||g_\delta ||_2^2 \le C \int \left| \frac{ \mathcal {L}(K_\delta )(t) }{\mathcal {L}(f)(t)} \right| ^2 dt = C \int _{[-1/\delta , 1/\delta ]} \left| \frac{ \mathcal {L}(K_\delta )(t) }{\mathcal {L}(f)(t)} \right| ^2 dt \le \frac{C}{\delta } \sup \limits _{[-1/\delta , 1/\delta ] } |\mathcal {L}(f)|^{-2}, \end{aligned}$$

where the equality comes from the fact that the support of \(\mathcal {L}(K_\delta )\) is contained in \([-1/\delta ,1/\delta ]\), and the last inequality from the fact that \(\mathcal {L}(K_\delta )\) is bounded. Since \(|\mathcal {L}(f)|\) is strictly positive (by assumption) and continuous, it follows that \(C_\delta ^2 = \frac{C}{\delta } \sup \limits _{[-1/ \delta , 1/ \delta ] } |\mathcal {L}(f)|^{-2} < +\infty \). Using the bound \(d_{TV} \le \sqrt{2}\ h\), we can write

$$\begin{aligned} || P * K_\delta - P_* * K_\delta ||_2 \le C_\delta h(M_P, M_{P_*}), \end{aligned}$$
(9)

which together with (8) gives

$$ W_2^2( P, P_*) \le C_\delta h(M_P,M_{P_*}) ^{3/4}+ C \delta ^2. $$

Convergence of moments follows directly from [22] (Theorems 6.7 and 6.8).
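As a concrete instance of the finiteness of \(C_\delta \) (our own check, not part of the original proof): for the Laplace mechanism, \(f(t) = \frac{1}{2\lambda _\alpha } e^{-|t|/\lambda _\alpha }\) and

$$\begin{aligned} \mathcal {L}(f)(t) = \frac{1}{1 + \lambda _\alpha ^2 t^2}, \end{aligned}$$

which is continuous and strictly positive, so that \(\sup _{[-1/\delta , 1/\delta ]} |\mathcal {L}(f)|^{-2} = (1 + \lambda _\alpha ^2/\delta ^2)^2 < +\infty \) for every fixed \(\delta \).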


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Ayed, F., Battiston, M., Di Benedetto, G. (2020). A Bayesian Nonparametric Approach to Differentially Private Data. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases. PSD 2020. Lecture Notes in Computer Science, vol. 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_3

  • DOI: https://doi.org/10.1007/978-3-030-57521-2_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-57520-5

  • Online ISBN: 978-3-030-57521-2
