Abstract
The setting of our problem is a distributed architecture facing an enormous user set, where events repeat and evolve over time, and the stream of events must be absorbed into a model: first into a local model on the client, then into the global model on the server, all while protecting user privacy. Naturally, we learn phenomena that happen in many places at once (such as malware spreading over smartphones, user responses to the operation and UX of an app, or other such events). To this end, we consider a configuration in which the learning server is built to handle a potentially high-frequency, high-volume environment in a naturally distributed fashion, while also attending to the statistical convergence and privacy properties of the setting. We propose a novel framework for privacy-preserving, client-distributed machine learning. It is designed to provide differential privacy guarantees in the local model of privacy while satisfying systems constraints: it relies on a high number of asynchronous client-server communications, requires little coordination among separate clients, and uses a communication model that is simple to implement and, in some settings (e.g., user-facing apps), already exists, while providing attractive model-learning properties.
We develop a generic randomized learning algorithm, "Draw and Discard," so named because it relies on random sampling and discarding of model instances for load distribution and scalability; this design also provides additional server-side privacy protection and improved model quality through averaging. The framework is general, and we show its applicability to Generalized Linear Models. We analyze the statistical stability and privacy guarantees provided by our approach against faults and against several types of adversaries, and we then present experimental results. Our framework (first reported in [28]) has been deployed experimentally in a real industrial setting. We view this work as an initial combination of machine learning and distributed systems, and we believe it opens numerous directions for further development.
Notes
1. This is only approximately correct, since in a high-throughput environment, another client request could have updated the same model in the meantime.
References
Abadi, M., et al.: Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS 2016), pp. 308–318 (2016). http://doi.acm.org/10.1145/2976749.2978318
Apple Differential Privacy Team: Learning with privacy at scale. Apple Mach. Learn. J. 1(8) (2017). https://machinelearning.apple.com/2017/12/06/learning-with-privacy-at-scale.html
Bassily, R., Nissim, K., Stemmer, U., Thakurta, A.G.: Practical locally private heavy hitters. In: Advances in Neural Information Processing Systems, pp. 2285–2293 (2017)
Bassily, R., Smith, A.: Local, private, efficient protocols for succinct histograms. In: Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 127–135. ACM (2015)
Bun, M., Nelson, J., Stemmer, U.: Heavy hitters and the structure of local privacy. arXiv preprint arXiv:1711.04740 (2017)
Chaudhuri, K., Monteleoni, C.: Privacy-preserving logistic regression. In: Advances in Neural Information Processing Systems, pp. 289–296 (2009)
Delange, J.: Why using asynchronous communications? (2017). http://julien.gunnm.org/programming/linux/2017/04/15/comparison-sync-vs-async
Duchi, J.C., Jordan, M.I., Wainwright, M.J.: Local privacy and statistical minimax rates. In: 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pp. 429–438 (2014). https://doi.org/10.1109/FOCS.2013.53
Dwork, C.: A firm foundation for private data analysis. Commun. ACM 54(1), 86–95 (2011)
Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 265–284. Springer, Heidelberg (2006). https://doi.org/10.1007/11681878_14
Dwork, C., Roth, A.: The algorithmic foundations of differential privacy. Found. Trends® Theor. Comput. Sci. 9(3–4), 211–407 (2014)
Erlingsson, Ú., Pihur, V., Korolova, A.: RAPPOR: randomized aggregatable privacy-preserving ordinal response. In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS 2014), pp. 1054–1067 (2014)
European Association for Theoretical Computer Science: Gödel Prize (2017). https://eatcs.org/index.php/component/content/article/1-news/2450-2017-godel-prize
Fanti, G., Pihur, V., Erlingsson, Ú.: Building a RAPPOR with the unknown: privacy-preserving learning of associations and data dictionaries. Proc. Privacy Enhancing Technol. 3, 41–61 (2016)
Feldman, V., Mironov, I., Talwar, K., Thakurta, A.: Privacy amplification by iteration. ArXiv e-prints abs/1808.06651, August 2018
Google: Google Safe Browsing (2018). https://safebrowsing.google.com/
Greenberg, A.: Apple's differential privacy is about collecting your data - but not your data. Wired, 13 June 2016
Kairouz, P., Oh, S., Viswanath, P.: Extremal mechanisms for local differential privacy. In: Advances in Neural Information Processing Systems, pp. 2879–2887 (2014)
Kenthapadi, K., Korolova, A., Mironov, I., Mishra, N.: Privacy via the Johnson-Lindenstrauss transform. J. Privacy Confidentiality 5(1), 39–71 (2013)
Madden, M., Rainie, L.: Americans’ Attitudes About Privacy, Security and Surveillance. Pew Research Center (2015). http://www.pewinternet.org/2015/05/20/americans-attitudes-about-privacy-security-and-surveillance/
Madden, M., Rainie, L.: Privacy and Information Sharing. Pew Research Center (2016). http://www.pewinternet.org/2016/01/14/privacy-and-information-sharing/
McMahan, H.B., Moore, E., Ramage, D., Hampson, S., Agüera y Arcas, B.: Communication-efficient learning of deep networks from decentralized data. In: AISTATS (2017)
McMahan, H.B., Ramage, D., Talwar, K., Zhang, L.: Learning differentially private language models without losing accuracy. CoRR abs/1710.06963 (2017). arxiv:1710.06963
McSherry, F.: Deep learning and differential privacy (2017). https://github.com/frankmcsherry/blog/blob/master/posts/2017-10-27.md
Nelder, J.A., Wedderburn, R.W.M.: Generalized linear models. J. R. Stat. Soc. Ser. A (General) 135, 370–384 (1972)
Nissim, K., et al.: Differential privacy: a primer for a non-technical audience (Preliminary Version). Vanderbilt J. Entertainment Technol. Law (2018)
Papernot, N., Abadi, M., Erlingsson, Ú., Goodfellow, I., Talwar, K.: Semi-supervised knowledge transfer for deep learning from private training data. In: 5th International Conference on Learning Representations (2016)
Pihur, V.: Differentially-Private “Draw and Discard” Machine Learning (2018). https://doi.org/10.48550/ARXIV.1807.04369
Portnoy, E., Gebhart, G., Grant, S.: Facial recognition, differential privacy, and trade-offs in Apple's latest OS releases. EFF DeepLinks Blog (2016). www.eff.org/deeplinks/2016/09/facial-recognition-differential-privacy-and-trade-offs-apples-latest-os-releases
Shokri, R., Shmatikov, V.: Privacy-preserving deep learning. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, pp. 1310–1321 (2015)
Song, S., Chaudhuri, K., Sarwate, A.D.: Stochastic gradient descent with differentially private updates. In: Global Conference on Signal and Information Processing (GlobalSIP), pp. 245–248. IEEE (2013)
WWDC 2016. WWDC 2016 Keynote, June 2016. https://www.apple.com/apple-events/june-2016/
Appendix
A Variance Stabilization Proof and Other Proofs
Theorem 1 Let \(B = (B_1, \ldots , B_k)\) be a vector of k random variables (weights) with mean \(\mu \) and variance \(\frac{k}{2}\sigma ^2\). Selecting one of the weights at random, adding noise with mean 0 and variance \(\sigma ^2\), and writing the result back into a uniformly random position (with replacement) does not change the expected intra-model variance of B (i.e., the weights remain distributed with variance \(\frac{k}{2}\sigma ^2\)).
Proof
We use the Law of Total Variance three times, thinking of B as a mixture of mixtures. In the outer mixture, the expectation is taken over whether the same or a different model was updated (\(X \in \{j \rightarrow j, j \rightarrow j'\}\)). In the inner mixture, the expectation is taken over the partitioning of the k weights themselves (Z).
Total variance, therefore, is equal to \(V(B) = V(E(B|X)) + E(V(B|X))\), where each conditional piece is expanded again over \(Z\).
Because we add noise with mean 0, in either case, the mean of B does not change, so \(V(E(B|X)) = 0\).
Replacing the same weight partitions the k weights into two sets: the single updated weight and the remaining weights. The partition means do not change, and the variance increases due to the added noise in the first partition (mixture component). Thus, \(V(B | j \rightarrow j)\) is given as
Replacing a different weight partitions the weight space into subsets of size 2 and \(k-2\). Unlike in the first case, V(E(B|Z)) is non-zero, because a single weight is essentially replicated in the first partition (\(B_1\) has a mean of 0, but \(B_2\) has a mean of \(B_1\)). After the update, the overall mean of B under Z becomes
where \(\mu _1\) is the mean of the model selected and model replaced and has a distribution with mean \(\mu \) and variance \(\frac{k}{2}\sigma ^2\). Note that the mean of B over sampling in X is still 0 as \(E(B_1) = 0\) over X.
Then \(V(B|j \rightarrow j')\)
Note that the variance component must be computed with \(k-1\) and not k because of the finite nature of k in this case.
Putting it all together, the expected intra-model variance after the update remains \(\frac{k}{2}\sigma ^2\), which completes the proof. \(\square \)
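The stabilization claim can also be checked empirically. The sketch below is a minimal Monte Carlo check, not the paper's deployment code; the values of k and \(\sigma \) and the choice of Gaussian noise are illustrative assumptions. It draws k weights with variance \(\frac{k}{2}\sigma ^2\), applies one draw/add-noise/replace step, and compares the average sample variance before and after the update:

```python
import random
import statistics

def draw_and_discard_step(weights, sigma, rng):
    """One update: draw a random instance, add zero-mean noise with
    variance sigma^2, and write the result back into a uniformly
    random slot (possibly the same one)."""
    drawn = weights[rng.randrange(len(weights))] + rng.gauss(0.0, sigma)
    out = list(weights)
    out[rng.randrange(len(weights))] = drawn
    return out

rng = random.Random(0)
k, sigma = 5, 1.0
stable_var = k * sigma**2 / 2  # claimed stationary intra-model variance

before, after = [], []
for _ in range(20000):
    w = [rng.gauss(0.0, stable_var**0.5) for _ in range(k)]
    before.append(statistics.variance(w))
    after.append(statistics.variance(draw_and_discard_step(w, sigma, rng)))

mean_before = sum(before) / len(before)
mean_after = sum(after) / len(after)
```

Both averages come out close to \(\frac{k}{2}\sigma ^2 = 2.5\) here, consistent with the theorem.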
Lemma 1 In the long run, DDML discards a fraction \(1-\frac{1}{k}\) of updates.
Proof
Consider a Markov process on the states \(0, 1, \ldots , k\), where each state represents the number of model instances in which a particular update can be found. Denote by \(p_i\) the probability of moving from state i to \(i-1\), which equals the probability of moving to \(i+1\). In DDML, \(p_i = \frac{(k-i)i}{k^2}\) for \(1 \le i \le k-1\), and \(p_0 = 0, p_k = 1\).
Let \(q_i\) be the probability of eventually ending up at state 0 when starting in state i. By the setup of DDML, for \(1< i<k-1\) we have:
\(q_i = p_iq_{i-1} + p_iq_{i+1}+(1-2p_i)q_i,\) or \(2q_i = q_{i-1}+q_{i+1}\).
We also know that:
- \(q_0=1, p_0=0\), \(q_k=0, p_k=1\),
- \(q_1 = p_1q_0+p_1q_2+(1-2p_1)q_1\),
- \(q_{k-1} = p_{k-1}q_{k-2} + p_{k-1}q_k+(1-2p_{k-1})q_{k-1}\).
Summing the equations for \(1 \le i \le k-1\), we have \(2q_1 + \cdots + 2 q_{k-1} = q_0 + q_1 + 2(q_2+\cdots +q_{k-2})+q_{k-1}+q_{k}\). Simplifying: \(q_1+q_{k-1} = q_0+q_k\), or
$$\begin{aligned} q_1 + q_{k-1} = 1 \end{aligned}$$(5)
On the other hand,
- \(q_{k-2} = 2q_{k-1}\),
- \(q_{k-3} = 2q_{k-2} - q_{k-1} = 3q_{k-1}\),
- \(q_{k-4}=2q_{k-3}-q_{k-2} = 4q_{k-1}\),

and, continuing in this way,
$$\begin{aligned} q_1 = 2q_2-q_3 = (k-1)q_{k-1} \end{aligned}$$(6)
Combining (5) and (6), we have \((k-1)q_{k-1}+q_{k-1} = 1\), so \(q_{k-1}=\frac{1}{k}\) and \(q_1 = 1-\frac{1}{k}\). Since DDML is set up so that each particular contribution starts in state 1, this completes the proof.
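The absorption probability can be sanity-checked numerically. Since the up- and down-transition probabilities are both \(p_i\), the lazy steps do not affect where the walk is absorbed, so conditional on moving the chain is a symmetric random walk on \(\{0, \ldots , k\}\). The sketch below (the values of k and the trial count are illustrative assumptions) estimates the probability that an update starting in state 1 is eventually discarded:

```python
import random

def discard_probability(k, trials, rng):
    """Estimate the probability that an update starting in state 1
    is absorbed at 0 (discarded) rather than at k."""
    discarded = 0
    for _ in range(trials):
        state = 1
        while 0 < state < k:
            # Up and down moves both occur with probability p_i, so
            # conditional on moving the walk is symmetric; the lazy
            # (stay-in-place) steps can be skipped without changing
            # the absorption probabilities.
            state += rng.choice((-1, 1))
        discarded += (state == 0)
    return discarded / trials

rng = random.Random(0)
k = 5
estimate = discard_probability(k, 20000, rng)  # close to 1 - 1/k = 0.8
```

This is exactly the classical gambler's-ruin probability of hitting 0 from state 1, which equals \(1-\frac{1}{k}\).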
Theorem 2 With high probability, DDML guarantees a user \((\epsilon _T, \delta _T)\)-differential privacy against adversary III, where \(\epsilon _T = \frac{\epsilon }{\sqrt{2T}} \sqrt{\ln \left( \frac{1}{2\delta _T}\right) } \) and \(\delta _T\) is an arbitrarily small constant (typically chosen as O(1/number of users)). Here T is the number of updates made to the model instance between the time the user submitted their update and the time the adversary observes the instance. The statement holds for sufficiently large T.
Proof
We rely on a result from the concurrent and independent work of [15], obtained in a different context, to analyze the privacy amplification in our case. Specifically, their result states that for any contractive noisy process, privacy amplification is no worse than for the identity contraction, which we analyze below.
By the Central Limit Theorem, the sum of T random variables drawn independently from the Laplace distribution with mean 0 tends toward a normal distribution for sufficiently large T. In DDML's case with Laplace noise, the variance of each random variable is \(\frac{8\gamma ^2}{\epsilon ^2}\); therefore, if the adversary observes the model instance after T updates to it, the variance of the accumulated noise is \(T \cdot \frac{8\gamma ^2}{\epsilon ^2}\). This corresponds to a Gaussian with scale \(\sigma = \frac{2\sqrt{2T}\gamma }{\epsilon }\).
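To illustrate this step, the sketch below (all parameter values are illustrative assumptions, not the paper's deployment settings) samples sums of T Laplace draws with scale \(b = 2\gamma /\epsilon \), i.e., per-draw variance \(2b^2 = \frac{8\gamma ^2}{\epsilon ^2}\), and compares their empirical standard deviation with the predicted \(\sigma = \frac{2\sqrt{2T}\gamma }{\epsilon }\):

```python
import math
import random

def laplace(b, rng):
    # The difference of two independent Exp(mean b) draws is Laplace(0, b).
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)

rng = random.Random(0)
T, gamma, eps = 400, 1.0, 2.0
b = 2.0 * gamma / eps                          # per-update Laplace scale
predicted = 2.0 * math.sqrt(2.0 * T) * gamma / eps  # Gaussian scale after T updates

# Empirical standard deviation of sums of T Laplace draws.
sums = [sum(laplace(b, rng) for _ in range(T)) for _ in range(4000)]
mean = sum(sums) / len(sums)
empirical = math.sqrt(sum((s - mean) ** 2 for s in sums) / (len(sums) - 1))
```

With these values the predicted scale is \(\sqrt{800} \approx 28.3\), and the empirical standard deviation tracks it within a few percent.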
Lemma 1 from [19] states that for points in p-dimensional space that differ by at most w in \(\ell _2\), addition of noise drawn from \(N^p(0, \sigma ^2_T)\), where \(\sigma _T \ge w \frac{\sqrt{2\left( \ln \left( \frac{1}{2\delta _T}\right) +\epsilon _T\right) }}{\epsilon _T}\) and \(\delta _T < \frac{1}{2}\) ensures \((\epsilon _T, \delta _T)\) differential privacy. We use the result of Lemma 1 from [19], rather than the more commonly referenced result from Theorem A.1 of [11], because the latter result holds only for \(\epsilon _T \le 1\), which is not the privacy loss used in most practical applications.
We now ask: what approximate differential privacy guarantee does DDML achieve against adversary III? To answer this, fix a desired level of \(\delta _T\) and use the approximation obtained from the Central Limit Theorem to solve for \(\epsilon _T\).
Solving the quadratic inequality, we have:
For large T, the additive term of 1 under the square root is negligible, so we have:
In DDML, \(w=\sqrt{2}\gamma \); therefore, \(\epsilon _T = \frac{\epsilon }{\sqrt{2T}} \sqrt{\ln \left( \frac{1}{2\delta _T}\right) }\), as stated in Theorem 2.
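The resulting guarantee is easy to evaluate numerically. The helper below is a sketch of the large-T formula from Theorem 2; the function name and the parameter values in the example are illustrative assumptions:

```python
import math

def epsilon_after_T_updates(eps, T, delta_T):
    """Large-T approximation from Theorem 2:
    eps_T = eps / sqrt(2T) * sqrt(ln(1 / (2 * delta_T)))."""
    return eps / math.sqrt(2.0 * T) * math.sqrt(math.log(1.0 / (2.0 * delta_T)))

# Example: a per-update eps = 1 guarantee tightens as 1/sqrt(T).
e_50 = epsilon_after_T_updates(1.0, 50, 1e-6)     # roughly 0.36
e_5000 = epsilon_after_T_updates(1.0, 5000, 1e-6)  # 10x smaller
```

Increasing T by a factor of 100 shrinks \(\epsilon _T\) by a factor of 10, reflecting the \(\frac{1}{\sqrt{2T}}\) dependence.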
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Pihur, V. et al. (2022). Differentially-Private “Draw and Discard” Machine Learning: Training Distributed Model from Enormous Crowds. In: Dolev, S., Katz, J., Meisels, A. (eds) Cyber Security, Cryptology, and Machine Learning. CSCML 2022. Lecture Notes in Computer Science, vol 13301. Springer, Cham. https://doi.org/10.1007/978-3-031-07689-3_33
Print ISBN: 978-3-031-07688-6
Online ISBN: 978-3-031-07689-3