Differentially-Private “Draw and Discard” Machine Learning: Training Distributed Model from Enormous Crowds

  • Conference paper
Cyber Security, Cryptology, and Machine Learning (CSCML 2022)

Abstract

The setting of our problem is a distributed architecture facing an enormous user set, where events repeat and evolve over time, and the stream of events must be absorbed into the model: first into a local model, then into the global one, all while caring for user privacy. Naturally, we learn a phenomenon that occurs in a distributed fashion across many places (such as malware spreading over smartphones, user behavior in response to the operation and UX of an app, or other such events). To this end, we consider a configuration in which the learning server is built to handle a potentially high-frequency, high-volume environment in a naturally distributed fashion, while also taking care of the statistical convergence and privacy properties of the setting. We propose a novel framework for privacy-preserving, client-distributed machine learning. It is based on the desire to provide differential privacy guarantees in the local model of privacy in a way that satisfies systems constraints using a large number of asynchronous client-server communications (i.e., little coordination among separate clients, a communication model that is simple to implement and in some settings already exists, e.g., in user-facing apps), and it provides attractive model learning properties.

We develop a generic randomized learning algorithm, named “Draw and Discard” because it relies on randomly sampling (drawing) and discarding model instances for load distribution and scalability, which also provides additional server-side privacy protection and improves model quality through averaging. The framework is general, and we show its applicability to Generalized Linear models. We analyze the statistical stability and privacy guarantees provided by our approach against faults and against several types of adversaries, and we then present experimental results. Our framework (first reported in [28]) has been experimentally deployed in a real industrial setting. We view the result as an initial combination of machine learning and distributed systems, and we believe it opens numerous directions for further development.
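
As a minimal illustration of the server side of this pattern (a sketch with hypothetical names, not the implementation used in the paper), the server keeps k instances of the model, serves a uniformly random instance to each requesting client, and overwrites a uniformly random instance with the locally updated, noise-perturbed model it receives back; the global model is the average of the k instances:

```python
import random

class DrawAndDiscardServer:
    """Sketch: k model instances, drawn and discarded uniformly at random."""

    def __init__(self, k, initial_model):
        # Start all k instances from the same initial weight vector.
        self.instances = [list(initial_model) for _ in range(k)]

    def draw(self):
        # "Draw": hand a uniformly random instance to the requesting client.
        return random.choice(self.instances)

    def discard(self, updated_model):
        # "Discard": overwrite a uniformly random instance with the client's
        # returned (locally updated, noise-perturbed) model.
        self.instances[random.randrange(len(self.instances))] = updated_model

    def estimate(self):
        # Averaging over the k instances yields the global model estimate.
        k, d = len(self.instances), len(self.instances[0])
        return [sum(m[j] for m in self.instances) / k for j in range(d)]
```

On the client side, the drawn instance would be updated on local data and perturbed with mean-zero (e.g., Laplace) noise before being returned; the appendix below analyzes the variance and privacy consequences of this scheme.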

Notes

  1. This is only approximately correct, since in a high-throughput environment, another client request could have updated the same model in the meantime.

References

  1. Abadi, M., et al.: Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS 2016), pp. 308–318 (2016). http://doi.acm.org/10.1145/2976749.2978318

  2. Apple Differential Privacy Team: Learning with Privacy at Scale, vol. 1. Apple Mach. Learn. J. (8) (2017). https://machinelearning.apple.com/2017/12/06/learning-with-privacy-at-scale.html

  3. Bassily, R., Nissim, K., Stemmer, U., Thakurta, A.G.: Practical locally private heavy hitters. In: Advances in Neural Information Processing Systems, pp. 2285–2293 (2017)

  4. Bassily, R., Smith, A.: Local, private, efficient protocols for succinct histograms. In: Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 127–135. ACM (2015)

  5. Bun, M., Nelson, J., Stemmer, U.: Heavy hitters and the structure of local privacy. arXiv preprint arXiv:1711.04740 (2017)

  6. Chaudhuri, K., Monteleoni, C.: Privacy-preserving logistic regression. In: Advances in Neural Information Processing Systems, pp. 289–296 (2009)

  7. Delange, J.: Why using asynchronous communications? (2017). http://julien.gunnm.org/programming/linux/2017/04/15/comparison-sync-vs-async

  8. Duchi, J.C., Jordan, M.I., Wainwright, M.J.: Local privacy and statistical minimax rates. In: 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pp. 429–438 (2013). https://doi.org/10.1109/FOCS.2013.53

  9. Dwork, C.: A firm foundation for private data analysis. Commun. ACM 54(1), 86–95 (2011)

  10. Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 265–284. Springer, Heidelberg (2006). https://doi.org/10.1007/11681878_14

  11. Dwork, C., Roth, A.: The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9(3–4), 211–407 (2014)

  12. Erlingsson, Ú., Pihur, V., Korolova, A.: RAPPOR: randomized aggregatable privacy-preserving ordinal response. In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS 2014), pp. 1054–1067 (2014)

  13. European Association for Theoretical Computer Science. Gödel Prize (2017). https://eatcs.org/index.php/component/content/article/1-news/2450-2017-godel-prize

  14. Fanti, G., Pihur, V., Erlingsson, Ú.: Building a RAPPOR with the unknown: privacy-preserving learning of associations and data dictionaries. Proc. Privacy Enhancing Technol. 3, 41–61 (2016)

  15. Feldman, V., Mironov, I., Talwar, K., Thakurta, A.: Privacy amplification by iteration. ArXiv e-prints abs/1808.06651, August 2018

  16. Google. Google Safe Browsing (2018). https://safebrowsing.google.com/

  17. Greenberg, A.: Apple’s Differential Privacy is About Collecting Your Data - But Not Your Data. In Wired, 13 June 2016

  18. Kairouz, P., Oh, S., Viswanath, P.: Extremal mechanisms for local differential privacy. In: Advances in Neural Information Processing Systems, pp. 2879–2887 (2014)

  19. Kenthapadi, K., Korolova, A., Mironov, I., Mishra, N.: Privacy via the Johnson-Lindenstrauss transform. J. Privacy Confidentiality 5(1), 39–71 (2013)

  20. Madden, M., Rainie, L.: Americans’ Attitudes About Privacy, Security and Surveillance. Pew Research Center (2015). http://www.pewinternet.org/2015/05/20/americans-attitudes-about-privacy-security-and-surveillance/

  21. Madden, M., Rainie, L.: Privacy and Information Sharing. Pew Research Center (2016). http://www.pewinternet.org/2016/01/14/privacy-and-information-sharing/

  22. McMahan, H.B., Moore, E., Ramage, D., Hampson, S., Agüera y Arcas, B.: Communication-efficient learning of deep networks from decentralized data. In: AISTATS (2017)

  23. McMahan, H.B., Ramage, D., Talwar, K., Zhang, L.: Learning differentially private language models without losing accuracy. CoRR abs/1710.06963 (2017). arXiv:1710.06963

  24. McSherry, F.: Deep learning and differential privacy (2017). https://github.com/frankmcsherry/blog/blob/master/posts/2017-10-27.md

  25. Nelder, J.A., Wedderburn, R.W.M.: Generalized linear models. J. R. Stat. Soc. Ser. A General 135(3), 370–384 (1972)

  26. Nissim, K., et al.: Differential privacy: a primer for a non-technical audience (Preliminary Version). Vanderbilt J. Entertainment Technol. Law (2018)

  27. Papernot, N., Abadi, M., Erlingsson, Ú., Goodfellow, I., Talwar, K.: Semi-supervised knowledge transfer for deep learning from private training data. In: 5th International Conference on Learning Representations (2016)

  28. Pihur, V.: Differentially-Private “Draw and Discard” Machine Learning (2018). https://doi.org/10.48550/ARXIV.1807.04369

  29. Portnoy, E., Gebhart, G., Grant, S.: Facial recognition, differential privacy, and trade-offs in Apple's latest OS releases. EFF DeepLinks Blog (2016). www.eff.org/deeplinks/2016/09/facial-recognition-differential-privacy-and-trade-offs-apples-latest-os-releases

  30. Shokri, R., Shmatikov, V.: Privacy-preserving deep learning. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, pp. 1310–1321 (2015)

  31. Song, S., Chaudhuri, K., Sarwate, A.D.: Stochastic gradient descent with differentially private updates. In: Global Conference on Signal and Information Processing (GlobalSIP), pp. 245–248. IEEE (2013)

  32. WWDC 2016. WWDC 2016 Keynote, June 2016. https://www.apple.com/apple-events/june-2016/

Author information

Corresponding author

Correspondence to Moti Yung.

Appendix

A Variance Stabilization Proof and Other Proofs

Theorem 1 Let \(B = (B_1, \ldots , B_k)\) be a vector of k random variables (weights) with mean \(\mu \) and variance \(\frac{k}{2}\sigma ^2\). Selecting one of the weights at random, adding noise with mean 0 and variance \(\sigma ^2\), and putting it back with replacement (i.e., overwriting a uniformly chosen weight) does not change the expected intra-model variance of B (i.e., the weights remain distributed with variance \(\frac{k}{2}\sigma ^2\)).

Proof

We use the Law of Total Variance

$$ V(B) = E(V(B|X)) + V(E(B|X)) $$

three times as we think of B as a mixture of mixtures. In the first mixture, the expectation is taken over the distribution of whether the same or different model was updated (\(X \in \{j \rightarrow j, j \rightarrow j'\}\)). In the inner mixture, the expectation is taken over the partitioning of the k weights themselves (Z).

Total variance, therefore, is equal to

$$\begin{aligned} V(B) &= \frac{1}{k}V(B|j \rightarrow j) + \frac{k-1}{k} V(B | j \rightarrow j') \\ &\quad + V(E(B|X)). \end{aligned}$$

Because we add noise with mean 0, in either case, the mean of B does not change, so \(V(E(B|X)) = 0\).

Replacing the same weight partitions the k weights into two sets, a single weight updated and the rest of the weights. Partition means do not change, and the variance increases due to added noise in the first partition (mixture component). Thus, \(V(B | j \rightarrow j)\) is given as

$$\begin{aligned} V(B| j \rightarrow j) &= \frac{1}{k}\left( \frac{k}{2} + 1\right) \sigma ^2 + \frac{k-1}{k}\frac{k}{2}\sigma ^2 \\ &= \left( \frac{1}{2} + \frac{1}{k} + \frac{k-1}{2}\right) \sigma ^2 \\ &= \left( \frac{k}{2} + \frac{1}{k}\right) \sigma ^2. \end{aligned}$$

Replacing a different weight partitions the weight space into subsets of sizes 2 and \(k-2\). Unlike in the first case, V(E(B|Z)) is non-zero, because a single weight is essentially replicated twice in the first partition (\(B_1\) has a mean of 0, but \(B_2\) has a mean of \(B_1\)). After the update, the overall mean of B under Z becomes

$$\begin{aligned} \mu _{Z} = \frac{2\mu _1 + (k-2)\mu }{k}, \end{aligned}$$

where \(\mu _1\) is the mean of the model selected and model replaced and has a distribution with mean \(\mu \) and variance \(\frac{k}{2}\sigma ^2\). Note that the mean of B over sampling in X is still 0 as \(E(B_1) = 0\) over X.

Then

$$\begin{aligned} V(B|j \rightarrow j') &= \frac{2}{k}\frac{1}{2}\sigma ^2 + \frac{k-2}{k} \frac{k}{2}\sigma ^2 \\ &\quad + \frac{2E[(\mu _1 - \mu _{B_i})^2] + (k-2)E[(\mu -\mu _{B_i})^2]}{k-1} \\ &= \frac{1}{k}\sigma ^2 + \frac{k-2}{2}\sigma ^2 \\ &\quad + \frac{2}{k-1}\left( \frac{k-2}{k}\right) ^2E[(\mu _1 - \mu )^2] + \frac{k-2}{k-1}\left( \frac{2}{k}\right) ^2E[(\mu _1 - \mu )^2] \\ &= \frac{1}{k}\sigma ^2 + \frac{k-2}{2}\sigma ^2 + \frac{2}{k-1}\left( \frac{k-2}{k}\right) ^2\frac{k}{2}\sigma ^2 + \frac{k-2}{k-1}\left( \frac{2}{k}\right) ^2\frac{k}{2}\sigma ^2 \\ &= \left( \frac{1}{k} + \frac{k-2}{2} + \frac{(k-2)^2}{k(k-1)} + \frac{2(k-2)}{k(k-1)}\right) \sigma ^2 \\ &= \frac{2k - 2 + k(k-2)(k-1) + 2(k-2)^2 + 4k - 8}{2k(k-1)}\sigma ^2 \\ &= \frac{k^3 - k^2 - 2}{2k(k-1)}\sigma ^2. \end{aligned}$$

Note that the variance component must be computed with \(k-1\) and not k because of the finite nature of k in this case.

Putting it all together,

$$\begin{aligned} V(B) &= \frac{1}{k}V(B|j \rightarrow j) + \frac{k-1}{k} V(B | j \rightarrow j') \\ &= \frac{1}{k}\left( \frac{k}{2} + \frac{1}{k}\right) \sigma ^2 + \frac{k-1}{k}\frac{k^3 - k^2 - 2}{2k(k-1)}\sigma ^2 \\ &= \left( \frac{1}{2} + \frac{1}{k^2} + \frac{k^3 - k^2 - 2}{2k^2} \right) \sigma ^2 \\ &= \frac{k}{2}\sigma ^2. \end{aligned}$$
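
As a quick sanity check of Theorem 1 (an illustrative Monte Carlo experiment, not part of the paper), one can start from k weights with variance \(\frac{k}{2}\sigma^2\), apply a single draw-and-discard noise step, and confirm that the expected intra-model (sample) variance is unchanged:

```python
import numpy as np

# Monte Carlo sanity check of Theorem 1 (illustrative only): one
# draw-and-discard noise step should leave the expected intra-model
# variance at (k/2) * sigma^2.
rng = np.random.default_rng(0)
k, sigma, mu, trials = 10, 1.0, 3.0, 200_000

var_after = np.empty(trials)
for t in range(trials):
    B = rng.normal(mu, np.sqrt(k / 2) * sigma, size=k)  # variance (k/2) * sigma^2
    j = rng.integers(k)                        # select one weight at random
    y = B[j] + rng.normal(0.0, sigma)          # add mean-0 noise with variance sigma^2
    B[rng.integers(k)] = y                     # put it back "with replacement"
    var_after[t] = B.var(ddof=1)               # intra-model (sample) variance

print(var_after.mean(), k / 2 * sigma**2)      # both close to 5.0
```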

Lemma 1 In the long run, DDML discards a fraction \(1-\frac{1}{k}\) of the updates.

Proof

Consider a Markov process on the states \(0, 1, \ldots , k\), where each state represents the number of model instances in which a particular update can be found. Denote by \(p_i\) the probability of moving from state i to \(i-1\) (and, by symmetry, equally to \(i+1\)). In DDML, \(p_i = \frac{(k-i)i}{k^2}\) for \(1 \le i \le k-1\), and \(p_0 = 0, p_k = 1\).

Let \(q_i\) be the probability of eventually ending up at state 0 if you start in state i. By the set-up of DDML, for \(1< i<k-1\) we have:

\(q_i = p_iq_{i-1} + p_iq_{i+1}+(1-2p_i)q_i,\) or \(2q_i = q_{i-1}+q_{i+1}\).

We also know that

  • \(q_0=1, p_0=0\), \(q_k=0, p_k=1\),

  • \(q_1 = p_1q_0+p_1q_2+(1-2p_1)q_1\),

  • \(q_{k-1} = p_{k-1}q_{k-2} + p_{k-1}q_k+(1-2p_{k-1})q_{k-1}\)

Summing equations for \(1 \le i \le k-1\) we have

\(2q_1 + \cdots + 2 q_{k-1} = q_0 + q_1 + 2(q_2+\cdots +q_{k-2})+q_{k-1}+q_{k}\). Simplifying, \(q_1+q_{k-1} = q_0+q_k\), or

$$\begin{aligned} q_1+q_{k-1} = 1 \end{aligned}$$
(5)

On the other hand,

  • \(q_{k-2} = 2q_{k-1}\),

  • \(q_{k-3} = 2q_{k-2} - q_{k-1} = 3q_{k-1}\)

  • \(q_{k-4}=2q_{k-3}-q_{k-2} = 4q_{k-1}\)

  • \(\cdots \)

    $$\begin{aligned} q_1 = 2q_2-q_3 = (k-1)q_{k-1} \end{aligned}$$
    (6)

Combining (5) and (6), we have \((k-1)q_{k-1}+q_{k-1} = 1\), so \(q_{k-1}=\frac{1}{k}\) and \(q_1 = 1-\frac{1}{k}\). Since DDML is set up so that each particular contribution starts in state 1, this completes the proof.
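
The absorption probability in Lemma 1 is easy to check numerically. The following sketch (illustrative, not from the paper) simulates the birth-death chain with \(p_i = \frac{(k-i)i}{k^2}\) starting from state 1 and verifies that it reaches state 0 with probability close to \(1-\frac{1}{k}\):

```python
import numpy as np

# Monte Carlo check of Lemma 1 (illustrative only): starting from state 1,
# the chain with p_i = (k - i) * i / k^2 is absorbed at 0 (the update is
# discarded) with probability 1 - 1/k.
rng = np.random.default_rng(1)
k, trials = 8, 50_000

absorbed_at_zero = 0
for _ in range(trials):
    i = 1                                  # a fresh update lives in one instance
    while 0 < i < k:
        p = (k - i) * i / k**2             # prob. of moving down; same prob. of moving up
        u = rng.random()
        if u < p:
            i -= 1                         # a copy of the update is overwritten
        elif u < 2 * p:
            i += 1                         # the update propagates to another instance
    absorbed_at_zero += (i == 0)

print(absorbed_at_zero / trials, 1 - 1 / k)  # both close to 0.875
```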

Theorem 2 With high probability, DDML guarantees a user \((\epsilon _T, \delta _T)\)-differential privacy against adversary III, where \(\epsilon _T = \frac{\epsilon }{\sqrt{2T}} \sqrt{\ln \left( \frac{1}{2\delta _T}\right) } \) and \(\delta _T\) is an arbitrarily small constant (typically chosen as O(1/number of users)). Here T is the number of updates made to the model instance between the time the user submitted their update and the time the adversary observes the instances. The statement holds for sufficiently large T.

Proof

We rely on the result from concurrent and independent work of [15] obtained in a different context to analyze the privacy amplification in this case. Specifically, their result states that for any contractive noisy process, privacy amplification is no worse than that for the identity contraction, which we analyze below.

The sum of T random variables drawn independently from the Laplace distribution with mean 0 will tend toward a normal distribution for sufficiently large T, by the Central Limit Theorem. In DDML’s case with Laplace noise, the variance of each random variable is \(\frac{8\gamma ^2}{\epsilon ^2}\), therefore, if we assume that the adversary observes the model instance after T updates to it, the variance of the noise added will be \(T \cdot \frac{8\gamma ^2}{\epsilon ^2}\). This corresponds to Gaussian with scale \(\sigma = \frac{2\sqrt{2T}\gamma }{\epsilon }\).

Lemma 1 from [19] states that for points in p-dimensional space that differ by at most w in \(\ell _2\), addition of noise drawn from \(N^p(0, \sigma ^2_T)\), where \(\sigma _T \ge w \frac{\sqrt{2\left( \ln \left( \frac{1}{2\delta _T}\right) +\epsilon _T\right) }}{\epsilon _T}\) and \(\delta _T < \frac{1}{2}\) ensures \((\epsilon _T, \delta _T)\) differential privacy. We use the result of Lemma 1 from [19], rather than the more commonly referenced result from Theorem A.1 of [11], because the latter result holds only for \(\epsilon _T \le 1\), which is not the privacy loss used in most practical applications.

We now ask: what approximate differential privacy guarantee does DDML achieve against adversary III? To answer this, we fix a desired level of \(\delta _T\) and use the approximation obtained from the Central Limit Theorem to solve for \(\epsilon _T\):

$$\begin{aligned} \frac{2\sqrt{2T}\gamma }{\epsilon } \ge w \frac{\sqrt{2\left( \ln \left( \frac{1}{2\delta _T}\right) +\epsilon _T\right) }}{\epsilon _T} \end{aligned}$$
$$\begin{aligned} T \cdot \frac{8\gamma ^2}{\epsilon ^2} \ge w^2 \frac{2\left( \ln \left( \frac{1}{2\delta _T}\right) +\epsilon _T\right) }{\epsilon ^2_T} \end{aligned}$$
$$\begin{aligned} T \cdot \frac{4\gamma ^2}{\epsilon ^2} \cdot \epsilon ^2_T - w^2 \epsilon _T - w^2 \ln \left( \frac{1}{2\delta _T}\right) \ge 0 \end{aligned}$$

Solving the quadratic inequality, we have:

$$\begin{aligned} D = w^4+4T \cdot \frac{4\gamma ^2}{\epsilon ^2} w^2 \ln \left( \frac{1}{2\delta _T}\right) \end{aligned}$$
$$\begin{aligned} \epsilon _T \ge \frac{w^2 + \sqrt{D}}{2T \cdot \frac{4\gamma ^2}{\epsilon ^2} } = \frac{\epsilon ^2 w^2}{8\gamma ^2T} \left[ 1 + \sqrt{1+16T\frac{\gamma ^2}{\epsilon ^2 w^2} \ln \left( \frac{1}{2\delta _T}\right) }\right] \end{aligned}$$

For large T, the additive term of 1 under the square root is negligible, so we have:

$$\begin{aligned} \epsilon _T \approx \frac{\epsilon ^2 w^2}{8\gamma ^2T} 4\sqrt{T} \frac{\gamma }{\epsilon w} \sqrt{\ln \left( \frac{1}{2\delta _T}\right) } = \frac{\epsilon w}{2\gamma \sqrt{T}} \sqrt{\ln \left( \frac{1}{2\delta _T}\right) } \end{aligned}$$

In DDML, \(w=\sqrt{2}\gamma \), therefore,

$$ \epsilon _T \approx \frac{\epsilon }{\sqrt{2T}} \sqrt{\ln \left( \frac{1}{2\delta _T}\right) } $$
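
For concreteness, the approximation can be compared with the exact root of the quadratic inequality above. The numbers below are chosen purely for illustration (they are not parameters from the paper); with \(w=\sqrt{2}\gamma\) substituted, the factor \(\gamma\) cancels:

```python
import math

# Illustrative comparison (values not from the paper) of the exact root of
# the quadratic inequality in Theorem 2's proof with the large-T
# approximation eps_T ~= eps / sqrt(2T) * sqrt(ln(1 / (2 delta_T))).
eps, T, delta_T = 2.0, 10_000, 1e-6
L = math.log(1.0 / (2.0 * delta_T))

# Exact positive root with w = sqrt(2) * gamma substituted (gamma cancels).
eps_T_exact = eps**2 / (4 * T) * (1 + math.sqrt(1 + 8 * T * L / eps**2))

# Large-T approximation from the theorem statement.
eps_T_approx = eps / math.sqrt(2 * T) * math.sqrt(L)

print(f"exact  = {eps_T_exact:.4f}")   # ~ 0.0513
print(f"approx = {eps_T_approx:.4f}")  # ~ 0.0512
```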

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Pihur, V. et al. (2022). Differentially-Private “Draw and Discard” Machine Learning: Training Distributed Model from Enormous Crowds. In: Dolev, S., Katz, J., Meisels, A. (eds) Cyber Security, Cryptology, and Machine Learning. CSCML 2022. Lecture Notes in Computer Science, vol 13301. Springer, Cham. https://doi.org/10.1007/978-3-031-07689-3_33

  • DOI: https://doi.org/10.1007/978-3-031-07689-3_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-07688-6

  • Online ISBN: 978-3-031-07689-3

  • eBook Packages: Computer Science, Computer Science (R0)
