Abstract
The setting of our problem is a distributed architecture facing an enormous user set, where events repeat and evolve over time, and the stream of events must be absorbed into a model: first into a local model on the client, then into the global model on the server, all while protecting user privacy. Naturally, we learn phenomena that happen in many places at once (such as malware spreading over smartphones, user responses to the operation and UX of an app, or other such events). To this end, we consider a configuration in which the learning server is built to handle a potentially high-frequency, high-volume environment in a naturally distributed fashion, while also attending to the statistical convergence and privacy properties of the setting. We propose a novel framework for privacy-preserving, client-distributed machine learning. It is designed to provide differential privacy guarantees in the local model of privacy while satisfying systems constraints: it relies on a high number of asynchronous client-server communications, requires little coordination among separate clients, and uses a communication model that is simple to implement and, in some settings (e.g., user-facing apps), already exists, while providing attractive model-learning properties.
We develop a generic randomized learning algorithm, "Draw and Discard," so named because it relies on random sampling and discarding of model instances for load distribution and scalability; this design also provides additional server-side privacy protection and improved model quality through averaging. The framework is general, and we show its applicability to Generalized Linear Models. We analyze the statistical stability and privacy guarantees provided by our approach against faults and against several types of adversaries, and we then present experimental results. Our framework (first reported in [28]) has been deployed experimentally in a real industrial setting. We view this work as an initial combination of machine learning and distributed systems, and we believe it opens numerous directions for further development.
Notes
1. This is only approximately correct, since in a high-throughput environment, another client request could have updated the same model in the meantime.
References
Abadi, M., et al.: Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS 2016), pp. 308–318 (2016). http://doi.acm.org/10.1145/2976749.2978318
Apple Differential Privacy Team: Learning with privacy at scale. Apple Mach. Learn. J. 1(8) (2017). https://machinelearning.apple.com/2017/12/06/learning-with-privacy-at-scale.html
Bassily, R., Nissim, K., Stemmer, U., Thakurta, A.G.: Practical locally private heavy hitters. In: Advances in Neural Information Processing Systems, pp. 2285–2293 (2017)
Bassily, R., Smith, A.: Local, private, efficient protocols for succinct histograms. In: Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pp. 127–135. ACM (2015)
Bun, M., Nelson, J., Stemmer, U.: Heavy hitters and the structure of local privacy. arXiv preprint arXiv:1711.04740 (2017)
Chaudhuri, K., Monteleoni, C.: Privacy-preserving logistic regression. In: Advances in Neural Information Processing Systems, pp. 289–296 (2009)
Delange, J.: Why using asynchronous communications? (2017). http://julien.gunnm.org/programming/linux/2017/04/15/comparison-sync-vs-async
Duchi, J.C., Jordan, M.I., Wainwright, M.J.: Local privacy and statistical minimax rates. In: 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pp. 429–438 (2014). https://doi.org/10.1109/FOCS.2013.53
Dwork, C.: A firm foundation for private data analysis. Commun. ACM 54(1), 86–95 (2011)
Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 265–284. Springer, Heidelberg (2006). https://doi.org/10.1007/11681878_14
Dwork, C., Roth, A.: The algorithmic foundations of differential privacy. Found. Trends® Theor. Comput. Sci. 9(3–4), 211–407 (2014)
Erlingsson, Ú., Pihur, V., Korolova, A.: RAPPOR: randomized aggregatable privacy-preserving ordinal response. In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS 2014), pp. 1054–1067 (2014)
European Association for Theoretical Computer Science: Gödel Prize (2017). https://eatcs.org/index.php/component/content/article/1-news/2450-2017-godel-prize
Fanti, G., Pihur, V., Erlingsson, Ú.: Building a RAPPOR with the unknown: privacy-preserving learning of associations and data dictionaries. Proc. Privacy Enhancing Technol. 3, 41–61 (2016)
Feldman, V., Mironov, I., Talwar, K., Thakurta, A.: Privacy amplification by iteration. ArXiv e-prints abs/1808.06651, August 2018
Google: Google Safe Browsing (2018). https://safebrowsing.google.com/
Greenberg, A.: Apple's differential privacy is about collecting your data - but not your data. Wired, 13 June 2016
Kairouz, P., Oh, S., Viswanath, P.: Extremal mechanisms for local differential privacy. In: Advances in Neural Information Processing Systems, pp. 2879–2887 (2014)
Kenthapadi, K., Korolova, A., Mironov, I., Mishra, N.: Privacy via the Johnson-Lindenstrauss transform. J. Privacy Confidentiality 5(1), 39–71 (2013)
Madden, M., Rainie, L.: Americans’ Attitudes About Privacy, Security and Surveillance. Pew Research Center (2015). http://www.pewinternet.org/2015/05/20/americans-attitudes-about-privacy-security-and-surveillance/
Madden, M., Rainie, L.: Privacy and Information Sharing. Pew Research Center (2016). http://www.pewinternet.org/2016/01/14/privacy-and-information-sharing/
McMahan, H.B., Moore, E., Ramage, D., Hampson, S., Agüera y Arcas, B.: Communication-efficient learning of deep networks from decentralized data. In: AISTATS (2017)
McMahan, H.B., Ramage, D., Talwar, K., Zhang, L.: Learning differentially private language models without losing accuracy. CoRR abs/1710.06963 (2017). arxiv:1710.06963
McSherry, F.: Deep learning and differential privacy (2017). https://github.com/frankmcsherry/blog/blob/master/posts/2017-10-27.md
Nelder, J.A., Wedderburn, R.W.M.: Generalized linear models. J. R. Stat. Soc. Ser. A (General) 135, 370–384 (1972)
Nissim, K., et al.: Differential privacy: a primer for a non-technical audience (Preliminary Version). Vanderbilt J. Entertainment Technol. Law (2018)
Papernot, N., Abadi, M., Erlingsson, Ú., Goodfellow, I., Talwar, K.: Semi-supervised knowledge transfer for deep learning from private training data. In: 5th International Conference on Learning Representations (2016)
Pihur, V.: Differentially-Private “Draw and Discard” Machine Learning (2018). https://doi.org/10.48550/ARXIV.1807.04369
Portnoy, E., Gebhart, G., Grant, S.: Facial recognition, differential privacy, and trade-offs in Apple's latest OS releases. EFF DeepLinks Blog (2016). www.eff.org/deeplinks/2016/09/facial-recognition-differential-privacy-and-trade-offs-apples-latest-os-releases
Shokri, R., Shmatikov, V.: Privacy-preserving deep learning. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, pp. 1310–1321 (2015)
Song, S., Chaudhuri, K., Sarwate, A.D.: Stochastic gradient descent with differentially private updates. In: Global Conference on Signal and Information Processing (GlobalSIP), pp. 245–248. IEEE (2013)
WWDC 2016. WWDC 2016 Keynote, June 2016. https://www.apple.com/apple-events/june-2016/
Appendix
A Variance Stabilization Proof and Other Proofs
Theorem 1 Let \(B = (B_1, \ldots , B_k)\) be a vector of k random variables (weights) with mean \(\mu \) and variance \(\frac{k}{2}\sigma ^2\). Selecting one of the weights at random, adding noise with mean 0 and variance \(\sigma ^2\), and writing the result back into a uniformly random position (with replacement) does not change the expected intra-model variance of B (i.e., the weights remain distributed with variance \(\frac{k}{2}\sigma ^2\)).
Proof
We use the Law of Total Variance three times, thinking of B as a mixture of mixtures. In the outer mixture, the expectation is taken over whether the same or a different model was updated (\(X \in \{j \rightarrow j, j \rightarrow j'\}\)). In the inner mixture, the expectation is taken over the partitioning of the k weights themselves (Z).
Total variance, therefore, is equal to \(V(B) = V(E(B|X)) + E(V(B|X))\), where each conditional piece is expanded again over \(Z\).
Because we add noise with mean 0, in either case, the mean of B does not change, so \(V(E(B|X)) = 0\).
Replacing the same weight partitions the k weights into two sets: the single updated weight and the remaining weights. The partition means do not change, and the variance increases due to the added noise in the first partition (mixture component). Thus, \(V(B | j \rightarrow j)\) is given as
Replacing a different weight partitions the weight space into subsets of size 2 and \(k-2\). Unlike in the first case, V(E(B|Z)) is non-zero, because a single weight is essentially replicated in the first partition (\(B_1\) has a mean of 0, but \(B_2\) has a mean of \(B_1\)). After the update, the overall mean of B under Z becomes
where \(\mu _1\) is the mean of the model selected and model replaced and has a distribution with mean \(\mu \) and variance \(\frac{k}{2}\sigma ^2\). Note that the mean of B over sampling in X is still 0 as \(E(B_1) = 0\) over X.
Then \(V(B|j \rightarrow j')\)
Note that the variance component must be computed with \(k-1\) and not k because of the finite nature of k in this case.
Putting it all together, the expected intra-model variance after the update remains \(\frac{k}{2}\sigma ^2\), which completes the proof. \(\square \)
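The stabilization claim can also be checked empirically. The sketch below is a minimal Monte Carlo check, not the paper's deployment code; the values of k and \(\sigma \) and the choice of Gaussian noise are illustrative assumptions. It draws k weights with variance \(\frac{k}{2}\sigma ^2\), applies one draw/add-noise/replace step, and compares the average sample variance before and after the update:

```python
import random
import statistics

def draw_and_discard_step(weights, sigma, rng):
    """One update: draw a random instance, add zero-mean noise with
    variance sigma^2, and write the result back into a uniformly
    random slot (possibly the same one)."""
    drawn = weights[rng.randrange(len(weights))] + rng.gauss(0.0, sigma)
    out = list(weights)
    out[rng.randrange(len(weights))] = drawn
    return out

rng = random.Random(0)
k, sigma = 5, 1.0
stable_var = k * sigma**2 / 2  # claimed stationary intra-model variance

before, after = [], []
for _ in range(20000):
    w = [rng.gauss(0.0, stable_var**0.5) for _ in range(k)]
    before.append(statistics.variance(w))
    after.append(statistics.variance(draw_and_discard_step(w, sigma, rng)))

mean_before = sum(before) / len(before)
mean_after = sum(after) / len(after)
```

Both averages come out close to \(\frac{k}{2}\sigma ^2 = 2.5\) here, consistent with the theorem.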
Lemma 1 In the long run, DDML discards a fraction \(1-\frac{1}{k}\) of updates.
Proof
Consider a Markov process on the states \(0, 1, \ldots , k\), where each state represents the number of model instances in which a particular update can be found. Denote by \(p_i\) the probability of moving from state i to \(i-1\), which equals the probability of moving to \(i+1\). In DDML, \(p_i = \frac{(k-i)i}{k^2}\) for \(1 \le i \le k-1\), and \(p_0 = 0, p_k = 1\).
Let \(q_i\) be the probability of eventually ending up at state 0 when starting in state i. By the setup of DDML, for \(1< i<k-1\) we have:
\(q_i = p_iq_{i-1} + p_iq_{i+1}+(1-2p_i)q_i,\) or \(2q_i = q_{i-1}+q_{i+1}\).
We also know that:
- \(q_0=1, p_0=0\), \(q_k=0, p_k=1\),
- \(q_1 = p_1q_0+p_1q_2+(1-2p_1)q_1\),
- \(q_{k-1} = p_{k-1}q_{k-2} + p_{k-1}q_k+(1-2p_{k-1})q_{k-1}\).
Summing the equations for \(1 \le i \le k-1\), we have \(2q_1 + \cdots + 2 q_{k-1} = q_0 + q_1 + 2(q_2+\cdots +q_{k-2})+q_{k-1}+q_{k}\). Simplifying: \(q_1+q_{k-1} = q_0+q_k\), or
$$\begin{aligned} q_1 + q_{k-1} = 1 \end{aligned}$$(5)
On the other hand,
- \(q_{k-2} = 2q_{k-1}\),
- \(q_{k-3} = 2q_{k-2} - q_{k-1} = 3q_{k-1}\),
- \(q_{k-4}=2q_{k-3}-q_{k-2} = 4q_{k-1}\),

and, continuing in this way,
$$\begin{aligned} q_1 = 2q_2-q_3 = (k-1)q_{k-1} \end{aligned}$$(6)
Combining (5) and (6), we have \((k-1)q_{k-1}+q_{k-1} = 1\), so \(q_{k-1}=\frac{1}{k}\) and \(q_1 = 1-\frac{1}{k}\). Since DDML is set up so that each particular contribution starts in state 1, this completes the proof.
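The absorption probability can be sanity-checked numerically. Since the up- and down-transition probabilities are both \(p_i\), the lazy steps do not affect where the walk is absorbed, so conditional on moving the chain is a symmetric random walk on \(\{0, \ldots , k\}\). The sketch below (the values of k and the trial count are illustrative assumptions) estimates the probability that an update starting in state 1 is eventually discarded:

```python
import random

def discard_probability(k, trials, rng):
    """Estimate the probability that an update starting in state 1
    is absorbed at 0 (discarded) rather than at k."""
    discarded = 0
    for _ in range(trials):
        state = 1
        while 0 < state < k:
            # Up and down moves both occur with probability p_i, so
            # conditional on moving the walk is symmetric; the lazy
            # (stay-in-place) steps can be skipped without changing
            # the absorption probabilities.
            state += rng.choice((-1, 1))
        discarded += (state == 0)
    return discarded / trials

rng = random.Random(0)
k = 5
estimate = discard_probability(k, 20000, rng)  # close to 1 - 1/k = 0.8
```

This is exactly the classical gambler's-ruin probability of hitting 0 from state 1, which equals \(1-\frac{1}{k}\).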
Theorem 2 With high probability, DDML guarantees a user \((\epsilon _T, \delta _T)\)-differential privacy against adversary III, where \(\epsilon _T = \frac{\epsilon }{\sqrt{2T}} \sqrt{\ln \left( \frac{1}{2\delta _T}\right) } \) and \(\delta _T\) is an arbitrarily small constant (typically chosen as O(1/number of users)). Here T is the number of updates made to the model instance between the time the user submitted their update and the time the adversary observes the instance. The statement holds for sufficiently large T.
Proof
We rely on a result from the concurrent and independent work of [15], obtained in a different context, to analyze the privacy amplification in our case. Specifically, their result states that for any contractive noisy process, privacy amplification is no worse than for the identity contraction, which we analyze below.
By the Central Limit Theorem, the sum of T random variables drawn independently from the Laplace distribution with mean 0 tends toward a normal distribution for sufficiently large T. In DDML's case with Laplace noise, the variance of each random variable is \(\frac{8\gamma ^2}{\epsilon ^2}\); therefore, if the adversary observes the model instance after T updates to it, the variance of the accumulated noise is \(T \cdot \frac{8\gamma ^2}{\epsilon ^2}\). This corresponds to a Gaussian with scale \(\sigma = \frac{2\sqrt{2T}\gamma }{\epsilon }\).
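To illustrate this step, the sketch below (all parameter values are illustrative assumptions, not the paper's deployment settings) samples sums of T Laplace draws with scale \(b = 2\gamma /\epsilon \), i.e., per-draw variance \(2b^2 = \frac{8\gamma ^2}{\epsilon ^2}\), and compares their empirical standard deviation with the predicted \(\sigma = \frac{2\sqrt{2T}\gamma }{\epsilon }\):

```python
import math
import random

def laplace(b, rng):
    # The difference of two independent Exp(mean b) draws is Laplace(0, b).
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)

rng = random.Random(0)
T, gamma, eps = 400, 1.0, 2.0
b = 2.0 * gamma / eps                          # per-update Laplace scale
predicted = 2.0 * math.sqrt(2.0 * T) * gamma / eps  # Gaussian scale after T updates

# Empirical standard deviation of sums of T Laplace draws.
sums = [sum(laplace(b, rng) for _ in range(T)) for _ in range(4000)]
mean = sum(sums) / len(sums)
empirical = math.sqrt(sum((s - mean) ** 2 for s in sums) / (len(sums) - 1))
```

With these values the predicted scale is \(\sqrt{800} \approx 28.3\), and the empirical standard deviation tracks it within a few percent.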
Lemma 1 from [19] states that for points in p-dimensional space that differ by at most w in \(\ell _2\), addition of noise drawn from \(N^p(0, \sigma ^2_T)\), where \(\sigma _T \ge w \frac{\sqrt{2\left( \ln \left( \frac{1}{2\delta _T}\right) +\epsilon _T\right) }}{\epsilon _T}\) and \(\delta _T < \frac{1}{2}\) ensures \((\epsilon _T, \delta _T)\) differential privacy. We use the result of Lemma 1 from [19], rather than the more commonly referenced result from Theorem A.1 of [11], because the latter result holds only for \(\epsilon _T \le 1\), which is not the privacy loss used in most practical applications.
We now ask: what approximate differential privacy guarantee does DDML achieve against adversary III? To answer this, fix a desired level of \(\delta _T\) and use the approximation obtained from the Central Limit Theorem to solve for \(\epsilon _T\).
Solving the quadratic inequality, we have:
For large T, the additive term of 1 under the square root is negligible, so we have:
In DDML, \(w=\sqrt{2}\gamma \); therefore, \(\epsilon _T = \frac{\epsilon }{\sqrt{2T}} \sqrt{\ln \left( \frac{1}{2\delta _T}\right) }\), as stated in Theorem 2.
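The resulting guarantee is easy to evaluate numerically. The helper below is a sketch of the large-T formula from Theorem 2; the function name and the parameter values in the example are illustrative assumptions:

```python
import math

def epsilon_after_T_updates(eps, T, delta_T):
    """Large-T approximation from Theorem 2:
    eps_T = eps / sqrt(2T) * sqrt(ln(1 / (2 * delta_T)))."""
    return eps / math.sqrt(2.0 * T) * math.sqrt(math.log(1.0 / (2.0 * delta_T)))

# Example: a per-update eps = 1 guarantee tightens as 1/sqrt(T).
e_50 = epsilon_after_T_updates(1.0, 50, 1e-6)     # roughly 0.36
e_5000 = epsilon_after_T_updates(1.0, 5000, 1e-6)  # 10x smaller
```

Increasing T by a factor of 100 shrinks \(\epsilon _T\) by a factor of 10, reflecting the \(\frac{1}{\sqrt{2T}}\) dependence.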
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Pihur, V. et al. (2022). Differentially-Private “Draw and Discard” Machine Learning: Training Distributed Model from Enormous Crowds. In: Dolev, S., Katz, J., Meisels, A. (eds) Cyber Security, Cryptology, and Machine Learning. CSCML 2022. Lecture Notes in Computer Science, vol 13301. Springer, Cham. https://doi.org/10.1007/978-3-031-07689-3_33
Print ISBN: 978-3-031-07688-6
Online ISBN: 978-3-031-07689-3