Abstract
In this paper we present new attacks against federated learning when used to train natural language text models. We illustrate the effectiveness of the attacks against the next word prediction model used in Google’s GBoard app, a widely used mobile keyboard app that has been an early adopter of federated learning for production use. We demonstrate that the words a user types on their mobile handset, e.g. when sending text messages, can be recovered with high accuracy under a wide range of conditions and that counter-measures such as the use of mini-batches and adding local noise are ineffective. We also show that the word order (and so the actual sentences typed) can be reconstructed with high fidelity. This raises obvious privacy concerns, particularly since GBoard is in production use.
M. Suliman—Now at IBM Research Europe - Dublin.
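To make the threat model concrete, below is a minimal sketch, not the specific attack presented in this paper, of one generic way a single client update can reveal which words were typed: in an embedding layer, only the rows for tokens that appear in the local training text receive non-zero gradients, so diffing the locally updated weights against the global model exposes those tokens. The toy architecture, vocabulary size, learning rate and token ids are illustrative assumptions only.

```python
# A hedged illustration (assumed toy model, not GBoard's architecture) of how the
# difference between a client's updated weights and the original global model can
# reveal which words were typed: only embedding rows for tokens present in the
# local text receive non-zero gradients, so only those rows move during training.
import torch
import torch.nn as nn

VOCAB, EMBED_DIM, CONTEXT = 10_000, 96, 3     # illustrative sizes, not GBoard's
torch.manual_seed(0)

model = nn.Sequential(
    nn.Embedding(VOCAB, EMBED_DIM),
    nn.Flatten(),
    nn.Linear(CONTEXT * EMBED_DIM, VOCAB),
)

typed = torch.tensor([[17, 256, 4023]])       # hypothetical token ids the user typed
target = torch.tensor([99])                   # hypothetical next word

global_weights = model[0].weight.detach().clone()
opt = torch.optim.SGD(model.parameters(), lr=0.001)
opt.zero_grad()
nn.functional.cross_entropy(model(typed), target).backward()
opt.step()

# The federated update is (local weights - global weights); the rows that moved
# identify exactly which vocabulary items occurred in the client's text.
delta = model[0].weight.detach() - global_weights
leaked_tokens = torch.nonzero(delta.abs().sum(dim=1) > 0).flatten().tolist()
print(leaked_tokens)                          # -> [17, 256, 4023]
```

This is only the simplest form of leakage; the attacks in the paper recover the words typed, and their order, even when mini-batches or local noise are used.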
Notes
1. DP aims to protect the aggregate training data/model against query-based attacks, whereas our attack targets the individual updates. Nevertheless, we note that DP is sometimes suggested as a potential defence against the type of attack carried out here.
2. Google’s Secure Aggregation approach [5] is a prominent example of an approach requiring trust in the server, or more specifically in the PKI infrastructure, which in practice is operated by the same organisation that runs the FL server since it involves authentication/verification of clients. We note also that Secure Aggregation is not currently deployed in the GBoard app despite being proposed 6 years ago.
3. It is perhaps worth noting that we studied a variety of reconstruction attacks, e.g., using Monte Carlo Tree Search to perform a smart search over all word sequences, but found the attack method described here to be simple, efficient and highly effective.
4. Note that in DPSGD the added noise is multiplied by the learning rate \(\eta \), and so this factor needs to be taken into account when comparing the \(\sigma \) values used for DPSGD with those used for single noise addition. Added noise with standard deviation \(\sigma \) in DPSGD corresponds roughly to a standard deviation of \(\eta \sqrt{EB}\sigma \) with single noise addition. For \(\eta =0.001\), \(E=1000\), \(B=32\), \(\sigma =0.1\), the corresponding single noise addition standard deviation is 0.018.
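As a quick check of the arithmetic in note 4, the sketch below reproduces the stated conversion using only the values given there.

```python
# Reproduces the arithmetic in note 4: per-step DPSGD noise of std sigma is scaled
# by the learning rate eta and accumulates over E*B independent steps, so it is
# roughly equivalent to a single noise addition with std eta * sqrt(E*B) * sigma.
import math

eta, E, B, sigma = 0.001, 1000, 32, 0.1
equivalent_std = eta * math.sqrt(E * B) * sigma
print(round(equivalent_std, 3))   # -> 0.018
```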
References
Gboard – the Google Keyboard (2022). https://play.google.com/store/apps/details?id=com.google.android.inputmethod.latin. Accessed 24 Oct 2022
Abadi, M., et al.: Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (2016)
Ball, J.: NSA collects millions of text messages daily in ‘untargeted’ global sweep (2014)
Boenisch, F., Dziedzic, A., Schuster, R., Shamsabadi, A.S., Shumailov, I., Papernot, N.: When the curious abandon honesty: Federated learning is not private. arXiv preprint arXiv:2112.02918 (2021)
Bonawitz, K., et al.: Practical secure aggregation for federated learning on user-held data. arXiv preprint arXiv:1611.04482 (2016)
Carlini, N., Liu, C., Erlingsson, Ú., Kos, J., Song, D.: The secret sharer: evaluating and testing unintended memorization in neural networks. In: Proceedings of the 28th USENIX Conference on Security Symposium, SEC 2019, USA, pp. 267–284. USENIX Association (2019)
Carlini, N., et al.: Extracting training data from large language models. In: 30th USENIX Security Symposium (USENIX Security 2021), pp. 2633–2650 (2021)
Deng, J., et al.: TAG: gradient attack on transformer-based language models. arXiv preprint arXiv:2103.06819 (2021)
Geiping, J., Bauermeister, H., Dröge, H., Moeller, M.: Inverting gradients - how easy is it to break privacy in federated learning? In: Advances in Neural Information Processing Systems (2020)
Greff, K., Srivastava, R.K., Koutník, J., Steunebrink, B.R., Schmidhuber, J.: LSTM: a search space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 28(10), 2222–2232 (2017)
Hard, A., et al.: Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604 (2018)
Jin, X., Chen, P.-Y., Hsu, C.-Y., Yu, C.M., Chen, T.: Catastrophic data leakage in vertical federated learning. In: Advances in Neural Information Processing Systems, vol. 34 (2021)
Kairouz, P., McMahan, B., Song, S., Thakkar, O., Thakurta, A., Xu, Z.: Practical and private (deep) learning without sampling or shuffling. arXiv preprint arXiv:2103.00039 (2021)
Leith, D.J.: Mobile handset privacy: measuring the data iOS and Android send to Apple and Google. In: Proceedings of SecureComm (2021)
Leith, D.J., Farrell, S.: Contact tracing app privacy: what data is shared by Europe’s GAEN contact tracing apps. In: Proceedings of IEEE INFOCOM (2021)
Marzal, A., Vidal, E.: Computation of normalized edit distance and applications. IEEE Trans. Pattern Anal. Mach. Intell. 15(9), 926–932 (1993)
McMahan, B., Moore, E., Ramage, D., Hampson, S., y Arcas, B.A.: Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics (2017)
McMahan, H.B., Ramage, D., Talwar, K., Zhang, L.: Learning differentially private recurrent language models. In: International Conference on Learning Representations (2018)
O’Day, D.R., Calix, R.A.: Text message corpus: applying natural language processing to mobile device forensics. In: 2013 IEEE International Conference on Multimedia and Expo Workshops (ICMEW) (2013)
Pan, X., Zhang, M., Yan, Y., Zhu, J., Yang, M.: Theory-oriented deep leakage from gradients via linear equation solver. arXiv preprint arXiv:2010.13356 (2020)
Pasquini, D., Francati, D., Ateniese, G.: Eluding secure aggregation in federated learning via model inconsistency. arXiv preprint arXiv:2111.07380 (2021)
Press, O., Wolf, L.: Using the output embedding to improve language models. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 157–163. Association for Computational Linguistics (2017)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems (2017)
Wang, Y., et al.: SAPAG: a self-adaptive privacy attack from gradients. arXiv preprint arXiv:2009.06228 (2020)
Yin, H., Mallya, A., Vahdat, A., Alvarez, J.M., Kautz, J., Molchanov, P.: See through gradients: image batch recovery via GradInversion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16337–16346 (2021)
Zhao, B., Mopuri, K.R., Bilen, H.: iDLG: improved deep leakage from gradients. arXiv preprint arXiv:2001.02610 (2020)
Zhu, J., Blaschko, M.: R-GAP: recursive gradient attack on privacy. arXiv preprint arXiv:2010.07733 (2020)
Zhu, L., Liu, Z., Han, S.: Deep leakage from gradients. In: Annual Conference on Neural Information Processing Systems (NeurIPS) (2019)
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Suliman, M., Leith, D. (2024). Two Models are Better Than One: Federated Learning is Not Private for Google GBoard Next Word Prediction. In: Tsudik, G., Conti, M., Liang, K., Smaragdakis, G. (eds) Computer Security – ESORICS 2023. ESORICS 2023. Lecture Notes in Computer Science, vol 14347. Springer, Cham. https://doi.org/10.1007/978-3-031-51482-1_6