
Security of NVMe Offloaded Data in Large-Scale Machine Learning

  • Conference paper
Computer Security – ESORICS 2023 (ESORICS 2023)

Abstract

Large-scale machine learning (LSML) models, such as GPT-3.5, which powers the well-known ChatGPT chatbot, have revolutionized our perception of AI by enabling more natural, context-aware, and interactive experiences. Yet, training such large models nowadays requires multiple months of computation on expensive hardware, including GPUs, orchestrated by specialized software, so-called LSML frameworks. Due to the model size, neither the GPUs' on-device memory nor the host RAM can hold all parameters simultaneously during training. Therefore, LSML frameworks dynamically offload data to NVMe storage and reload it just in time.

In this paper, we investigate the security of NVMe offloaded data in LSML against poisoning attacks and present NVMevade, the first untargeted poisoning attack on NVMe offloads. NVMevade allows the attacker to reduce model performance as well as slow down or even stall the training process. For instance, we demonstrate that an attacker can achieve a stealthy 182% increase in training time, thus inflating the cost of model training. To address this vulnerability, we develop NVMensure, the first defense that guarantees the integrity and freshness of NVMe offloaded data in LSML. In a large-scale study, we demonstrate the robustness of NVMensure against poisoning attacks and explore the runtime-efficiency and security trade-offs it can provide. We tested 22 different NVMensure configurations and report an overhead between 9.8% and 64.2%, depending on the selected security level. We also note that NVMensure will be effective against targeted poisoning attacks, which do not exist yet but might be developed in the future.


Notes

  1. An NVIDIA A100 GPU [28] provides 80 GB of on-device memory.

  2. NVMe is an interface specification for PCIe-attached flash and SSD storage devices.

  3. We posit that the attacker has circumvented OS access controls, thus obtaining the user-level permissions necessary to access and manipulate NVMe files.

  4. The replay attack naturally also affects the model performance, but the effect was marginal and not recognizable in our experiments.

  5. For the stealthiness evaluation, we compared the execution log (console output) with and without the attack. We found only differing timestamps and minimal changes in the loss values, which is normal across runs.

  6. In our experiments, training then stopped. Real-world deployments can certainly roll back to a checkpoint and continue training automatically.

  7. Since LSML models are normally trained on publicly available data that can be scrutinized by experts, dataset privacy is not a concern.

References

  1. Aumasson, J.-P., Neves, S., Wilcox-O’Hearn, Z., Winnerlein, C.: BLAKE2: simpler, smaller, fast as MD5. In: Jacobson, M., Locasto, M., Mohassel, P., Safavi-Naini, R. (eds.) ACNS 2013. LNCS, vol. 7954, pp. 119–135. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38980-1_8

  2. Bagdasaryan, E., Shmatikov, V.: Blind backdoors in deep learning models. In: USENIX Security (2021)

  3. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)

  4. Bojar, O., et al.: Findings of the 2016 conference on machine translation. In: Proceedings of the First Conference on Machine Translation (2016)

  5. Brown, T., et al.: Language models are few-shot learners. In: NeurIPS (2020)

  6. Bubeck, S., et al.: Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712 (2023)

  7. Chen, H., Fu, C., Zhao, J., Koushanfar, F.: ProFlip: targeted trojan attack with progressive bit flips. In: IEEE/CVF ICCV (2021)

  8. Chen, X., Liu, C., Li, B., Lu, K., Song, D.: Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. arXiv preprint arXiv:1712.05526 (2017)

  9. Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. In: ICML (2008)

  10. Costan, V., Devadas, S.: Intel SGX explained. Cryptology ePrint Archive (2016)

  11. El Merabet, H., Hajraoui, A.: A survey of malware detection techniques based on machine learning. IJACSA (2019)

  12. Fan, B., Andersen, D.G., Kaminsky, M., Mitzenmacher, M.D.: Cuckoo filter: practically better than bloom. In: CoNEXT (2014)

  13. Gallagher, P., Director, A.: Secure Hash Standard (SHS). FIPS PUB (1995)

  14. Goldblum, M., et al.: Dataset security for machine learning: data poisoning, backdoor attacks, and defenses. IEEE PAMI 45(2), 1563–1580 (2022)

  15. Graves, A., Mohamed, A.R., Hinton, G.: Speech recognition with deep recurrent neural networks. In: ICASSP (2013)

  16. Guo, D., Liu, Y., Li, X., Yang, P.: False negative problem of counting bloom filter. IEEE Trans. Knowl. Data Eng. 22(5), 651–664 (2010)

  17. Hilal, W., Gadsden, S.A., Yawney, J.: Financial fraud: a review of anomaly detection techniques and recent advances. Expert Syst. Appl. 193, 116429 (2022)

  18. International Organization for Standardization: Information processing — Use of longitudinal parity to detect errors in information messages. ISO Standard ISO 1155, ISO (2001)

  19. Jagielski, M., Oprea, A., Biggio, B., Liu, C., Nita-Rotaru, C., Li, B.: Manipulating machine learning: poisoning attacks and countermeasures for regression learning. In: IEEE S&P (2018)

  20. Jang, I., Tang, A., Kim, T., Sethumadhavan, S., Huh, J.: Heterogeneous isolated execution for commodity GPUs. In: ASPLOS (2019)

  21. Kinney, S.L.: Trusted Platform Module Basics: Using TPM in Embedded Systems. Elsevier (2006)

  22. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NeurIPS (2017)

  23. Le Quoc, D., Gregor, F., Singh, J., Fetzer, C.: SGX-PySpark: secure distributed data analytics. In: WWW (2019)

  24. Mechanics, M.: What runs ChatGPT? Inside Microsoft’s AI supercomputer | Featuring Mark Russinovich (2023). https://youtu.be/Rk3nTUfRZmo

  25. Microsoft Research: Turing NLG: A 17 Billion Parameter Language Model by Microsoft (2021). https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

  26. Mutlu, O., Kim, J.S.: RowHammer: a retrospective. IEEE TCAD 39(8), 1555–1571 (2020)

  27. Narayanan, D., et al.: PipeDream: generalized pipeline parallelism for DNN training. In: ACM SOSP (2019)

  28. Nvidia: A100 GPU (2023). https://www.nvidia.com/en-us/data-center/a100/

  29. Nvidia: DGX Systems (2023). https://www.nvidia.com/de-de/data-center/dgx-systems/

  30. OpenAI: Chatgpt (2023). https://openai.com/research/chatgpt

  31. OpenAI: GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023)

  32. Orenbach, M., Lifshits, P., Minkin, M., Silberstein, M.: Eleos: ExitLess OS services for SGX enclaves. In: EuroSys (2017)

  33. Ozga, W., Quoc, D.L., Fetzer, C.: Perun: Secure Multi-Stakeholder Machine Learning Framework with GPU Support. arXiv preprint arXiv:2103.16898 (2021)

  34. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS (2019)

  35. Paudice, A., Muñoz-González, L., Gyorgy, A., Lupu, E.C.: Detection of Adversarial Training Examples in Poisoning Attacks through Anomaly Detection. arXiv preprint arXiv:1802.03041 (2018)

  36. Peri, N., et al.: Deep k-NN defense against clean-label data poisoning attacks. In: Bartoli, A., Fusiello, A. (eds.) ECCV 2020. LNCS, vol. 12535, pp. 55–70. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-66415-2_4

  37. Peterson, W.W., Brown, D.T.: Cyclic codes for error detection. In: Proceedings of the IRE (1961)

  38. Post, M.: A call for clarity in reporting BLEU scores. In: Proceedings of the Third Conference on Machine Translation: Research Papers (2018)

  39. Quoc, D.L., Gregor, F., Arnautov, S., Kunkel, R., Bhatotia, P., Fetzer, C.: SecureTF: a secure TensorFlow framework. In: ACM/IFIP Middleware (2020)

  40. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training. In: OpenAI (2018)

  41. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog (2019)

  42. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR 21(1), 5485–5551 (2020)

  43. Rajbhandari, S., et al.: DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. arXiv preprint arXiv:2201.05596 (2022)

  44. Rajbhandari, S., Rasley, J., Ruwase, O., He, Y.: ZeRO: memory optimizations toward training trillion parameter models. In: SC 2020 (2020)

  45. Rajbhandari, S., Ruwase, O., Rasley, J., Smith, S., He, Y.: ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. arXiv preprint arXiv:2104.07857 (2021)

  46. Rakin, A.S., He, Z., Fan, D.: TBT: targeted neural network attack with bit trojan. In: IEEE/CVF CVPR (2020)

  47. Rakin, A.S., He, Z., Li, J., Yao, F., Chakrabarti, C., Fan, D.: T-BFA: targeted bit-flip adversarial weight attack. IEEE PAMI 44(11), 7928–7939 (2022)

  48. Rasley, J., Rajbhandari, S., Ruwase, O., He, Y.: DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters. In: SIGKDD (2020)

  49. Ren, J., et al.: ZeRO-offload: democratizing billion-scale model training. In: USENIX ATC (2021)

  50. Rivest, R.: The MD5 Message-Digest Algorithm. IETF (1992)

  51. Saha, A., Subramanya, A., Pirsiavash, H.: Hidden trigger backdoor attacks. In: AAAI (2020)

  52. Sarwate, D.V.: Computation of cyclic redundancy checks via table look-up. Commun. ACM 31(8), 1008–1013 (1988)

  53. Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., Catanzaro, B.: Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053 (2020)

  54. Smith, S., et al.: Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model (2022)

  55. Tian, Z., Cui, L., Liang, J., Yu, S.: A comprehensive survey on poisoning attacks and countermeasures in machine learning. ACM Comput. Surv. 55(8), 1–35 (2022)

  56. Tian, Z., Cui, L., Liang, J., Yu, S.: A comprehensive survey on poisoning attacks and countermeasures in machine learning. ACM CSUR (2022)

  57. Tramèr, F., Boneh, D.: Slalom: fast, verifiable and private execution of neural networks in trusted hardware. In: ICLR (2018)

  58. Tsai, C.C., Porter, D.E., Vij, M.: Graphene-SGX: a practical library OS for unmodified applications on SGX. In: USENIX ATC (2017)

  59. Volos, S., Vaswani, K., Bruno, R.: Graviton: trusted execution environments on GPUs. In: USENIX OSDI (2018)

  60. Xia, G., Chen, J., Yu, C., Ma, J.: Poisoning attacks in federated learning: a survey. IEEE Access 11, 10708–10722 (2023)

  61. Xiao, H., Biggio, B., Brown, G., Fumera, G., Eckert, C., Roli, F.: Is feature selection secure against training data poisoning? PMLR (2015)

  62. Yang, C., Wu, Q., Li, H., Chen, Y.: Generative Poisoning Attack Method Against Neural Networks. arXiv preprint arXiv:1703.01340 (2017)

  63. Yao, F., Rakin, A.S., Fan, D.: DeepHammer: depleting the intelligence of deep neural networks through targeted chain of bit flips. In: USENIX Security (2020)

  64. Zhu, J., et al.: Enabling rack-scale confidential computing using heterogeneous trusted execution environment. IEEE S&P (2020)

Acknowledgment

We thank the Private AI Collaborative Research Institute, which is co-sponsored by Intel Labs (www.private-ai.org), for partially supporting this research.

Author information

Correspondence to Torsten Krauß.

A Instantiation for DeepSpeed

Below, we provide implementation details of our approaches for DeepSpeed [48].

A.1 NVMevade

Table 6 depicts the parameters of NVMevade for attack modes 1 and 2. Further, our scanning process confirmed that DeepSpeed offloads three types of data: model parameters, optimizer states, and gradients. In the following, we provide additional information on the different attack modes.

Table 6. NVMevade's parameters for bit-flips in attack modes 1 and 2.

Modes 1 and 2. During the poisoning process via bit-flips, we simultaneously attacked all offloaded data. The most impactful perturbations were made in intermediate training parameters, namely gradients and optimizer states, leading to immediate gradient overflows. In benign training, DeepSpeed scales gradients using a scaling factor to address gradient underflows. In NVMevade's implanted overflow scenario, however, the current training step is skipped, the scaling factor is halved, and training continues. When the scaling factor reaches a configured minimum, the entire DeepSpeed process is interrupted with an exception, which typically occurs within a few steps of malicious overflows. When adjusting the parameters for mode 2, the objective is to minimize the occurrence of overflows while still ensuring a significant level of detrimental modifications, thereby preventing the training process from terminating abruptly.
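To make the dynamic loss-scaling behavior described above concrete, the following minimal Python sketch mimics it. The class name, parameters, and threshold values are illustrative assumptions and do not mirror DeepSpeed's actual implementation.

```python
class DynamicLossScaler:
    """Illustrative stand-in for dynamic loss scaling (assumed names, not DeepSpeed's API)."""

    def __init__(self, init_scale: float = 2.0 ** 16, min_scale: float = 1.0):
        self.scale = init_scale
        self.min_scale = min_scale

    def step(self, overflow_detected: bool) -> bool:
        """Return True if the optimizer update of this step should be applied."""
        if not overflow_detected:
            return True                 # benign step: apply the update
        self.scale /= 2                 # overflow: skip the step and halve the scale
        if self.scale < self.min_scale:
            # Repeated overflows (e.g., induced by bit-flips in offloaded gradients)
            # eventually push the scale below its minimum and abort training.
            raise RuntimeError("Loss scale fell below its minimum; training aborted.")
        return False                    # skip this training step


scaler = DynamicLossScaler(init_scale=4.0, min_scale=1.0)
for overflowed in [False, True, True, True, False]:
    try:
        if scaler.step(overflowed):
            pass  # here the optimizer update would be applied
    except RuntimeError as err:
        print(err)
        break
```

With only a few consecutive overflows, the scale drops below its minimum and the loop aborts, which corresponds to the few-step failure behavior observed in mode 1.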

Mode 3. Since DeepSpeed implements a basic sanity check on file sizes, adjusting the file size might be necessary if a newly offloaded file with the same name but a different size appears during replay attacks. The adjustment can be done via Python's truncate function to reduce the size or via zero-padding to increase it. However, we did not encounter this scenario in our experiments.
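A minimal sketch of the generic size adjustment mentioned above is shown below; the helper name and file path are hypothetical and only illustrate the truncate/zero-pad logic.

```python
import os


def match_file_size(path: str, target_size: int) -> None:
    """Shrink (truncate) or grow (zero-pad) a file to the given size.

    Hypothetical helper illustrating the adjustment described above.
    """
    current = os.path.getsize(path)
    with open(path, "r+b") as f:
        if current > target_size:
            f.truncate(target_size)                      # reduce the file size
        elif current < target_size:
            f.seek(0, os.SEEK_END)
            f.write(b"\x00" * (target_size - current))   # zero-pad to increase it
```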

A.2 NVMensure

NVMensure is implemented in C++ within the DeepNVMe [45] library. The runtime of the integrity mechanisms varies depending on the implementation. MD5 and BLAKE2bp employ C++ reference implementations, while SHA-256 uses a highly efficient kernel implementation accessed through the kernel's crypto API. We implemented CRC-32 [52] and LRC [18] ourselves in C++.
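As a rough illustration of the verify-on-reload principle behind NVMensure (the actual defense lives in C++ inside DeepNVMe), the following Python sketch records a digest at offload time and checks it before the data is reloaded. The function names and the in-memory digest store are assumptions; hashlib's blake2b stands in for BLAKE2bp, and the sketch covers only integrity, not the freshness guarantees of the full design.

```python
import hashlib

_digests: dict[str, str] = {}   # offload path -> digest recorded at offload time


def record_offload(path: str, data: bytes, algo: str = "blake2b") -> None:
    """Write an offload buffer to NVMe and remember its digest."""
    _digests[path] = hashlib.new(algo, data).hexdigest()
    with open(path, "wb") as f:
        f.write(data)


def verified_reload(path: str, algo: str = "blake2b") -> bytes:
    """Reload an offload buffer and fail if it was tampered with."""
    with open(path, "rb") as f:
        data = f.read()
    if hashlib.new(algo, data).hexdigest() != _digests.get(path):
        raise ValueError(f"Integrity check failed for {path}")
    return data
```

Swapping the digest function, e.g., to hashlib.md5, hashlib.sha256, or zlib.crc32 for CRC-32, would correspond to the different security/runtime trade-offs evaluated in the paper.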


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Krauß, T., Götz, R., Dmitrienko, A. (2024). Security of NVMe Offloaded Data in Large-Scale Machine Learning. In: Tsudik, G., Conti, M., Liang, K., Smaragdakis, G. (eds) Computer Security – ESORICS 2023. ESORICS 2023. Lecture Notes in Computer Science, vol 14347. Springer, Cham. https://doi.org/10.1007/978-3-031-51482-1_8

  • DOI: https://doi.org/10.1007/978-3-031-51482-1_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-51481-4

  • Online ISBN: 978-3-031-51482-1

  • eBook Packages: Computer Science (R0)
