Abstract
Large-scale machine learning (LSML) models, such as GPT-3.5, which powers the well-known ChatGPT chatbot, have revolutionized our perception of AI by enabling more natural, context-aware, and interactive experiences. Yet, training such large models requires months of computation on expensive hardware, including GPUs, orchestrated by specialized software, so-called LSML frameworks. Due to the model size, neither the on-device memory of the GPUs nor the host RAM can hold all parameters simultaneously during training. Therefore, LSML frameworks dynamically offload data to NVMe storage and reload it just in time.
In this paper, we investigate the security of NVMe-offloaded data in LSML against poisoning attacks and present NVMevade, the first untargeted poisoning attack on NVMe offloads. NVMevade allows an attacker to degrade model performance as well as slow down or even stall the training process. For instance, we demonstrate that an attacker can stealthily increase training time by 182%, thus inflating the cost of model training. To address this vulnerability, we develop NVMensure, the first defense that guarantees the integrity and freshness of NVMe-offloaded data in LSML. In a large-scale study, we demonstrate the robustness of NVMensure against poisoning attacks and explore the runtime-efficiency and security trade-offs it can provide. We tested 22 different NVMensure configurations and report an overhead between 9.8% and 64.2%, depending on the selected security level. We also expect NVMensure to be effective against targeted poisoning attacks, which do not exist yet but might be developed in the future.
Notes
- 1.
An NVIDIA A100 GPU [28] provides 80 GB of on-device memory.
- 2.
NVMe is an interface specification for PCIe attached Flash and SSD storage devices.
- 3.
We posit that the attacker has effectively circumvented OS access controls, thus obtaining necessary user-level permissions to access and manipulate NVMe files.
- 4.
The replay attack naturally also affects the model performance, but the effect was marginal and not noticeable in our experiments.
- 5.
For the stealthiness evaluation, we compared the execution log (console output) with and without the attack. We found only differing timestamps and minimal changes in loss values, both of which are normal across runs.
- 6.
In our experiments, the training then stopped. In real-world scenarios, however, the training can roll back to a previous checkpoint and continue automatically.
- 7.
Since LSML models are normally trained on publicly available data that can be scrutinized by experts, dataset privacy is not a concern.
References
Aumasson, J.-P., Neves, S., Wilcox-O’Hearn, Z., Winnerlein, C.: BLAKE2: simpler, smaller, fast as MD5. In: Jacobson, M., Locasto, M., Mohassel, P., Safavi-Naini, R. (eds.) ACNS 2013. LNCS, vol. 7954, pp. 119–135. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38980-1_8
Bagdasaryan, E., Shmatikov, V.: Blind backdoors in deep learning models. In: USENIX Security (2021)
Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
Bojar, O., et al.: Findings of the 2016 conference on machine translation. In: Proceedings of the First Conference on Machine Translation (2016)
Brown, T., et al.: Language models are few-shot learners. In: NeurIPS (2020)
Bubeck, S., et al.: Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712 (2023)
Chen, H., Fu, C., Zhao, J., Koushanfar, F.: ProFlip: targeted trojan attack with progressive bit flips. In: IEEE/CVF ICCV (2021)
Chen, X., Liu, C., Li, B., Lu, K., Song, D.: Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. arXiv preprint arXiv:1712.05526 (2017)
Collobert, R., Weston, J.: A unified architecture for natural language processing: deep neural networks with multitask learning. In: ICML (2008)
Costan, V., Devadas, S.: Intel SGX explained. Cryptology ePrint Archive (2016)
El Merabet, H., Hajraoui, A.: A survey of malware detection techniques based on machine learning. IJACSA (2019)
Fan, B., Andersen, D.G., Kaminsky, M., Mitzenmacher, M.D.: Cuckoo filter: practically better than bloom. In: CoNEXT (2014)
National Institute of Standards and Technology: Secure Hash Standard (SHS). FIPS PUB (1995)
Goldblum, M., et al.: Dataset security for machine learning: data poisoning, backdoor attacks, and defenses. IEEE PAMI 45(2), 1563–1580 (2022)
Graves, A., Mohamed, A.R., Hinton, G.: Speech recognition with deep recurrent neural networks. In: ICASSP (2013)
Guo, D., Liu, Y., Li, X., Yang, P.: False negative problem of counting bloom filter. IEEE Trans. Knowl. Data Eng. 22(5), 651–664 (2010)
Hilal, W., Gadsden, S.A., Yawney, J.: Financial fraud: a review of anomaly detection techniques and recent advances. Expert Syst. Appl. 193, 116429 (2022)
International Organization for Standardization: Information processing — Use of longitudinal parity to detect errors in information messages. ISO Standard ISO 1155, ISO (2001)
Jagielski, M., Oprea, A., Biggio, B., Liu, C., Nita-Rotaru, C., Li, B.: Manipulating machine learning: poisoning attacks and countermeasures for regression learning. In: IEEE S&P (2018)
Jang, I., Tang, A., Kim, T., Sethumadhavan, S., Huh, J.: Heterogeneous isolated execution for commodity GPUs. In: ASPLOS (2019)
Kinney, S.L.: Trusted Platform Module Basics: Using TPM in Embedded Systems. Elsevier (2006)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NeurIPS (2017)
Le Quoc, D., Gregor, F., Singh, J., Fetzer, C.: SGX-PySpark: secure distributed data analytics. In: WWW (2019)
Microsoft Mechanics: What runs ChatGPT? Inside Microsoft’s AI supercomputer | Featuring Mark Russinovich (2023). https://youtu.be/Rk3nTUfRZmo
Microsoft Research: Turing NLG: A 17 Billion Parameter Language Model by Microsoft (2021). https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
Mutlu, O., Kim, J.S.: RowHammer: a retrospective. IEEE TCAD 39(8), 1555–1571 (2020)
Narayanan, D., et al.: PipeDream: generalized pipeline parallelism for DNN training. In: ACM SOSP (2019)
Nvidia: A100 GPU (2023). https://www.nvidia.com/en-us/data-center/a100/
Nvidia: DGX Systems (2023). https://www.nvidia.com/de-de/data-center/dgx-systems/
OpenAI: ChatGPT (2023). https://openai.com/research/chatgpt
OpenAI: GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023)
Orenbach, M., Lifshits, P., Minkin, M., Silberstein, M.: Eleos: ExitLess OS services for SGX enclaves. In: EuroSys (2017)
Ozga, W., Quoc, D.L., Fetzer, C.: Perun: Secure Multi-Stakeholder Machine Learning Framework with GPU Support. arXiv preprint arXiv:2103.16898 (2021)
Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS (2019)
Paudice, A., Muñoz-González, L., Gyorgy, A., Lupu, E.C.: Detection of Adversarial Training Examples in Poisoning Attacks through Anomaly Detection. arXiv preprint arXiv:1802.03041 (2018)
Peri, N., et al.: Deep k-NN defense against clean-label data poisoning attacks. In: Bartoli, A., Fusiello, A. (eds.) ECCV 2020. LNCS, vol. 12535, pp. 55–70. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-66415-2_4
Peterson, W.W., Brown, D.T.: Cyclic codes for error detection. In: Proceedings of the IRE (1961)
Post, M.: A call for clarity in reporting BLEU scores. In: Proceedings of the Third Conference on Machine Translation: Research Papers (2018)
Quoc, D.L., Gregor, F., Arnautov, S., Kunkel, R., Bhatotia, P., Fetzer, C.: SecureTF: a secure TensorFlow framework. In: ACM/IFIP Middleware (2020)
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training. In: OpenAI (2018)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog (2019)
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR 21(1), 5485–5551 (2020)
Rajbhandari, S., et al.: DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale. arXiv preprint arXiv:2201.05596 (2022)
Rajbhandari, S., Rasley, J., Ruwase, O., He, Y.: ZeRO: memory optimizations toward training trillion parameter models. In: SC 2020 (2020)
Rajbhandari, S., Ruwase, O., Rasley, J., Smith, S., He, Y.: ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. arXiv preprint arXiv:2104.07857 (2021)
Rakin, A.S., He, Z., Fan, D.: TBT: targeted neural network attack with bit trojan. In: IEEE/CVF CVPR (2020)
Rakin, A.S., He, Z., Li, J., Yao, F., Chakrabarti, C., Fan, D.: T-BFA: targeted bit-flip adversarial weight attack. IEEE PAMI 44(11), 7928–7939 (2022)
Rasley, J., Rajbhandari, S., Ruwase, O., He, Y.: DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters. In: SIGKDD (2020)
Ren, J., et al.: ZeRO-offload: democratizing billion-scale model training. In: USENIX ATC (2021)
Rivest, R.: The MD5 Message-Digest Algorithm. IETF (1992)
Saha, A., Subramanya, A., Pirsiavash, H.: Hidden trigger backdoor attacks. In: AAAI (2020)
Sarwate, D.V.: Computation of cyclic redundancy checks via table look-up. Commun. ACM 31(8), 1008–1013 (1988)
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., Catanzaro, B.: Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv preprint arXiv:1909.08053 (2020)
Smith, S., et al.: Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model (2022)
Tian, Z., Cui, L., Liang, J., Yu, S.: A comprehensive survey on poisoning attacks and countermeasures in machine learning. ACM Comput. Surv. 55(8), 1–35 (2022)
Tramèr, F., Boneh, D.: Slalom: fast, verifiable and private execution of neural networks in trusted hardware. In: ICLR (2018)
Tsai, C.C., Porter, D.E., Vij, M.: Graphene-SGX: a practical library OS for unmodified applications on SGX. In: USENIX ATC (2017)
Volos, S., Vaswani, K., Bruno, R.: Graviton: trusted execution environments on GPUs. In: USENIX OSDI (2018)
Xia, G., Chen, J., Yu, C., Ma, J.: Poisoning attacks in federated learning: a survey. IEEE Access 11, 10708–10722 (2023)
Xiao, H., Biggio, B., Brown, G., Fumera, G., Eckert, C., Roli, F.: Is feature selection secure against training data poisoning? PMLR (2015)
Yang, C., Wu, Q., Li, H., Chen, Y.: Generative Poisoning Attack Method Against Neural Networks. arXiv preprint arXiv:1703.01340 (2017)
Yao, F., Rakin, A.S., Fan, D.: DeepHammer: depleting the intelligence of deep neural networks through targeted chain of bit flips. In: USENIX Security (2020)
Zhu, J., et al.: Enabling rack-scale confidential computing using heterogeneous trusted execution environment. IEEE S&P (2020)
Acknowledgment
We thank the Private AI Collaborative Research Institute, which is co-sponsored by Intel Labs (www.private-ai.org), for partially supporting this research.
A Instantiation for DeepSpeed
Below, we provide implementation details of our approaches for DeepSpeed [48].
A.1 NVMevade
Table 6 depicts the parameters of NVMevade for attack modes 1 and 2. Further, our scanning process confirmed that DeepSpeed offloads three types of data: model parameters, optimizer states, and gradients. In the following, we provide additional information for the different attack modes.
Modes 1 and 2. During the poisoning process via bit-flips, we simultaneously attacked all offloaded data. The most impactful perturbations were made in intermediate training parameters, namely gradients and optimizer states, leading to immediate gradient overflows. In benign training, DeepSpeed scales gradients using a scaling factor to address gradient underflows. However, in the overflow scenario implanted by NVMevade, the current training step is skipped and the scaling factor is halved, stalling training progress. Once the scaling factor reaches a configured minimum, the entire DeepSpeed process aborts with an exception, which typically occurs within a few steps of malicious overflows. When tuning the parameters for mode 2, the objective is to minimize the occurrence of overflows while still ensuring a significant level of detrimental modifications, thereby preventing the training process from terminating abruptly.
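The loss-scaling dynamics described above can be sketched as follows. This is an illustrative Python model, not DeepSpeed's actual implementation; the class name, initial scale, and minimum scale are our own choices for demonstration.

```python
# Illustrative sketch of dynamic loss scaling: on a gradient overflow the
# step is skipped and the scale is halved; once the scale falls below a
# configured minimum, training aborts with an exception.

class LossScaler:
    def __init__(self, init_scale: float = 2 ** 16, min_scale: float = 1.0):
        self.scale = init_scale
        self.min_scale = min_scale
        self.skipped_steps = 0

    def step(self, grads_overflowed: bool) -> bool:
        """Return True if the optimizer step should be applied."""
        if grads_overflowed:
            self.skipped_steps += 1
            self.scale /= 2  # halve the scaling factor
            if self.scale < self.min_scale:
                raise RuntimeError("loss scale below minimum: training aborted")
            return False  # skip this training step
        return True

# With poisoned gradients, every step overflows, so the scale decays
# geometrically and the process aborts within a few steps.
scaler = LossScaler(init_scale=2 ** 4)
steps_until_abort = 0
try:
    while True:
        scaler.step(grads_overflowed=True)
        steps_until_abort += 1
except RuntimeError:
    pass
```

Starting from a scale of 16 with a minimum of 1, four overflowing steps are tolerated (scale 8, 4, 2, 1) before the fifth overflow drops the scale below the minimum and aborts, matching the "within a few steps" behavior observed in the attack.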
Mode 3. Because DeepSpeed performs a basic sanity check on file sizes, a replayed file may need its size adjusted if a newly offloaded file with the same name but a different size appears during the replay attack. The file can be shrunk with Python’s truncate function or grown by zero-padding. However, our experiments did not encounter this scenario.
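The size adjustment just described could be sketched as follows; the helper name `match_file_size` is hypothetical and not part of NVMevade's actual code.

```python
import os
import tempfile

def match_file_size(path: str, expected_size: int) -> None:
    """Make the file at `path` exactly `expected_size` bytes long:
    shrink it via truncate, or grow it by appending zero bytes."""
    actual = os.path.getsize(path)
    if actual > expected_size:
        os.truncate(path, expected_size)           # shrink
    elif actual < expected_size:
        with open(path, "ab") as f:                # zero-pad to grow
            f.write(b"\x00" * (expected_size - actual))

# Usage: adjust a (stand-in) replayed offload file to an expected size.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"replayed-tensor-bytes")  # 21 bytes
    path = tmp.name
match_file_size(path, 16)  # shrink to 16 bytes
match_file_size(path, 32)  # then zero-pad up to 32 bytes
```

Truncation discards trailing bytes, while zero-padding leaves the original prefix intact, which is why this adjustment can slip past a check that only compares sizes.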
A.2 NVMensure
NVMensure is implemented in C++ within the DeepNVMe [45] library. The runtime of the integrity mechanisms varies with the implementation: MD5 and BLAKE2bp use C++ reference implementations, while SHA-256 uses a highly efficient kernel implementation accessed through the kernel’s crypto API. We implemented CRC-32 [52] and LRC [18] ourselves in C++.
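For illustration, the two classes of checksums named above can be sketched in Python; this is not the C++ NVMensure code. `hashlib.blake2b` stands in for the BLAKE2bp variant used in the implementation, and the LRC is the byte-wise XOR parity of the block.

```python
import hashlib

def blake2b_digest(data: bytes) -> bytes:
    """Cryptographic digest of an offloaded block (stand-in for BLAKE2bp)."""
    return hashlib.blake2b(data).digest()

def lrc(data: bytes) -> int:
    """Longitudinal redundancy check: XOR parity over all bytes."""
    parity = 0
    for b in data:
        parity ^= b
    return parity

# Record the digest at offload time, verify it at reload time.
block = b"offloaded optimizer state"
stored = blake2b_digest(block)
assert blake2b_digest(block) == stored       # intact block verifies
tampered = b"x" + block[1:]
assert blake2b_digest(tampered) != stored    # modification is detected
```

The trade-off mirrors the overhead range reported in the paper: an XOR parity is nearly free but detects only simple corruption, whereas a cryptographic digest resists deliberate tampering at a higher per-block cost.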
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Krauß, T., Götz, R., Dmitrienko, A. (2024). Security of NVMe Offloaded Data in Large-Scale Machine Learning. In: Tsudik, G., Conti, M., Liang, K., Smaragdakis, G. (eds) Computer Security – ESORICS 2023. ESORICS 2023. Lecture Notes in Computer Science, vol 14347. Springer, Cham. https://doi.org/10.1007/978-3-031-51482-1_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-51481-4
Online ISBN: 978-3-031-51482-1
eBook Packages: Computer Science (R0)