
Data Poisoning Attack Against Reinforcement Learning from Human Feedback in Robot Control Tasks

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2024)

Abstract

Reinforcement Learning from Human Feedback (RLHF) enables agents to learn human preferences and act in line with human intent. The technique was originally developed for robot control tasks in which a reward function is difficult to design. However, during the collection of human feedback, malicious annotators can attack RLHF, posing a significant challenge to the technology. Previous research has mostly studied the harm such attacks cause when RLHF is used to fine-tune large language models (LLMs), whereas our study focuses on robot control. We design two data poisoning attacks against human feedback datasets and implement them in three offline reinforcement learning (RL) environments. The experimental results show that RLHF is vulnerable to data poisoning attacks in robot control tasks.
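
The abstract does not detail the two attacks themselves. As a rough, hedged illustration of why poisoning a human-feedback dataset threatens RLHF, the NumPy sketch below trains a Bradley-Terry style reward model on synthetic trajectory-segment pairs and flips a growing fraction of the preference labels. The synthetic features, the linear reward model, and the label-flipping mechanism are assumptions made purely for illustration, not the attacks or robot-control environments evaluated in the paper.

```python
# Minimal sketch (NumPy only): how flipping human preference labels can corrupt
# a Bradley-Terry style reward model of the kind RLHF relies on. The synthetic
# features, linear reward model, and label-flipping attack are illustrative
# assumptions, not the attacks or environments studied in the paper.
import numpy as np

rng = np.random.default_rng(0)

DIM, N_PAIRS = 8, 2000
w_true = rng.normal(size=DIM)                 # ground-truth reward direction
seg_a = rng.normal(size=(N_PAIRS, DIM))       # features of segment A in each pair
seg_b = rng.normal(size=(N_PAIRS, DIM))       # features of segment B in each pair

# Clean annotator labels: 1 if segment A is preferred over B under the true reward.
labels = (seg_a @ w_true > seg_b @ w_true).astype(float)


def poison_labels(y: np.ndarray, rate: float) -> np.ndarray:
    """Flip a fraction `rate` of preference labels (one simple poisoning attack)."""
    y = y.copy()
    idx = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    y[idx] = 1.0 - y[idx]
    return y


def fit_reward_model(a: np.ndarray, b: np.ndarray, y: np.ndarray,
                     lr: float = 0.1, epochs: int = 300) -> np.ndarray:
    """Fit a linear reward r(s) = w @ s by gradient descent on the Bradley-Terry loss."""
    w = np.zeros(a.shape[1])
    for _ in range(epochs):
        logits = (a - b) @ w                   # r(A) - r(B)
        p = 1.0 / (1.0 + np.exp(-logits))      # predicted P(A preferred over B)
        grad = (a - b).T @ (p - y) / len(y)    # gradient of the negative log-likelihood
        w -= lr * grad
    return w


for rate in (0.0, 0.1, 0.3, 0.5):
    w_hat = fit_reward_model(seg_a, seg_b, poison_labels(labels, rate))
    cos = w_hat @ w_true / (np.linalg.norm(w_hat) * np.linalg.norm(w_true) + 1e-12)
    print(f"poison rate {rate:.1f}: cosine(learned reward, true reward) = {cos:+.2f}")
```

As the flip rate grows, the learned reward direction drifts away from the true one, and any policy optimized against the corrupted reward inherits that error; the paper's experiments examine this effect in three actual offline RL environments.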



Author information

Corresponding author

Correspondence to Zihui Zhou.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Zhou, Z., Gao, Y., Qi, M. (2025). Data Poisoning Attack Against Reinforcement Learning from Human Feedback in Robot Control Tasks. In: Zhu, T., Li, J., Castiglione, A. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2024. Lecture Notes in Computer Science, vol 15251. Springer, Singapore. https://doi.org/10.1007/978-981-96-1525-4_7


  • DOI: https://doi.org/10.1007/978-981-96-1525-4_7


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-96-1524-7

  • Online ISBN: 978-981-96-1525-4

  • eBook Packages: Computer Science, Computer Science (R0)
