
Data Poisoning Attack Against Reinforcement Learning from Human Feedback in Robot Control Tasks

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2024)

Abstract

Reinforcement Learning from Human Feedback (RLHF) enables agents to learn human preferences and act in line with human intent. The technique was originally developed for robot control tasks in which a reward function is difficult to design. However, during the collection of human feedback, malicious annotators can attack RLHF, posing a significant challenge to the technology. Previous research has mostly studied the harm such attacks cause when RLHF is used to fine-tune large language models (LLMs), whereas our study focuses on robot control. We design two data poisoning attacks against human feedback datasets and implement them in three offline reinforcement learning (RL) environments. The experimental results show that RLHF is vulnerable to data poisoning attacks in robot control tasks.
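
The abstract does not detail the two attacks themselves. As a rough, hedged illustration of why poisoning a human-feedback dataset threatens RLHF, the NumPy sketch below trains a Bradley-Terry style reward model on synthetic trajectory-segment pairs and flips a growing fraction of the preference labels. The synthetic features, the linear reward model, and the label-flipping mechanism are assumptions made purely for illustration, not the attacks or robot-control environments evaluated in the paper.

```python
# Minimal sketch (NumPy only): how flipping human preference labels can corrupt
# a Bradley-Terry style reward model of the kind RLHF relies on. The synthetic
# features, linear reward model, and label-flipping attack are illustrative
# assumptions, not the attacks or environments studied in the paper.
import numpy as np

rng = np.random.default_rng(0)

DIM, N_PAIRS = 8, 2000
w_true = rng.normal(size=DIM)                 # ground-truth reward direction
seg_a = rng.normal(size=(N_PAIRS, DIM))       # features of segment A in each pair
seg_b = rng.normal(size=(N_PAIRS, DIM))       # features of segment B in each pair

# Clean annotator labels: 1 if segment A is preferred over B under the true reward.
labels = (seg_a @ w_true > seg_b @ w_true).astype(float)


def poison_labels(y: np.ndarray, rate: float) -> np.ndarray:
    """Flip a fraction `rate` of preference labels (one simple poisoning attack)."""
    y = y.copy()
    idx = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    y[idx] = 1.0 - y[idx]
    return y


def fit_reward_model(a: np.ndarray, b: np.ndarray, y: np.ndarray,
                     lr: float = 0.1, epochs: int = 300) -> np.ndarray:
    """Fit a linear reward r(s) = w @ s by gradient descent on the Bradley-Terry loss."""
    w = np.zeros(a.shape[1])
    for _ in range(epochs):
        logits = (a - b) @ w                   # r(A) - r(B)
        p = 1.0 / (1.0 + np.exp(-logits))      # predicted P(A preferred over B)
        grad = (a - b).T @ (p - y) / len(y)    # gradient of the negative log-likelihood
        w -= lr * grad
    return w


for rate in (0.0, 0.1, 0.3, 0.5):
    w_hat = fit_reward_model(seg_a, seg_b, poison_labels(labels, rate))
    cos = w_hat @ w_true / (np.linalg.norm(w_hat) * np.linalg.norm(w_true) + 1e-12)
    print(f"poison rate {rate:.1f}: cosine(learned reward, true reward) = {cos:+.2f}")
```

As the flip rate grows, the learned reward direction drifts away from the true one, and any policy optimized against the corrupted reward inherits that error; the paper's experiments examine this effect in three actual offline RL environments.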



Author information

Corresponding author

Correspondence to Zihui Zhou.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Zhou, Z., Gao, Y., Qi, M. (2025). Data Poisoning Attack Against Reinforcement Learning from Human Feedback in Robot Control Tasks. In: Zhu, T., Li, J., Castiglione, A. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2024. Lecture Notes in Computer Science, vol 15251. Springer, Singapore. https://doi.org/10.1007/978-981-96-1525-4_7


  • DOI: https://doi.org/10.1007/978-981-96-1525-4_7


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-96-1524-7

  • Online ISBN: 978-981-96-1525-4

  • eBook Packages: Computer Science, Computer Science (R0)
