Abstract
Reinforcement Learning from Human Feedback (RLHF) enables agents to learn human preferences and act in accordance with them. The technique was originally developed for robot control tasks in which designing a reward function is difficult. However, during the collection of human feedback, malicious annotators can attack RLHF, posing a significant threat to the technique. Prior research has mostly studied the harm that such attacks cause when RLHF is used to fine-tune large language models (LLMs); our study instead focuses on robot control. We design two data poisoning attacks against human feedback datasets and evaluate them in three offline reinforcement learning (RL) environments. The experimental results show that RLHF is vulnerable to data poisoning attacks in robot control tasks.
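The abstract does not detail the two attacks. As a purely illustrative sketch (an assumption on our part, not the authors' method), the snippet below shows the general shape of a preference-label poisoning attack on a pairwise human-feedback dataset: an attacker who controls a fraction of the annotations flips the preferred/rejected ordering of trajectory pairs before the reward model is trained.

```python
# Illustrative sketch only: a generic preference-label-flipping poisoning attack.
# The function name, poison rate, and label encoding are assumptions for the example;
# the paper's actual attacks are not described on this page.
import numpy as np

def poison_preferences(labels: np.ndarray, poison_rate: float, seed: int = 0) -> np.ndarray:
    """Flip a random fraction of binary preference labels.

    labels: array of 0/1 preferences, where 1 means "segment A preferred over segment B".
    poison_rate: fraction of annotations the attacker controls (e.g. 0.1).
    """
    rng = np.random.default_rng(seed)
    poisoned = labels.copy()
    n_poison = int(poison_rate * len(labels))
    idx = rng.choice(len(labels), size=n_poison, replace=False)
    poisoned[idx] = 1 - poisoned[idx]  # invert the human judgement on the chosen pairs
    return poisoned

# Usage: 1000 preference pairs, attacker flips 10% of them before reward-model training.
clean = np.random.randint(0, 2, size=1000)
dirty = poison_preferences(clean, poison_rate=0.1)
print((clean != dirty).sum(), "labels flipped")
```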
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhou, Z., Gao, Y., Qi, M. (2025). Data Poisoning Attack Against Reinforcement Learning from Human Feedback in Robot Control Tasks. In: Zhu, T., Li, J., Castiglione, A. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2024. Lecture Notes in Computer Science, vol 15251. Springer, Singapore. https://doi.org/10.1007/978-981-96-1525-4_7
DOI: https://doi.org/10.1007/978-981-96-1525-4_7
Publisher Name: Springer, Singapore
Print ISBN: 978-981-96-1524-7
Online ISBN: 978-981-96-1525-4
eBook Packages: Computer Science, Computer Science (R0)