Abstract
In the era of Large Language Models (LLMs), developers establish content review conditions to comply with legal, policy, and societal requirements, aiming to prevent the generation of sensitive or restricted content out of concerns such as public security, privacy, and criminal misuse. However, persistent attempts by attackers and security researchers to bypass these content security measures have led to the emergence of various jailbreak techniques, including role-playing, adversarial suffixes, and encryption, among others.
This paper presents a novel black-box LLM jailbreak framework called IntentObfuscator, designed to obscure the true intention of user prompts and thereby elicit restricted content during generation. Two instantiations of the framework, Obscure Intention and Create Ambiguity, are presented along with their implementation. Experimental results demonstrate the effectiveness of the proposed method, which substantially strengthens "Red Team" attack strategies against LLM content security mechanisms.
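To make the black-box setting concrete, the sketch below outlines how a framework of this kind might iterate over obfuscation strategies against a target model. It is a minimal structural illustration only: the function names (obfuscate, query_model, is_refusal), the strategy labels, and the keyword-based refusal check are assumptions for exposition, not the paper's actual templates or evaluation procedure.

```python
# Minimal sketch of a black-box red-teaming loop in the spirit of IntentObfuscator.
# All names and the refusal heuristic below are illustrative assumptions.
from typing import Callable, Dict


def is_refusal(response: str) -> bool:
    """Crude refusal check; a real harness would use a trained classifier."""
    return any(kw in response.lower() for kw in ("i can't", "i cannot", "sorry"))


def red_team_loop(
    query: str,
    obfuscate: Callable[[str, str], str],   # rewrites the prompt under a given strategy
    query_model: Callable[[str], str],      # black-box call to the target LLM
    strategies=("obscure_intention", "create_ambiguity"),
) -> Dict[str, dict]:
    """Try each obfuscation strategy in turn and record whether the model refused."""
    results: Dict[str, dict] = {}
    for strategy in strategies:
        prompt = obfuscate(query, strategy)
        response = query_model(prompt)
        bypassed = not is_refusal(response)
        results[strategy] = {"bypassed": bypassed, "response": response}
        if bypassed:
            break  # stop once one obfuscation succeeds
    return results
```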
Acknowledgement
This research was supported by the National Natural Science Foundation of China under Grant No. 62202466 and by the Youth Innovation Promotion Association CAS under Grant No. 2022159. It was also supported by the Key Laboratory of Network Assessment Technology, Chinese Academy of Sciences, and the Beijing Key Laboratory of Network Security and Protection Technology.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Shang, S. et al. (2024). IntentObfuscator: A Jailbreaking Method via Confusing LLM with Prompts. In: Garcia-Alfaro, J., Kozik, R., Choraś, M., Katsikas, S. (eds) Computer Security – ESORICS 2024. ESORICS 2024. Lecture Notes in Computer Science, vol 14985. Springer, Cham. https://doi.org/10.1007/978-3-031-70903-6_8
DOI: https://doi.org/10.1007/978-3-031-70903-6_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70902-9
Online ISBN: 978-3-031-70903-6
eBook Packages: Computer Science; Computer Science (R0)