Skip to main content

IntentObfuscator: A Jailbreaking Method via Confusing LLM with Prompts

  • Conference paper
  • First Online:
Computer Security – ESORICS 2024 (ESORICS 2024)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14985))

Included in the following conference series:

Abstract

In the era of Large Language Models (LLMs), developers establish content review conditions to comply with legal, policy, and societal requirements, aiming to prevent the generation of sensitive or restricted content due to considerations like social security, privacy, and criminal justice. However, persistent attempts by attackers and security researchers to bypass content security measures have led to the emergence of various jailbreak technologies, including role-playing, adversarial suffixes, encryption, and more.

This paper presents a novel LLM black-box jailbreak framework called IntentObfuscator, designed to obscure the true intention of user prompts and thereby elicit restricted content during content generation. Two examples, namely Obscure Intention and Create Ambiguity, are presented within this framework, outlining the implementation method. Experimental results highlight the effectiveness of the proposed method, which significantly improves the attack strategy against LLM content security mechanisms, referred to as the “Red Team” attack.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

Notes

  1. 1.

    https://github.com/llm-attacks/llm-attacks/blob/main/data/advbench/harmful_behaviors.csv.

References

  1. Abcarter: What is the size of the training set for gpt-3 (2023). https://community.openai.com/t/what-is-the-size-of-the-training-set-for-gpt-3/360896/1

  2. AlexalBERT: Hypothetical response. https://www.jailbreakchat.com/prompt/b1fe938b-4541-41c8-96e7-b1c659ec4ef9

  3. Bai, J., et al.: Qwen Technical report. arXiv preprint arXiv:2309.16609 (2023)

  4. Beckerich, M., Plein, L., Coronado, S.: RatGPT: turning online LLMs into proxies for malware attacks. arXiv preprint arXiv:2308.09183 (2023)

  5. Cao, Q., Kojima, T., Matsuo, Y., Iwasawa, Y.: Unnatural error correction: GPT-4 can almost perfectly handle unnatural scrambled text (2023)

    Google Scholar 

  6. Chen, B., Wang, G., Guo, H., Wang, Y., Yan, Q.: Understanding multi-turn toxic behaviors in open-domain chatbots. In: Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses, pp. 282–296 (2023)

    Google Scholar 

  7. Chen, Y., Arunasalam, A., Celik, Z.B.: Can large language models provide security & privacy advice? Measuring the ability of LLMs to refute misconceptions. In: Proceedings of the 39th Annual Computer Security Applications Conference, pp. 366–378 (2023)

    Google Scholar 

  8. Chin, Z.Y., Jiang, C.M., Huang, C.C., Chen, P.Y., Chiu, W.C.: Prompting4debugging: red-teaming text-to-image diffusion models by finding problematic prompts. arXiv preprint arXiv:2309.06135 (2023)

  9. Chu, J., Liu, Y., Yang, Z., Shen, X., Backes, M., Zhang, Y.: Comprehensive assessment of jailbreak attacks against LLMs. arXiv preprint arXiv:2402.05668 (2024)

  10. Deng, B., Wang, W., Feng, F., Deng, Y., Wang, Q., He, X.: Attack prompt generation for red teaming and defending large language models. arXiv preprint arXiv:2310.12505 (2023)

  11. Deng, G., et al.: Masterkey: automated jailbreaking of large language model chatbots. In: Proceedings of ISOC NDSS (2024)

    Google Scholar 

  12. Ghafouri, V., Agarwal, V., Zhang, Y., Sastry, N., Such, J., Suarez-Tangil, G.: AI in the gray: exploring moderation policies in dialogic large language models vs. human answers in controversial topics. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 556–565 (2023)

    Google Scholar 

  13. Google, Inc: Google’s secure AI framework (SAIF) (2023). https://safety.google/cybersecurity-advancements/saif/t

  14. Gupta, M., Akiri, C., Aryal, K., Parker, E., Praharaj, L.: From ChatGPT to threatGPT: impact of generative AI in cybersecurity and privacy. IEEE Access (2023)

    Google Scholar 

  15. Hazell, J.: Large language models can be used to effectively scale spear phishing campaigns. arXiv preprint arXiv:2305.06972 (2023)

  16. Jiang, S., Chen, X., Tang, R.: Prompt packer: deceiving LLMs through compositional instruction with hidden attacks (2023)

    Google Scholar 

  17. Kang, D., Li, X., Stoica, I., Guestrin, C., Zaharia, M., Hashimoto, T.: Exploiting programmatic behavior of LLMs: dual-use through standard security attacks. arXiv preprint arXiv:2302.05733 (2023)

  18. Li, H., Guo, D., Fan, W., Xu, M., Song, Y.: Multi-step jailbreaking privacy attacks on ChatGPT. arXiv preprint arXiv:2304.05197 (2023)

  19. Liu, Y., et al.: Jailbreaking ChatGPT via prompt engineering: an empirical study (2023)

    Google Scholar 

  20. OpenAI, Inc: Content policy (2022). https://labs.openai.com/policies/content-policy

  21. OpenAI, Inc: GPT-3.5 Turbo. https://platform.openai.com/docs/models/gpt-3-5

  22. OpenAI, Inc: GPT-4 and GPT-4 Turbo. https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo

  23. Qi, X., Huang, K., Panda, A., Henderson, P., Wang, M., Mittal, P.: Visual adversarial examples jailbreak aligned large language models (2023)

    Google Scholar 

  24. Reynolds, L., McDonell, K.: Prompt programming for large language models: beyond the few-shot paradigm. In: Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–7 (2021)

    Google Scholar 

  25. Shanahan, M., McDonell, K., Reynolds, L.: Role play with large language models. Nature, 1–6 (2023)

    Google Scholar 

  26. Shen, X., Chen, Z., Backes, M., Shen, Y., Zhang, Y.: “do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825 (2023)

  27. Si, W.M., et al.: Why so toxic? Measuring and triggering toxic behavior in open-domain chatbots. In: Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pp. 2659–2673 (2022)

    Google Scholar 

  28. Wolf, Y., Wies, N., Levine, Y., Shashua, A.: Fundamental limitations of alignment in large language models. arXiv preprint arXiv:2304.11082 (2023)

  29. Yang, A., et al.: Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305 (2023)

  30. Yao, D., Zhang, J., Harris, I.G., Carlsson, M.: FuzzLLM: a novel and universal fuzzing framework for proactively discovering jailbreak vulnerabilities in large language models. arXiv preprint arXiv:2309.05274 (2023)

  31. Yu, J., Lin, X., Xing, X.: GPTfuzzer: red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253 (2023)

  32. Zamfirescu-Pereira, J., Wong, R.Y., Hartmann, B., Yang, Q.: Why Johnny can’t prompt: how non-AI experts try (and fail) to design LLM prompts. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–21 (2023)

    Google Scholar 

  33. Zhang, M., Pan, X., Yang, M.: Jade: a linguistics-based safety evaluation platform for LLM (2023)

    Google Scholar 

  34. Zou, A., Wang, Z., Kolter, J.Z., Fredrikson, M.: Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023)

Download references

Acknowledgement

This research is supported by the National Natural Science Foundation of China under Grant No. 62202466 and Youth Innovation Promotion Association CAS under Grant No. 2022159. This research was also supported by Key Laboratory of Network Assessment Technology, Chinese Academy of Sciences, and Beijing Key Laboratory of Network Security and Protection Technology.

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Zhongjiang Yao or Yepeng Yao .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Shang, S. et al. (2024). IntentObfuscator: A Jailbreaking Method via Confusing LLM with Prompts. In: Garcia-Alfaro, J., Kozik, R., Choraś, M., Katsikas, S. (eds) Computer Security – ESORICS 2024. ESORICS 2024. Lecture Notes in Computer Science, vol 14985. Springer, Cham. https://doi.org/10.1007/978-3-031-70903-6_8

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-70903-6_8

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70902-9

  • Online ISBN: 978-3-031-70903-6

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics