Abstract
In the era of Large Language Models (LLMs), developers establish content review conditions to comply with legal, policy, and societal requirements, aiming to prevent the generation of sensitive or restricted content out of concerns such as public security, privacy, and criminal misuse. However, persistent attempts by attackers and security researchers to bypass these content security measures have led to the emergence of various jailbreak techniques, including role-playing, adversarial suffixes, and encryption, among others.
This paper presents a novel black-box LLM jailbreak framework called IntentObfuscator, designed to obscure the true intention of user prompts and thereby elicit restricted content during generation. Two instantiations of the framework, Obscure Intention and Create Ambiguity, are presented along with their implementation. Experimental results demonstrate the effectiveness of the proposed method, which substantially strengthens "Red Team" attack strategies against LLM content security mechanisms.
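To make the black-box setting concrete, the sketch below outlines how a framework of this kind might iterate over obfuscation strategies against a target model. It is a minimal structural illustration only: the function names (obfuscate, query_model, is_refusal), the strategy labels, and the keyword-based refusal check are assumptions for exposition, not the paper's actual templates or evaluation procedure.

```python
# Minimal sketch of a black-box red-teaming loop in the spirit of IntentObfuscator.
# All names and the refusal heuristic below are illustrative assumptions.
from typing import Callable, Dict


def is_refusal(response: str) -> bool:
    """Crude refusal check; a real harness would use a trained classifier."""
    return any(kw in response.lower() for kw in ("i can't", "i cannot", "sorry"))


def red_team_loop(
    query: str,
    obfuscate: Callable[[str, str], str],   # rewrites the prompt under a given strategy
    query_model: Callable[[str], str],      # black-box call to the target LLM
    strategies=("obscure_intention", "create_ambiguity"),
) -> Dict[str, dict]:
    """Try each obfuscation strategy in turn and record whether the model refused."""
    results: Dict[str, dict] = {}
    for strategy in strategies:
        prompt = obfuscate(query, strategy)
        response = query_model(prompt)
        bypassed = not is_refusal(response)
        results[strategy] = {"bypassed": bypassed, "response": response}
        if bypassed:
            break  # stop once one obfuscation succeeds
    return results
```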
Acknowledgement
This research was supported by the National Natural Science Foundation of China under Grant No. 62202466 and by the Youth Innovation Promotion Association CAS under Grant No. 2022159. It was also supported by the Key Laboratory of Network Assessment Technology, Chinese Academy of Sciences, and the Beijing Key Laboratory of Network Security and Protection Technology.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Shang, S. et al. (2024). IntentObfuscator: A Jailbreaking Method via Confusing LLM with Prompts. In: Garcia-Alfaro, J., Kozik, R., Choraś, M., Katsikas, S. (eds) Computer Security – ESORICS 2024. ESORICS 2024. Lecture Notes in Computer Science, vol 14985. Springer, Cham. https://doi.org/10.1007/978-3-031-70903-6_8
DOI: https://doi.org/10.1007/978-3-031-70903-6_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70902-9
Online ISBN: 978-3-031-70903-6
eBook Packages: Computer Science; Computer Science (R0)