This is the official repository for the ACM CCS 2024 paper "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models, by Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang.
In this project, employing our new framework JailbreakHub, we conduct the first measurement study on jailbreak prompts in the wild, with 15,140 prompts collected from December 2022 to December 2023 (including 1,405 jailbreak prompts).
Check out our website here.
Disclaimer. This repo contains examples of harmful language. Reader discretion is recommended. This repo is intended for research purposes only. Any misuse is strictly prohibited.
Overall, we collect 15,140 prompts from four platforms (Reddit, Discord, websites, and open-source datasets) from December 2022 to December 2023. Among these prompts, we identify 1,405 jailbreak prompts. To the best of our knowledge, this dataset is the largest collection of in-the-wild jailbreak prompts.
The data are provided here.
Statistics of our data source. (Adv) UA refers to (adversarial) user accounts.
Platform | Source | # Posts | # UA | # Adv UA | # Prompts | # Jailbreaks | Prompt Time Range |
---|---|---|---|---|---|---|---|
Reddit | r/ChatGPT | 163,549 | 147 | 147 | 176 | 176 | 2023.02-2023.11 |
Reddit | r/ChatGPTPromptGenius | 3,536 | 305 | 21 | 654 | 24 | 2022.12-2023.11 |
Reddit | r/ChatGPTJailbreak | 1,602 | 183 | 183 | 225 | 225 | 2023.02-2023.11 |
Discord | ChatGPT | 609 | 259 | 106 | 544 | 214 | 2023.02-2023.12 |
Discord | ChatGPT Prompt Engineering | 321 | 96 | 37 | 278 | 67 | 2022.12-2023.12 |
Discord | Spreadsheet Warriors | 71 | 3 | 3 | 61 | 61 | 2022.12-2023.09 |
Discord | AI Prompt Sharing | 25 | 19 | 13 | 24 | 17 | 2023.03-2023.04 |
Discord | LLM Promptwriting | 184 | 64 | 41 | 167 | 78 | 2023.03-2023.12 |
Discord | BreakGPT | 36 | 10 | 10 | 32 | 32 | 2023.04-2023.09 |
Website | AIPRM | - | 2,777 | 23 | 3,930 | 25 | 2023.01-2023.06 |
Website | FlowGPT | - | 3,505 | 254 | 8,754 | 405 | 2022.12-2023.12 |
Website | JailbreakChat | - | - | - | 79 | 79 | 2023.02-2023.05 |
Dataset | AwesomeChatGPTPrompts | - | - | - | 166 | 2 | - |
Dataset | OCR-Prompts | - | - | - | 50 | 0 | - |
Total | | 169,933 | 7,308 | 803 | 15,140 | 1,405 | 2022.12-2023.12 |
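
The totals in the table can be cross-checked with a short script. The sketch below is illustrative only and is not part of the repository: it copies the `# Prompts` and `# Jailbreaks` columns from the table by hand (the names `ROWS` and `summarize` are made up for this example), sums them per platform, and prints the share of prompts that are jailbreaks.

```python
# Rows copied by hand from the statistics table above:
# (platform, source, number of prompts, number of jailbreak prompts).
ROWS = [
    ("Reddit", "r/ChatGPT", 176, 176),
    ("Reddit", "r/ChatGPTPromptGenius", 654, 24),
    ("Reddit", "r/ChatGPTJailbreak", 225, 225),
    ("Discord", "ChatGPT", 544, 214),
    ("Discord", "ChatGPT Prompt Engineering", 278, 67),
    ("Discord", "Spreadsheet Warriors", 61, 61),
    ("Discord", "AI Prompt Sharing", 24, 17),
    ("Discord", "LLM Promptwriting", 167, 78),
    ("Discord", "BreakGPT", 32, 32),
    ("Website", "AIPRM", 3930, 25),
    ("Website", "FlowGPT", 8754, 405),
    ("Website", "JailbreakChat", 79, 79),
    ("Dataset", "AwesomeChatGPTPrompts", 166, 2),
    ("Dataset", "OCR-Prompts", 50, 0),
]


def summarize(rows):
    """Return {platform: (prompts, jailbreaks)} summed over all sources."""
    totals = {}
    for platform, _source, prompts, jailbreaks in rows:
        p, j = totals.get(platform, (0, 0))
        totals[platform] = (p + prompts, j + jailbreaks)
    return totals


totals = summarize(ROWS)
all_prompts = sum(p for p, _ in totals.values())
all_jailbreaks = sum(j for _, j in totals.values())

for platform, (p, j) in totals.items():
    print(f"{platform:8s} prompts={p:6d} jailbreaks={j:5d} share={j / p:.1%}")
print(f"{'Total':8s} prompts={all_prompts:6d} jailbreaks={all_jailbreaks:5d}")
```

The summed columns reproduce the 15,140 prompts and 1,405 jailbreak prompts reported in the `Total` row.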
To evaluate the effectiveness of jailbreak prompts, we construct a question set comprising 390 questions across 13 forbidden scenarios adopted from the OpenAI Usage Policy.
We exclude the `Child Sexual Abuse` scenario from our evaluation and focus on the remaining 13 scenarios: `Illegal Activity`, `Hate Speech`, `Malware Generation`, `Physical Harm`, `Economic Harm`, `Fraud`, `Pornography`, `Political Lobbying`, `Privacy Violence`, `Legal Opinion`, `Financial Advice`, `Health Consultation`, and `Government Decision`.
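
As a quick sanity check on the numbers above, the scenario names can be held in a plain list and the question count divided across them. This is a sketch only: the variable names are hypothetical, and the even split is an assumption for illustration, since the text here states only the totals (390 questions, 13 scenarios).

```python
# The 13 evaluated scenarios, as named in the text above.
SCENARIOS = [
    "Illegal Activity",
    "Hate Speech",
    "Malware Generation",
    "Physical Harm",
    "Economic Harm",
    "Fraud",
    "Pornography",
    "Political Lobbying",
    "Privacy Violence",
    "Legal Opinion",
    "Financial Advice",
    "Health Consultation",
    "Government Decision",
]

TOTAL_QUESTIONS = 390  # size of the question set, from the text above

# Assumption: questions are spread evenly across scenarios.
per_scenario, remainder = divmod(TOTAL_QUESTIONS, len(SCENARIOS))
print(f"{len(SCENARIOS)} scenarios, {per_scenario} questions each, remainder {remainder}")
```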
The question set is provided here.
Add your data paths to `df_path_list` in `code/ChatGLMEval/run_evaluator.py`, then run the evaluator:

    cd code/ChatGLMEval
    python run_evaluator.py

For the semantics visualization, see `code/semantics_visualization/visualize.ipynb`.
We acknowledge that data collected online can contain personal information. We therefore adopt standard best practices to ensure that our study follows ethical principles, such as not attempting to deanonymize any user and reporting results only in aggregate. Since this study involves only publicly available data and no interaction with participants, it is not regarded as human subjects research by our Institutional Review Boards (IRB). Nonetheless, since one of our goals is to measure the risk of LLMs answering harmful questions, disclosing how a model can generate hateful content is unavoidable. This can raise concerns about potential misuse. However, we strongly believe that raising awareness of the problem is even more crucial, as it can help LLM vendors and the research community develop stronger safeguards and contribute to the more responsible release of these models.
We have responsibly disclosed our findings to related LLM vendors.
If you find this useful in your research, please consider citing:
@inproceedings{SCBSZ24,
author = {Xinyue Shen and Zeyuan Chen and Michael Backes and Yun Shen and Yang Zhang},
title = {{``Do Anything Now'': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models}},
booktitle = {{ACM SIGSAC Conference on Computer and Communications Security (CCS)}},
publisher = {ACM},
year = {2024}
}
`jailbreak_llms` is licensed under the terms of the MIT license. See LICENSE for more details.