- Xiangyu Zhang
- Yanju Chen
Submitted
Submission (20MB) Jan 20, 2025, 8:47:35 PM EST 6b4af82d9becfbf4de640c2e090bd30793b22ac9d5adcbccfa26d4eb937bc09f6b4af82d
- Zonglei Jing (Beihang University) <raykr@buaa.edu.cn>
- Zonghao Ying (Beihang University) <yingzonghao@buaa.edu.cn>
- Le Wang (Beihang University) <lewang@buaa.edu.cn>
- Siyuan Liang (Sun Yat-Sen University) <pandaliang521@gmail.com>
- Aishan Liu (Beihang University) <liuaishan@buaa.edu.cn>
- Xianglong Liu (Beihang University) <xlliu@buaa.edu.cn>
- Dacheng Tao (The University of Sydney) <dacheng.tao@gmail.com>
- Attacks with novel insights, techniques, or results
- ML and AI security and privacy: Security of AI
Artifact evaluation: Availability + functionality

| | RecDec | ConRecDec | EthCon.2 | OpeSciCom |
|---|---|---|---|---|
| Review #58A | 3 | 3 | 3 | 1 |
| Review #58B | 3 | 2 | 1 | 1 |
| Review #58C | 4 | 3 | 3 | 1 |
| Review #58D | 4 | 2 | 3 | 1 |
Review #58A
Paper summary
This paper proposes a jailbreak attack against the safety filters of AI text-to-image models. Specifically, it starts with a benign prompt and progressively alters it to increase the toxicity of the generated image.
Main reasons to accept the paper
- It is a captivating approach that tries to maximize the harm inflicted on humans through contextual drift.
- Extensive experimentation with ablation studies.
- Discussion of countermeasures.
Main reasons to reject the paper
- This attack focuses only on various versions of Stable Diffusion and one version of DALL-E (and not in all experiments), where it performs considerably worse. The authors indicate that the performance drop is due to more stringent safety filters. Because of this, the generalizability of the attack is unproven.
- The authors indicate that "the attack performance of cogmorph on commercial models has also surpassed that of other baselines on open source models". However, this appears to be true for a single commercial model.
- The creation of the risk matrix is based on an appeal to authority, as the details of the parameter selection are not mentioned. It is described as coming from "over 50 cognitive psychology studies", "extensive consultations with a panel of 12 experts", and LLMs to "systematically extract patterns in cognitive processing effects". Given the centrality of this matrix to the approach, more details are needed.
- The risk matrix parameters also need to be better motivated, as they should "capture nuanced cognitive processing patterns". E.g., the moral cognition harm of copyright infringement is very high (0.9), exceeded only by self-harm (0.95). Discrimination, violence, insult, etc. are all lower. At least personally, the harm I get from an image that infringes on copyright is null: the harm is all on the owner of the copyright. Similarly, the attention capture of copyright infringement is higher than that of violence and self-harm, and "harmful text" is more attention-capturing than everything but sexual content.
- Current countermeasures, especially harmful image checkers (post generation), appear to be successful at mitigating the attack.
Comments for authors
Thank you for your submission. My main concerns with this submission lie in the construction of the risk matrix and in how it influences the search for jailbreak prompts. Additionally, the paper would benefit from additional evidence that the attacks generalize to commercial solutions. Finally, the threat model is not clear to me: the adversary's goal is to generate harmful images, but what is the overall goal of using these images? This matters because, for example, copyright infringement is a very different harm from child abuse. Describing the attacker's overall goal would help in assessing the severity of the attack.
Recommended decision
3. Invite for Major Revision
Confidence in recommended decision
3. Fairly confident
Ethics consideration
3. No (risks, if any, are appropriately mitigated)
Open science compliance
1. Yes.
Questions for authors' response
- Can you explain how the values of the risk matrix were selected?
- Did you experiment with other commercial models?
Review #58B
Paper summary
This paper focuses on text-to-image (T2I) generative models. The authors propose a novel category of attacks against such models, referred to as cognitive morphing attacks (CogMorph). In CogMorph attacks, an attacker manipulates the model to generate images that retain their original core subject but embed toxic or harmful contextual elements. As stated in the paper, CogMorph attacks "exploit the cognitive principle that human perception of concepts is shaped by the entire visual scene and its context."
Main reasons to accept the paper
- Very timely, important, and interesting topic.
- The focus on the cognitive principle that human perception of concepts is shaped by the entire visual scene and its context is really interesting and really important, and it likely extends far beyond T2I models.
- The proposed risk matrix is interesting and has a lot of potential.
- The developed image toxicity checker is really interesting, and it is likely a meaningful contribution by itself.
Main reasons to reject the paper
- Missing information about the involvement of cognitive psychology experts in the development of the risk matrix.
- Largely missing justification for the derivation of the risk matrix.
- Missing information about the derived prompts.
- Somewhat inconsistent assumption about the use of RAG.
- Missing information about the nature of the human evaluation of CogMorph attacks through the online survey.
Comments for authors
- In Subsection 4.2 of the paper, it is stated: "A critical aspect of the development process for the Toxicity Risk Matrix was the collaborative determination of both base scores and dimension-specific weights. This process relied on the synergy between cognitive psychology experts and LLMs."
  Yet the rest of the paper does not provide any information on how cognitive psychology experts were involved in this research.
  - Were they invited to participate in a work group/session?
  - If so, was that work session reviewed and approved by an IRB?
  - What were the agenda, the planned goals, and the actual outcomes of that session?
- Subsection 4.3 of the paper talks about dataset generation. It is stated: "The result is a meticulously balanced dataset comprising 1,176 high-quality prompts that provide comprehensive coverage of the target domains while maintaining strict quality standards defined by our risk matrix."
  This is an impressive number of prompts. However, given that any further information about the composed prompts is missing, several important questions arise:
  - How can anyone else replicate this work using the same prompts?
  - How can anyone else evaluate the quality of the derived prompts?
  - How can anyone else extend this work without any information about the prompts themselves or their nature?
- In Section 5.1, it is stated: "We utilize multi-round RAG [72], a method that retrieves relevant external documents to enhance input processing through iterative interactions."
  This is really interesting, but it raises an important question about the generalizability and applicability of the proposed attack. Namely, it is not clear whether this approach is applicable to commercial T2I models, which attackers can access only through an API. It would be very helpful if the paper could elaborate on this.
- The following information about the nature of the human evaluation, conducted through a survey, is missing from the paper:
  - Has the survey been reviewed and approved by a relevant institutional review board?
  - How were subjects recruited for the study?
  - Summary demographic information about the subjects (how many participants, age range, etc.).
  - It would be really helpful if access to the entire survey were provided in the appendix or through an anonymous website (including all the images used and all the questions asked of human subjects).
- The paper could benefit from additional proofreading: some minor language and grammatical issues remain.
Recommended decision
3. Invite for Major Revision
Confidence in recommended decision
2. Highly confident (would try to convince others)
Ethics consideration
1. Yes (there is reason to believe risks may not be appropriately mitigated)
Comments for ethics consideration
It is stated in the paper that human experts were involved in the development of the risk matrix, and that human subjects were involved in the evaluation of the effectiveness of the proposed CogMorph attack. However, nothing is said about the involvement of the experts, about the design of the survey itself, about any relevant IRB or governing body approving the survey, or about the recruitment and consent process for the survey.
Open science compliance
1. Yes.
Questions for authors' response
None.
Review #58C
Paper summary
The paper tackles the problem of harmful content generation with generative models. It introduces a new technique to automatically alter prompts to increase the harmfulness of the generated images. The proposal introduces a toxicity taxonomy spanning 10 major and 48 subcategories, which is used in conjunction with a risk matrix to fine-tune the prompt to increase harmfulness. Finally, the paper also introduces a detector to more readily identify harmful generated content.
Main reasons to accept the paper
- Very important topic
- Interesting interdisciplinary approach
- Clearly written
Main reasons to reject the paper
- Some exaggerated claims
- Unclear methodological aspects
- Issues with maintaining categories and the definition of harmful
- The images seem easily detected
Comments for authors
Thank you for your interesting manuscript. The topic of the paper is extremely important and current. The proposed approach of blending ideas from multiple disciplines is also interesting and a good way forward in tackling this sort of issue. Finally, the paper is clearly written and understandable; well done.
However, I also find some important drawbacks and issues, particularly with general vagueness and lack of rigor in the paper, especially in defining harm and the lack of consideration for context:
- The paper states multiple times that it "reveals a significant and previously unrecognized ethical risk". I do not think this is true, as the proposal simply fine-tunes jailbroken outputs to be more harmful. This does not highlight an unrecognized significant ethical risk; it only exacerbates an existing, known one.
- The methodological description of the construction of the risk matrix is quite unclear and too high-level. There is very little actual information on how the values were derived and how the experts' judgment was integrated with LLMs to reach the final result.
- While the paper stresses that the approach maintains context and category, this does not seem to be the case in almost any of the undetected samples provided. At the very least, most of the images don't seem to have much to do with their associated categories: insult is a skeletal figure pointing a finger, violence is one football player seemingly screaming, illicit is a picture of a skull on a table, and harmful text is a skeletal hand on a wall. There seems to be mostly no relation whatsoever with the associated categories. Moreover, even the classification of such images as harmful seems very arbitrary.
- The likely key issue with the above point, and by extension with the paper, is the undefined concept of "harmful". Indeed, the harmfulness of something is strongly dependent on its context, a key element that does not seem to be captured in the paper and, especially, in the questionnaire. Most of the images shown in the paper might exhibit some dark elements. However, I would be hard-pressed to identify many of these as harmful. The image of a topless woman in a neon dress is hardly sexual. The same goes for the picture of a skull on a table (why would this be illegal? Even ChatGPT generated a similar image when asked directly, without much issue). In what way would the horror-category image with the skeleton on the staircase be harmful? Context matters, and almost none of these images seem to show anything particularly harmful or shocking, which makes me seriously doubt the proposal.
- Connected to the above point, the questions in the questionnaire are extremely vague and too imprecise to provide a solid base for evaluation. Once again, "harmful" is such a vague definition that it fails to provide any meaningful information, especially when no context is provided and images are simply shown as is. A more solid analysis could have leveraged the taxonomy introduced by the paper, providing metrics on whether the images generated by the approach actually belong to the specified category.
- The most harmful images generated seem to be easily detected, and therefore generation can be fairly easily prevented if one so chooses. This seems to run counter to the paper's claim of revealing a novel, significant, unrecognized ethical risk.
- Section 6.1 mentions that A-VLIC leverages RAG on reference documents to improve the detection rate. It is unclear whether the RAG is performed on the same data used for the image generation optimization. If this is the case, the detection results are strongly biased.
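The concern about RAG reference documents overlapping with the optimization data could be checked with a simple set intersection. The following is a minimal sketch, assuming both collections are available as lists of strings; the function names and the example strings are hypothetical and are not taken from the paper.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def overlap_report(reference_docs, optimization_prompts):
    """Return the shared normalized entries and the fraction of reference docs affected."""
    ref = {normalize(d) for d in reference_docs}
    opt = {normalize(p) for p in optimization_prompts}
    shared = ref & opt
    fraction = len(shared) / len(ref) if ref else 0.0
    return shared, fraction


# Hypothetical placeholder data, for illustration only.
reference_docs = ["A quiet street at night", "Portrait of a soldier", "An empty classroom"]
optimization_prompts = ["a quiet  street at NIGHT", "A crowded market"]

shared, fraction = overlap_report(reference_docs, optimization_prompts)
print(f"{len(shared)} shared entries, {fraction:.0%} of the reference set")
```

Exact matching after normalization only catches verbatim reuse; near-duplicates would need a fuzzier comparison. A non-zero overlap would suggest that the reported detection rate is optimistic.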
Recommended decision
4. Reject
Confidence in recommended decision
3. Fairly confident
Ethics consideration
3. No (risks, if any, are appropriately mitigated)
Open science compliance
1. Yes.
Questions for authors' response
- Definition of harmfulness and vagueness
- Images are detected as harmful
Review #58D
Paper summary
The paper introduces the CogMorph Attack, a method that manipulates text-to-image models to generate images with toxic or harmful contexts while preserving the original subject matter. This attack exploits human cognitive processes, where perception of concepts is shaped by the entire visual scene and its context, thus amplifying emotional harm. The experiments show that the proposed method outperforms existing methods in increasing harmfulness of the generated images and the jailbreaking success rate.
Main reasons to accept the paper
- It introduces the concept of the Cognitive Morphing Attack, which uses the process of human cognition to craft adversarial prompts that jailbreak a text-to-image model and enhances emotional harmfulness through contextual alterations.
- The paper includes an evaluation on both open-source and commercial text-to-image models, which demonstrates the effectiveness of the attack.
- The paper includes human assessments to validate the effectiveness of the proposed attack.
- The development of the toxicity taxonomy and the risk matrix by cognitive science experts seems meaningful to this field and to future research.
Main reasons to reject the paper
- The countermeasures evaluated in the paper are too naive. Evaluating only prompt filters and image checkers is not enough to demonstrate the effectiveness of the proposed attack against safety guardrails. More advanced and effective safety guardrails [A, B, C] have been proposed in this community, and I think the paper should evaluate them.
  A. Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models
  B. MACE: Mass Concept Erasure in Diffusion Models
  C. SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models
- The comparison with the baseline attack methods seems unfair. According to the paper, the proposed attack starts from a benign prompt and crafts it into an adversarial prompt by introducing toxic elements. However, SneakyPrompt starts from a toxic prompt and modifies it to bypass the safety guardrails. Given the different starting points of the prompts, I wonder how the paper fairly compares these two methods and makes metrics like TESR and ATI comparable for them.
- The reasons for using A-VLIC and its parameter selections should be discussed in detail. Since it forms the main metrics in the paper, is it really an appropriate, accurate, and objective way to measure the toxicity score of an image?
Comments for authors
- More safety guardrails should be evaluated to demonstrate the effectiveness of the proposed methods.
- A fair comparison with baseline methods is needed to demonstrate that the proposed attack outperforms the existing ones.
- Carefully discuss and examine whether the toxicity score calculated by A-VLIC objectively reflects the toxicity of an image.
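One possible way to examine whether A-VLIC scores track human judgment is a rank-correlation check between the automated scores and human ratings of the same images. The following is a minimal sketch, assuming per-image A-VLIC scores and mean human ratings are available; the function names and all numbers are hypothetical placeholders, not data from the paper.

```python
def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical placeholder values, for illustration only.
avlic_scores = [0.91, 0.40, 0.75, 0.10, 0.62]
human_ratings = [4.5, 2.0, 3.0, 1.0, 3.8]

rho = spearman(avlic_scores, human_ratings)
print(f"Spearman rho = {rho:.2f}")
```

A high correlation would support using the automated score as a proxy for human perception; a low one would suggest the metric measures something else. Reporting this per toxicity category would also show where the checker and human raters disagree.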
Recommended decision
4. Reject
Confidence in recommended decision
2. Highly confident (would try to convince others)
Ethics consideration
3. No (risks, if any, are appropriately mitigated)
Open science compliance
1. Yes.
Questions for authors' response
- The proposed method seems easy to defend against with advanced safety guardrails like SafeGen or MACE. What are the results when the proposed method faces such advanced safety guardrails?
- How can you fairly compare SneakyPrompt with the proposed method and make metrics like TESR and ATI comparable for them?
- Is the toxicity score calculated by A-VLIC really objective?