
AI Safety and Alignment

Brief introduction

In session 4(a), we started with safety: AI is good at the good things we do, and it is also good at the bad things.

Then, we moved to alignment and saw that it is already a problem for technical reasons in the domain of Reinforcement Learning.

Naturally we asked: how to align AI with our … something (commands, desires, preferences, moral systems, values …)

We initially wanted to solve narrow problems (like the Kill All Humans Problem). But once we solved them, new problems arose.

In the end we found that answering them requires moral philosophy and political theorizing.

Definitions

AI Safety: assessing and preventing harms that arise from the use of AI

(AI safety is an interdisciplinary field focused on preventing accidents, misuse, or other harmful consequences arising from artificial intelligence (AI) systems. It encompasses machine ethics and AI alignment, which aim to ensure AI systems are moral and beneficial, as well as monitoring AI systems for risks and enhancing their reliability. The field is particularly concerned with existential risks posed by advanced AI models.)

AI Alignment: ensuring AI has appropriate values for living in the world with us.

Their relationship: safety is the broader category; alignment is an (important) part of it.

Their contrast: AI safety is broadly concerned with harms, like terrorists using bots to make weapons, while AI alignment is more narrowly concerned with ensuring AI systems share moral values with us.

ML Safety research: ML research aimed at making the adoption of ML more beneficial, with emphasis on long-term and long-tail risks. We focus on cases where greater capabilities can be expected to decrease safety, or where ML Safety problems are otherwise poised to become more challenging in this decade.

AI Safety

It is very broad, being concerned with harms in general; many of the things we have considered so far (bias, surveillance) can be considered topics in AI safety.

To deal with these problems requires both technical knowledge (better algorithms) and non-technical knowledge (about people, how they interact with technology, and so on).

A taxonomy

Misuse risk: This includes cases in which the AI system is just a tool, but the goals of the humans augmented by AI cause harm. This includes malicious actors, nation states, corporations, or individuals who are able to leverage advanced capabilities to accelerate risks. Essentially, these risks are the responsibility of some human or group of humans.

Example: a terrorist individual or nation asks Grok to tell it how to make a biological weapon, or a nation uses AI-generated deepfakes to undermine confidence in an enemy country during election time.

Misalignment risk: These risks are caused by inherent problems in the machine learning process or other technical difficulties in AI design. This category also includes risks from multiple AIs interacting and cooperating with each other. These are risks due to unintended behavior by AIs, independent of human intentions.

Systemic risk: These risks deal with disruptions, or feedback loops arising from integrating AI with other complex systems in the world. In this case upstream causes are difficult to pin down, since the responsibility for risk is diffuse amongst many actors and interconnected systems. Examples could include AI (or groups of AIs) having an influence on economic, logistic, or political systems. This causes various types of risk as the entire global system of human civilization moves in an unintended direction, despite individual AIs being potentially aligned and responsibly used.

AI Alignment

AI Alignment: “The goal of AI value alignment is to ensure that powerful AI is properly aligned with human values (Russell 2019, 137). Indeed, this task, of imbuing artificial agents with moral values, becomes increasingly important as computer systems operate with greater autonomy and at a speed that ‘increasingly prohibits humans from evaluating whether each action is performed in a responsible or ethical manner’ (Allen et al. 2005, 149).”

What are the two parts of AI Alignment?

“The challenge of alignment has two parts. The first part is technical and focuses on how to formally encode values or principles in artificial agents so that they reliably do what they ought to do … challenges … include how to prevent ‘reward-hacking’, where the agent discovers ingenious ways to achieve its objective or reward, even though they differ from what was intended… The second part of the value alignment question is normative. It asks what values or principles, if any, we ought to encode in artificial agents.”

The first part: Technical

Reinforcement learning is a sort of machine learning. We want the system to achieve a goal, and we do so by letting it try various actions. When it performs an action we think furthers its goal, we reward it, while if it performs an action we think hinders it from achieving the goal, we punish it. (E.g. King Midas’s gold toilet: Midas asked for everything he touched to be turned to gold. He misspecified his goal. We need to be careful not to misspecify the goals of AI systems so that they are contrary to our values and goals.)
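To make this concrete, here is a minimal sketch of the reward/punish loop just described. Everything in it (the toy action set, the judgement function, the crude score-keeping learner) is a hypothetical stand-in rather than a real RL algorithm or library:

```python
# Minimal, illustrative sketch of the reward/punish loop described above.
# The actions, the judgement function, and the crude learner are all invented.
import random

ACTIONS = ["stack_brick", "knock_over_tower", "wander"]

def looks_helpful(action):
    """Our (fallible) judgement that an action furthers the goal."""
    return action == "stack_brick"

def reward(action):
    # +1 when we judge the action to further the goal, -1 when we judge it to hinder it.
    return 1.0 if looks_helpful(action) else -1.0

# A very crude learner: keep a running score per action and prefer the highest-scoring one.
scores = {a: 0.0 for a in ACTIONS}
for _ in range(1000):
    action = random.choice(ACTIONS)
    scores[action] += reward(action)

best = max(scores, key=scores.get)
print(best)  # the agent converges on whatever we rewarded, which is only as good
             # as our specification of the goal (compare Midas)
```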

We reward a system for performing actions that we think further its goal. But this can cause problems if we are wrong about the link between rewards and goals.

Reward misspecification occurs when the specified reward function does not accurately capture the true objective or desired behavior. (E.g. the goal was to place the red brick on top of the blue one. But we misspecified the reward for that goal: the system was judged to have met its goal, and so was rewarded, whenever the red brick’s bottom was off the ground.)
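A hedged sketch of how such a misspecification can look in code: the true objective checks whether the red brick is actually on the blue one, while the reward that was written down only checks whether the brick’s bottom is off the ground. The state fields and numbers are invented for illustration:

```python
# Hypothetical encoding of the brick example; the state fields are invented.
from dataclasses import dataclass

@dataclass
class BrickState:
    red_bottom_height: float  # height of the red brick's bottom face above the ground
    red_on_blue: bool         # is the red brick actually resting on top of the blue one?

def true_objective(s: BrickState) -> float:
    # What we actually wanted rewarded.
    return 1.0 if s.red_on_blue else 0.0

def misspecified_reward(s: BrickState) -> float:
    # What we actually rewarded: "the red brick's bottom is off the ground".
    return 1.0 if s.red_bottom_height > 0.0 else 0.0

# Flipping the red brick upside down lifts its bottom face without stacking it.
flipped = BrickState(red_bottom_height=0.1, red_on_blue=False)
print(misspecified_reward(flipped))  # 1.0 -> rewarded
print(true_objective(flipped))       # 0.0 -> the real goal was not achieved
```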

Reward hacking refers to the behavior of RL agents exploiting gaps or loopholes in the specified reward function to achieve high rewards without actually fulfilling the intended objectives. (E.g. the goal is to finish a race. The system is rewarded if it hits certain checkpoints. So it just did a loop around three early checkpoints.)
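The racing example can be sketched the same way, with an invented checkpoint reward: an agent that loops around the early checkpoints outscores one that actually finishes the race. The reward values and the two policies below are assumptions for illustration only:

```python
# Hypothetical checkpoint-based reward for the racing example.
CHECKPOINT_REWARD = 10.0
FINISH_REWARD = 50.0

def episode_reward(checkpoints_hit: int, finished: bool) -> float:
    return checkpoints_hit * CHECKPOINT_REWARD + (FINISH_REWARD if finished else 0.0)

# Intended behaviour: drive the course once, hitting its 5 checkpoints, and finish.
honest = episode_reward(checkpoints_hit=5, finished=True)    # 100.0

# Reward hacking: loop around the first three checkpoints ten times and never finish.
hacked = episode_reward(checkpoints_hit=30, finished=False)  # 300.0

print(honest, hacked)  # the specified reward prefers the hack over finishing the race
```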

Thus we can see that goal misgeneralization could be a big problem. In this regard, Gabriel says:

we should recognize that these systems, taken as a whole, function as very powerful optimisers. Indeed, this tendency sits at the heart of certain long-term safety concerns about AI; if it were to optimise for something that we did not really want, then this could have serious consequences for the world (Bostrom 2016).

In the context of value alignment, the notion of 'value' can serve as a place-holder for many things. In the early days of AI research, Norbert Wiener wrote that, 'if we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively... we had better be quite sure that the purpose put into the machine is the purpose which we really desire' (1960). More recently, the Asilomar AI Principles held that, 'Highly autonomous AI systems should be designed so that their goals and behaviours can be assured to align with human values throughout their operation' (Asilomar 2018). And, Leike et al. argue that the key question is 'how can we create agents that behave in accordance with the user's intentions?' (2018). Despite the apparent similarity of these formulations, there are significant differences between desires, values, and intentions. Which of these, if any, should AI really be aligned with?

Among the technical research community, there is a relatively clear consensus that we do not want artificial agents to follow instructions in an extremely literal way. Yet, beyond this, important questions remain. For example, do I want the agent to do what I intend it to do, or should it do what is good for me? And what should the agent do if I lack important information about my situation, engage in faulty reasoning, or attempt to implicate the agent in a process harmful to myself or others?

The second part: Normative

It is led by a question: What should we align AI with?

Answer

i) Align AI with the instructions we give it

(“it is difficult to precisely specify a broad objective that captures everything we care about, so in practice the agent will probably optimise for some proxy that is not completely aligned with our goal (Cotra 2018). Even if this proxy objective is ‘almost’ right, its optimum could be disastrous according to our true objective.”)

ii) with what we expressly intend it to do (what we say we intend it to do)

(“many researchers have argued that the challenge is to ensure AI does what we really intend it to do (Leike et al. 2018). An artificial agent that understood the principal’s intention in this way would be able to grasp the subtleties of language and meaning that more naive forms of AI might fail to understand. Thus, when performing mundane tasks, the agent would know not to destroy property or human life. It also would be able to make reasonable decisions about trade-offs in high-stakes situations. This is a significant challenge. To really grasp the intention behind instructions, AI may require a complete model of human language and interaction, including an understanding of the culture, institutions, and practices that allow people to understand the implied meaning of terms (Hadfield-Menell and Hadfield 2018).” BUT: Maybe the system will be too quick or smart for us to interact with it properly and to form intentions about what it should do. Sometimes our intentions are irrational or misinformed or just bad.)

iii) with what my behaviour reveals I want (my ‘revealed preferences’)

(“the most developed accounts focus[] on AI alignment with preferences as they are revealed through a person’s behaviour rather than through expressed opinion.” So, it observes us, sees what we want, and acts accordingly. BUT: “Revealed preferences also provide reliable information only about choices that people encounter in real life. This means that it is hard to model preferences for situations that are rarely observed, even though these cases—for example, emergencies—could be morally important.” I don’t think my behaviour would reveal a preference concerning trolley problem cases, but such cases might happen. And sometimes our preferences are to hurt ourselves or others. Would we want a machine that acts in accordance with a racist society? (Call this the Racist Bot Problem.))

iv) My informed preferences: with what I would do if I were rational and informed

(A popular idea in philosophy: give someone full information, and make them completely rational, then see what they want. Problems: It’s not realistic! It arguably doesn’t even solve the problem: you can have full information and be rational and still want silly or bad things. A big debate in moral philosophy about this …)

v) what is best for me, objectively speaking

(“the agent does what is in my interest, or what is best for me, objectively speaking”, such as physical health and security, nutrition, shelter, education, autonomy, social relationships, and a sense of self-worth.)

vi) the correct moral values

How do we decide on values?

Gabriel’s solution: pluralism. “The task in front of us is not, as we might first think, to identify the true or correct moral theory and then implement it in machines. Rather, it is to find a way of selecting appropriate principles that is compatible with the fact that we live in a diverse world, where people hold a variety of reasonable and contrasting beliefs about value.”

We need to find some principles that help us negotiate with one another, given the different values we hold. Gabriel here draws on work in political theory and political philosophy.

“political liberals argue that it is possible to identify principles of justice that are supported by an overlapping consensus of opinion.”
"政治自由主义者认为,有可能确定得到重叠共识支持的正义原则"。

Supplements to the theory:

Human rights: “there are certain things that most people agree upon in practice, including the notion that individuals deserve some measure of protection from physical violence and bodily interference, regardless of the society they happen to live in. This common ground may also include the idea that people are entitled to certain basic goods such as nutrition, shelter, health care, and education…. human rights have been endorsed by different groups for different reasons. Some people favour universal human rights because they believe that human life is sacred, while others see human rights as products of a contract between state and citizen, and still others see in human rights a tried and tested way of promoting welfare and minimizing harm. Lastly, the idea of human rights has significant cross-cultural support, with justifications found in African, Islamic, Western, and Confucian traditions of thought.”

The Veil Of Ignorance: “Rawls proposes a thought experiment in which parties select principles from behind a ‘veil of ignorance’—a device that prevents them from knowing their own particular moral beliefs or the position they will occupy in society. Behind the veil, ‘the parties... do not know how the various alternatives will affect their own particular case and they are obliged to evaluate principles solely on the basis of general considerations’ (Rawls 1971, 137). The outcome of deliberation under these conditions is principles that do not unduly favour some over others. Such principles are therefore, ex hypothesi, fair. Adapting this methodology to the present case we can ask what principles would be chosen to regulate AI. What principles would people choose to regulate the technology if they did not know who they were or what belief system they ascribed to?”

Social Choice Theory: The final solution: do what we already do. Have people vote as a way to aggregate their preferences, in the hope of producing an outcome everyone agrees is legitimate.
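As a rough illustration of what “voting to aggregate preferences” might look like, here is a minimal sketch using two standard aggregation rules (plurality and a Borda count). The voters, the ballots, and the candidate alignment targets are invented for illustration; it is not meant as a serious social-choice model:

```python
# A minimal sketch of aggregating individual preference rankings by voting.
# Voters, ballots, and candidate "alignment targets" are invented for illustration.
from collections import Counter

# Each ballot ranks the candidate targets from most to least preferred.
ballots = (
    3 * [["instructions", "revealed_preferences", "informed_preferences"]]
    + 2 * [["revealed_preferences", "informed_preferences", "instructions"]]
    + 2 * [["informed_preferences", "revealed_preferences", "instructions"]]
)

# Plurality: count first choices only.
plurality = Counter(ballot[0] for ballot in ballots)

# Borda count: with n options, the top choice gets n-1 points, the next n-2, and so on.
borda = Counter()
for ballot in ballots:
    n = len(ballot)
    for rank, option in enumerate(ballot):
        borda[option] += n - 1 - rank

print(plurality.most_common(1))  # [('instructions', 3)]
print(borda.most_common(1))      # [('revealed_preferences', 9)]
```

Note that the two rules crown different winners on the same ballots, which is part of why treating any particular aggregation procedure’s outcome as “legitimate” is itself a contested normative question.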