Reward Design for Safe Reinforcement Learning

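Basic information

Grant number: 2872672
Principal investigator:
Amount: --
Host institution: University of Oxford
Host institution country: United Kingdom
Project category: Studentship
Fiscal year: 2023
Funding country: United Kingdom
Start and end dates: 2023 to (no data)
Project status: Not yet concluded

Source: https://gtr.ukri.org/projects?ref=studentship-2872672
Keywords: Reward Design Safe Reinforcement Learning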

Project abstract

In my DPhil, I intend to focus on the safe development of autonomous systems: algorithms that will be deployed in ways that change their environment and that have to make sequences of decisions. One popular paradigm for creating decision-making agents is reinforcement learning (RL). Training an RL agent involves two stages: (1) designing the reward signal used to 'score' behaviour and (2) using that reward signal to train a high-scoring agent. Much previous research has focussed on the challenges of training an agent to get a high reward. However, specifying a reward that captures exactly what designers want is extremely challenging, especially in complex, real-world environments. If the reward function is misspecified, competent optimisers can learn to behave in unpredictable and undesirable ways.

In recent years, reward learning has become a popular way to specify rewards in complicated environments. For example, ChatGPT uses a reward model trained on human labels. These reward models only approximate the designers' intentions, and models may learn to exploit errors in the reward model to get rewards for undesirable actions. Forming a better understanding of how the inaccuracies of ChatGPT's reward models influence its behaviour may be an important step towards avoiding unsafe or antisocial behaviour.

I want to further develop the theory of reward function design to create safe decision-making systems. My aims and objectives are as follows:

1. To develop the theory of how agents fail when their reward functions are misspecified. For example, we can study ways to softly optimise an imperfect reward function to avoid unsafe behaviour. Alternatively, we can try to derive bounds on the error in the performance of a model in terms of the error in a reward model.
2. To develop the theory of ways to design safer or more accurately specified reward functions. We can investigate whether some reward misspecifications lead to more benign behaviours than others, or find ways to improve reward learning methods.
3. To investigate alternative training methods that side-step the need for a reward function. One such method is cooperative inverse reinforcement learning, which asks agents to model their uncertainty about their goals and to ask questions when they are uncertain. Another method might be training agents using goal-conditioning.

The novelty of this research direction is the focus on the design of the reward rather than on the training process, and on the safety rather than the competence of agents. When RL has historically been applied in small or toy environments, the complexities of reward design were obscured relative to the challenges of learning to score a high reward. I instead aim to abstract away learning to score a high reward, by asking: if agents were very competent at doing what we reward them for doing, how do we reward them for the right behaviours? I intend to develop previous work from the OxCAV group on reward theory, such as in impact regularisation, reward gaming and Goodhart's Law. This project falls within the EPSRC Artificial Intelligence Technologies research area.
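The idea in aim 1 of softly optimising an imperfect reward can be illustrated with a toy sketch. The abstract does not specify a method, so everything below is an assumption made for illustration: the one-step (bandit) setting, the reward values, the function names, and the choice of a Boltzmann (softmax) policy as one possible form of soft optimisation. It compares a policy that fully maximises a misspecified proxy reward with one that only softly prefers high proxy reward.

```python
import math

# Hypothetical one-step (bandit) problem; all numbers are illustrative only.
# true_reward is what the designer actually wants.
# proxy_reward is an imperfect reward model that is badly wrong on the last action.
true_reward = [1.0, 0.9, 0.8, -5.0]
proxy_reward = [1.0, 0.9, 0.8, 1.5]


def hard_optimise(proxy):
    """Deterministic policy that puts all probability on the proxy's argmax."""
    best = max(range(len(proxy)), key=lambda a: proxy[a])
    return [1.0 if a == best else 0.0 for a in range(len(proxy))]


def soft_optimise(proxy, temperature):
    """Boltzmann (softmax) policy over proxy rewards.

    A higher temperature means weaker optimisation pressure on the proxy.
    """
    top = max(proxy)  # subtract the max for numerical stability
    weights = [math.exp((r - top) / temperature) for r in proxy]
    total = sum(weights)
    return [w / total for w in weights]


def expected_reward(policy, reward):
    """Expected reward of a stochastic policy in the one-step setting."""
    return sum(p * r for p, r in zip(policy, reward))


hard = hard_optimise(proxy_reward)
soft = soft_optimise(proxy_reward, temperature=1.0)

print("hard policy:", hard)
print("soft policy:", [round(p, 3) for p in soft])
print("true reward, hard:", expected_reward(hard, true_reward))
print("true reward, soft:", round(expected_reward(soft, true_reward), 3))
```

In this contrived example the hard optimiser commits entirely to the one action where the proxy is wrong, while the soft policy gives up some proxy reward and loses less true reward. This is not a general guarantee; it only shows the kind of trade-off that a theory of soft optimisation would need to characterise. On the error-bound side of the same aim, a simple fact holds in this one-step setting: if the proxy is within ε of the true reward on every action, the proxy-optimal action loses at most 2ε of true reward relative to the best action.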
