Conditions and methods for Decentralised Reinforcement Learning

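Basic information

Grant number:
2619847
Principal investigator:
Amount:
--
Host institution:
Imperial College London
Host institution country:
United Kingdom
Project category:
Studentship
Fiscal year:
2021
Funding country:
United Kingdom
Start and end dates:
2021 to (no data)
Project status:
Not concluded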

Source:
https://gtr.ukri.org/projects?ref=studentship-2619847
Keywords:
Conditions methods Decentralised Reinforcement Learning

Project abstract

Advances in reinforcement learning (RL) over the last decade have made it a hot research topic. Improvements in hardware performance and the combination of RL with neural networks have enabled algorithms that achieve state-of-the-art performance in many control problems, including computer games in which they beat human champions. Open questions remain, however: how to learn in more complex environments, how to learn more efficiently from limited samples, and how to learn more general tasks.

One approach to learning in very complex environments is to decentralise the control task across multiple agents rather than assigning it to a single centralised one. This can greatly reduce the complexity of learning for each agent, at the possible expense of a more limited set of policies (action plans) that the group of agents can enact, and of other technical issues affecting the stability of the training process. Decentralisation arises quite naturally in many scenarios, such as self-driving cars, where each agent can be one car, or resource assignment in a cluster of computers, where each agent could control the tasks assigned to one computer. These decentralised agents can have various levels of communication and synchronisation with each other, which affects the size of the set of policies available to them.

My research deals with agents that communicate implicitly: they do not directly share their status (state information) with each other, but they observe common features of the environment that allow them to collect information about the status of the other agents. The first question I aim to answer is: in which scenarios can such a decentralised RL system achieve the same level of performance as a centralised single agent? Answering it involves setting mathematical conditions on the states, the rewards, and the policy. This is done by modelling the decentralised solution as a decentralised partially observable Markov decision process (Dec-POMDP), which accounts for both the decentralisation of the agents and the partial observability of the environment from each agent's point of view. I then want to investigate the effect of applying decentralisation in more general scenarios. Can I derive theoretical bounds on the performance loss due to decentralisation under certain conditions? Are there special conditions under which decentralisation is especially useful?

Subsequently, I want to use these conditions to develop an algorithm that can easily distinguish between tasks that are decentralisable and those that are not. Depending on what the mathematical conditions are, this may be done directly using the derived formula, but it could also involve massive computation. In the latter case, it would be useful to create approximations that make it easy to test how decentralising the solution of a task affects the theoretical performance bounds.

While there has been much research on decentralised RL algorithms that try to achieve maximum performance in every kind of scenario, the relationship between centralised and decentralised RL solutions has not been explored in depth. My PhD research aims to provide a theoretical foundation for this relationship, together with novel tools, in the form of algorithms, that allow the designer of a decentralised solution to know the maximum theoretical performance that a given design can achieve. This research has the potential to be applied to the control of many systems whose components must cooperate to achieve optimal performance. Examples include self-driving cars, robotics, and communication networks. My research is aligned with the EPSRC research areas "Artificial Intelligence technologies" and "ICT networks and distributed systems".
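
The gap between centralised and decentralised control described in the abstract can be illustrated with a toy example. The sketch below is not the project's method: it assumes a hypothetical one-step, two-agent problem in which the joint state (s1, s2) is uniform over {0, 1}², agent i observes only s_i, and all function names and both reward functions are invented for illustration. It brute-forces the best centralised policy (which sees the full state) and the best pair of local policies, then reports the difference.

```python
from itertools import product

# Hypothetical toy problem: joint state (s1, s2), uniformly distributed.
STATES = list(product([0, 1], repeat=2))
ACTIONS = [0, 1]


def centralised_value(reward):
    """Best expected reward when a single controller sees the full state."""
    total = sum(
        max(reward(s, a1, a2) for a1, a2 in product(ACTIONS, repeat=2))
        for s in STATES
    )
    return total / len(STATES)


def decentralised_value(reward):
    """Best expected reward when agent i sees only s_i (exhaustive search)."""
    # A local policy is a tuple: (action if obs == 0, action if obs == 1).
    local_policies = list(product(ACTIONS, repeat=2))
    best = float("-inf")
    for p1, p2 in product(local_policies, repeat=2):
        value = sum(reward(s, p1[s[0]], p2[s[1]]) for s in STATES) / len(STATES)
        best = max(best, value)
    return best


def decentralisation_gap(reward):
    """Performance lost by restricting each agent to its own observation."""
    return centralised_value(reward) - decentralised_value(reward)


def separable_reward(s, a1, a2):
    # Assumed reward: each agent is scored on matching its OWN observation.
    return 0.5 * (a1 == s[0]) + 0.5 * (a2 == s[1])


def coupled_reward(s, a1, a2):
    # Assumed reward: each agent must match the OTHER agent's observation.
    return float(a1 == s[1] and a2 == s[0])


if __name__ == "__main__":
    for name, r in [("separable", separable_reward), ("coupled", coupled_reward)]:
        print(name, centralised_value(r), decentralised_value(r),
              decentralisation_gap(r))
```

In this toy setting the separable reward loses nothing under decentralisation, while the coupled reward does, because each agent would need information that only the other agent observes. Exhaustive search like this only scales to tiny problems, which is consistent with the abstract's remark that checking such conditions could involve massive computation.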
