
Improved analysis of policy gradient methods in reinforcement learning.


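Basic information

Grant number:
2602524
Principal investigator:
Amount:
--
Host institution:
Imperial College London
Host institution country:
United Kingdom
Project category:
Studentship
Fiscal year:
2021
Funding country:
United Kingdom
Start and end dates:
2021 to (no data)
Project status:
Not concluded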

Source:
https://gtr.ukri.org/projects?ref=studentship-2602524
Keywords:
Improved analysis policy gradient methods

Project abstract

Reinforcement learning is a popular branch of machine learning that aims to solve sequential decision-making problems in an environment. It has a wide variety of applications, including autonomous driving, robotics, recommendation systems and healthcare. In some of these applications, the cost of a wrong decision could be dramatic. In applications like autonomous driving in particular, human lives are at stake. As such, it is of crucial importance that we understand how the methods work and whether they really do work in the way that was intended.

However, the methods that are used in practice are often only poorly understood. The theory describing these methods is currently unable to explain the huge successes that reinforcement learning has enjoyed in practice. The aim of this project is to provide improved theoretical guarantees for policy gradient methods, which form the basis for many practical implementations of reinforcement learning. These methods are used particularly for the large-scale problems that are often faced in practice.

Specifically, theory on algorithms of this type takes the form of convergence bounds. That is, the algorithm aims to output an optimal solution to the problem, and we are interested in understanding how quickly it outputs something close to this optimal solution, where the notion of closeness is mathematically precise. An improved analysis here means showing that an algorithm converges faster than was previously proven.

Recently, a particular type of policy gradient method in a specific setting has been studied from a new perspective known as policy mirror descent. What exactly this means is not too important, except that mirror descent is a concept from optimisation theory that has been heavily studied in that setting. As such, tools and methods of analysis may be translated from optimisation theory to this reinforcement learning framework. This can be exploited to achieve improved convergence guarantees, which is one of the avenues that we are using in this project.

This project is part of the StatML CDT, which is a joint CDT between Imperial College London and the University of Oxford. It falls within the EPSRC statistics and applied probability research area. In particular, though this project is heavily linked to optimisation, it remains very statistical in nature. This is because we are interested in using data that inherently has some randomness to it in order to solve the decision-making problem of reinforcement learning.
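
As a rough illustration of what the abstract means by policy mirror descent and by measuring how quickly an algorithm approaches the optimal solution, the sketch below runs one common textbook instance on a small random problem. It is not taken from the project. It assumes a tabular discounted MDP with exact policy evaluation and uses the KL divergence as the mirror map, for which the update has the closed form pi_{k+1}(a|s) proportional to pi_k(a|s) * exp(eta * Q^{pi_k}(s, a)). The function names, the random MDP, the step size and the iteration count are all hypothetical choices made for the example.

```python
import numpy as np


def random_mdp(n_states, n_actions, seed=0):
    """Hypothetical toy MDP: random transitions P[s, a, s'] and rewards R[s, a] in [0, 1)."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)  # each P[s, a, :] is a probability distribution
    R = rng.random((n_states, n_actions))
    return P, R


def evaluate_policy(P, R, policy, gamma):
    """Exact policy evaluation: solve the Bellman equations for V, then form Q."""
    n_states = R.shape[0]
    P_pi = np.einsum("sa,sat->st", policy, P)  # state-to-state transitions under the policy
    r_pi = np.sum(policy * R, axis=1)          # expected one-step reward under the policy
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = R + gamma * P @ V
    return V, Q


def optimal_values(P, R, gamma, n_sweeps=2000):
    """Value iteration, used here only as a reference for the optimal value function."""
    V = np.zeros(R.shape[0])
    for _ in range(n_sweeps):
        V = np.max(R + gamma * P @ V, axis=1)
    return V


def policy_mirror_descent(P, R, gamma, step_size, n_iters):
    """Tabular policy mirror descent with a KL mirror map (assumed setting)."""
    n_states, n_actions = R.shape
    # Work with log-probabilities so that very small probabilities do not underflow.
    log_policy = np.full((n_states, n_actions), -np.log(n_actions))
    values = []
    for _ in range(n_iters):
        V, Q = evaluate_policy(P, R, np.exp(log_policy), gamma)
        values.append(V)  # value of the policy before this iteration's update
        # KL mirror step: add step_size * Q to the log-policy, then renormalise per state.
        log_policy = log_policy + step_size * Q
        log_policy -= np.logaddexp.reduce(log_policy, axis=1, keepdims=True)
    return np.exp(log_policy), np.array(values)


P, R = random_mdp(n_states=5, n_actions=3)
gamma = 0.9
policy, values = policy_mirror_descent(P, R, gamma, step_size=1.0, n_iters=200)
V_star = optimal_values(P, R, gamma)

# Suboptimality gap at each iteration: the worst-case distance to the optimal value.
gaps = np.max(V_star - values, axis=1)
for k in (0, 10, 50, 199):
    print(f"iteration {k:3d}: gap = {gaps[k]:.6f}")
```

The printed gaps are the quantity that a convergence bound controls: a bound states how large the gap can still be after k iterations, and an improved analysis proves a smaller upper bound for the same algorithm. The project itself concerns the theory, and nothing in this sketch should be read as its method or its results.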
