
Random Policy Search and Stochastic control


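Basic Information

Grant number:
2435718
Principal investigator:
Amount:
--
Host institution:
University of Oxford
Host institution country:
United Kingdom
Project category:
Studentship
Fiscal year:
2020
Funding country:
United Kingdom
Period:
2020 to (no data)
Project status:
Not yet completed

Source:
https://gtr.ukri.org/projects?ref=studentship-2435718
Keywords:
Random Policy Search; Stochastic control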

Project Abstract

The paradigm of model-free reinforcement learning has proved successful in a multitude of automated decision-making tasks. This paradigm assumes limited information about the environment it operates in and tries to learn good policies by interacting with that environment over time. The quality of a policy is judged by its performance on an appropriately chosen objective function. Usually, the learning algorithms rely on estimating a value function, the policy directly, or a mixture of both. A specific challenge is learning policies in environments with continuous action spaces, for which computer scientists have introduced several simulation-based benchmark problems (e.g. MuJoCo). A simple algorithm that produced remarkably good results on these benchmarks is the random policy search algorithm with linear parametrization. With this algorithm, the goal is to find a good policy directly by learning parametrized policies that are linear in the state, instead of searching the entire function space of admissible policies. The fitting is done by approximating the gradient of the value function in the directions of the parameters with a Monte Carlo scheme. Given the gradient, a gradient ascent step is performed. This procedure is repeated until a suitable policy is found, and it only requires a sampling oracle.

In spite of the simplicity of the algorithm, little is known about its theoretical properties. A good starting point for an analysis is stochastic control theory. Authors have already used these tools to establish convergence to the optimal policy in cases where the algorithm is applied to the linear quadratic regulator (LQR). We would like to build on the same approach to establish the behaviour of the algorithm in other control frameworks and to deepen the understanding of the interplay between representation and optimization.
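The loop described above (perturb the parameters of a linear policy, estimate the gradient from sampled rollouts, take an ascent step) can be sketched as follows. This is only a minimal illustration under assumptions that are not part of the abstract: a discrete-time scalar LQR instance with made-up constants, deterministic rollouts, a two-point finite-difference estimator along Gaussian directions, and hypothetical names (`rollout_reward`, `random_policy_search`, `step`, `nu`, `n_dirs`). It is not the specific algorithm or problem studied in the project.

```python
import numpy as np


def rollout_reward(K, A, B, Q, R, x0, horizon):
    """Total reward (negative quadratic cost) of the linear policy u = K x."""
    x = x0.copy()
    total = 0.0
    for _ in range(horizon):
        u = K @ x                          # policy is linear in the state
        total -= x @ Q @ x + u @ R @ u     # accumulate negative stage cost
        x = A @ x + B @ u                  # assumed noiseless linear dynamics
    return float(total)


def random_policy_search(A, B, Q, R, x0, horizon=20, iters=300,
                         n_dirs=8, step=0.002, nu=0.02, seed=0):
    """Gradient ascent on the reward with a Monte Carlo gradient estimate."""
    rng = np.random.default_rng(seed)
    K = np.zeros((B.shape[1], A.shape[0]))  # start from the zero policy
    for _ in range(iters):
        grad = np.zeros_like(K)
        for _ in range(n_dirs):
            delta = rng.standard_normal(K.shape)  # random search direction
            r_plus = rollout_reward(K + nu * delta, A, B, Q, R, x0, horizon)
            r_minus = rollout_reward(K - nu * delta, A, B, Q, R, x0, horizon)
            # two-point finite difference along delta
            grad += (r_plus - r_minus) / (2.0 * nu) * delta
        K = K + step * grad / n_dirs              # gradient ascent step
    return K


# Illustrative scalar system; all constants are assumptions for the example.
A = np.array([[0.9]])
B = np.array([[1.0]])
Q = np.array([[1.0]])
R = np.array([[1.0]])
x0 = np.array([1.0])

K_init = np.zeros((1, 1))
reward_before = rollout_reward(K_init, A, B, Q, R, x0, 20)
K_learned = random_policy_search(A, B, Q, R, x0)
reward_after = rollout_reward(K_learned, A, B, Q, R, x0, 20)
```

The search only ever calls `rollout_reward`, which plays the role of the sampling oracle: the dynamics matrices are used to simulate, never to compute the gradient analytically.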
This research is of interest since it potentially improves algorithm selection for practical problems and gives new insights into deeper aspects of learning.

For the first part, we apply the random policy search algorithm to a time-continuous optimal queueing problem that has a direct application to optimal execution in limit order book markets. For this problem and a specified linear policy on discretized time intervals, we can give a closed-form expression for the value function and its gradient. We hope to find a parameter region in which the algorithm, if initialised within the region, is guaranteed to converge to an optimal time-discrete policy. For that purpose, one could establish a gradient dominance condition for a gradient descent algorithm with the known gradient and then generalize to a situation where the gradient is approximated. Furthermore, since the value function for the optimal time-continuous policy can be calculated as well, we hope to make a statement about discretization errors. Finally, we want to apply the algorithm to real-world limit order book data and extend it to a two-sided queue corresponding to the market maker problem. I work with Roel Oomen from Deutsche Bank on this project.

Further, we would like to better understand the capability of the algorithm to learn an appropriate state representation while simultaneously optimizing. As a first instructive problem, we could consider the LQR case where the appropriate state can almost be written as a combination of basis functions whose coefficients are assumed to have a low-dimensional structure. The optimal result in this framework would establish a model selection property for a proximal gradient descent algorithm. Finally, we hope to investigate the role of state representation in an application of this algorithm in the context of backward stochastic differential equation solvers.

This project falls within the EPSRC Artificial intelligence technologies research area.
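For reference, the gradient dominance condition mentioned in the first part is commonly stated in the form below, together with the standard one-step argument showing why it yields linear convergence of gradient descent with the exact gradient. The notation is assumed here and does not come from the abstract: C is a cost to be minimized over policy parameters \theta, C^* is its optimal value, \lambda > 0 is the dominance constant, and L is a smoothness constant of C on the region considered. The abstract does not say which form of the condition the project would use.

```latex
% Assumed notation: cost C, parameters \theta, optimal value C^*,
% dominance constant \lambda > 0, smoothness constant L.
\begin{align*}
\tfrac{1}{2}\,\|\nabla C(\theta)\|^2
  &\ge \lambda\,\bigl(C(\theta) - C^*\bigr)
  && \text{(gradient dominance)} \\
C(\theta_{k+1})
  &\le C(\theta_k) - \tfrac{1}{2L}\,\|\nabla C(\theta_k)\|^2
  && \text{($L$-smoothness, } \theta_{k+1} = \theta_k - \tfrac{1}{L}\nabla C(\theta_k)\text{)} \\
C(\theta_{k+1}) - C^*
  &\le \Bigl(1 - \tfrac{\lambda}{L}\Bigr)\bigl(C(\theta_k) - C^*\bigr)
  && \text{(combining the two lines above)}
\end{align*}
```

When the gradient is only approximated, as in random policy search, an additional error term would enter the second line; controlling it is the generalization step the abstract refers to.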
