
An Adaptive Robust Dynamic Programming Approach for Decision Making under Model Uncertainty

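Basic information

Grant number:
2440945
Principal investigator:
Amount:
--
Host institution:
University of Bath
Host institution country:
United Kingdom
Project category:
Studentship
Fiscal year:
2020
Funding country:
United Kingdom
Duration:
2020 to (no data)
Project status:
Not yet concluded

Source:
https://gtr.ukri.org/projects?ref=studentship-2440945
Keywords:
Adaptive Robust Dynamic Programming Approach

Project abstract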

In many real-world problems an agent must make decisions in an environment that is only partially known. By interacting with the world, the decision-maker obtains more information about the system, which allows for more educated choices in the future. Hence, a common characteristic of these problems is that the decision-maker can choose between decisions that lead to a fairly risk-free, high immediate reward, and riskier decisions which may be worse, but may provide the agent with previously unseen information about their environment. In the field of Reinforcement Learning this dilemma is commonly referred to as the "exploration-exploitation trade-off", and is an area of active research.

A fundamental challenge in understanding the exploration-exploitation trade-off is that one needs to measure the information gain "learned" by the agent, and to understand how this information develops over time. Classically, this can be done in a Bayesian framework. However, the Bayesian framework requires an initial set of beliefs, and in practice these may be imprecise. An alternative approach is to make decisions based on outcomes under worst-case scenarios; however, this approach lacks the ability to account for learning.

In this project we aim to combine the best of both worlds by considering an adaptive (i.e. able to incorporate learning), robust (i.e. accounting for uncertainty in the setup) framework for stochastic control problems featuring model uncertainty. Our starting point is the framework of Bielecki et al. (2017), who considered an adaptive, robust approach to a stochastic control problem related to an investment problem. We will attempt to apply their approach to the Newsvendor problem, a simple stochastic control problem that involves learning. In this problem, an agent (the newsvendor) must choose the number of newspapers to stock for the next period before observing the number of newspapers sold, and is encouraged to learn the distribution of the demand for newspapers whilst minimising the cost due to unused stock or unmet demand. Because the current choice of stock affects future outcomes through the information it yields about the number of sales observed, solving such problems requires understanding how the agent's beliefs will change in the future. We hope to construct approximation arguments based on the theory of Optimal Transport in order to reduce the complexity of the problem. Other possible aims of the project include generalising results that are currently known only in very special settings (e.g. from Y.-T. Chuang, 2019) which precisely quantify the surplus in stock used only for the sake of learning.

The interest in the Newsvendor model is primarily on account of its mathematical tractability and the strong dependence of the information acquired on the decisions made by the agent. We expect the principles to be more widely applicable to many RL examples, and thus to contribute more broadly to future developments in Reinforcement Learning.
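
As an illustration only, the single-period Newsvendor cost described in the abstract can be sketched in a few lines of Python. This is not the project's method: it assumes a known, discrete demand distribution (the case with no model uncertainty), and the function names, the per-unit costs and the uniform demand used here are all hypothetical. The sketch also assumes that sales are censored at the stock level, which is one common way in which the information acquired depends on the decision made.

```python
def newsvendor_cost(order, demand, overage_cost, underage_cost):
    """Cost of one period: unsold copies plus unmet demand (hypothetical unit costs)."""
    leftover = max(order - demand, 0)    # unused stock
    shortfall = max(demand - order, 0)   # unmet demand
    return overage_cost * leftover + underage_cost * shortfall


def observed_sales(order, demand):
    """Assumption: the agent sees sales only, so demand above the stock is censored."""
    return min(order, demand)


def best_order(demand_values, demand_probs, overage_cost, underage_cost):
    """Order quantity minimising expected cost when the demand distribution is known."""
    best_q, best_cost = None, float("inf")
    for q in demand_values:
        expected = sum(
            p * newsvendor_cost(q, d, overage_cost, underage_cost)
            for d, p in zip(demand_values, demand_probs)
        )
        if expected < best_cost:
            best_q, best_cost = q, expected
    return best_q, best_cost


if __name__ == "__main__":
    # Hypothetical example: demand uniform on {0, ..., 4}, shortage three times as costly as surplus.
    values = [0, 1, 2, 3, 4]
    probs = [0.2] * 5
    print(best_order(values, probs, overage_cost=1, underage_cost=3))
    # With an order of 3, a demand of 5 is observed only as 3 sales.
    print(observed_sales(3, 5))
```

In this toy setting the minimising order is 3, the smallest quantity whose cumulative demand probability reaches the ratio 3 / (1 + 3) = 0.75. When the distribution is unknown, ordering more also reveals more about demand, which is the surplus "for the sake of learning" that the abstract refers to.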
