Statistical Methods in Offline Reinforcement Learning
Basic Information
- Grant number: EP/W014971/1
- Principal investigator:
- Amount: $507.6K
- Host institution:
- Host institution country: United Kingdom
- Project type: Research Grant
- Fiscal year: 2022
- Funding country: United Kingdom
- Duration: 2022 to (no data)
- Project status: Ongoing
- Source:
- Keywords:
Project Abstract
Reinforcement learning (RL) is concerned with how intelligent agents take actions in a given environment to learn an optimal policy that maximises the cumulative reward they receive. It has arguably been one of the most vibrant research frontiers in machine learning over the last few years. According to Google Scholar, over 40K scientific articles containing the phrase "reinforcement learning" were published in 2020. Over 100 papers on RL were accepted for presentation at ICML 2020 (a premier machine learning conference), accounting for more than 10% of all accepted papers. Significant progress has been made in solving challenging problems across various domains using RL, including games, robotics, healthcare, bidding and automated driving. Nevertheless, statistics as a field, as opposed to computer science, has only recently begun to engage with RL in depth and in breadth. The proposed research will develop statistical learning methodologies to address several key issues in offline RL. Our objective is to propose RL algorithms that utilise previously collected data, without additional online data collection. The proposed research is primarily motivated by applications in healthcare. Most existing state-of-the-art RL algorithms were designed for online settings (e.g., video games); how well they generalise to healthcare applications remains unknown. We also remark that our solutions will be transferable to other fields (e.g., robotics). A fundamental question the proposed research will consider is offline policy optimisation, where the objective is to learn an optimal policy that maximises the long-term outcome based on an offline dataset. Solving this question faces at least two major challenges. First, in contrast to online settings where data are easy to collect or simulate, the number of observations in many offline applications (e.g., healthcare) is limited.
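As a purely illustrative sketch (not the proposal's method), offline policy optimisation can be demonstrated with fitted Q-iteration on a fixed batch of logged transitions; the two-state/two-action MDP, dataset size, and all names below are hypothetical assumptions:

```python
import numpy as np

# Toy illustration of offline policy optimisation: fitted Q-iteration on a
# fixed batch of logged transitions, with no further environment interaction.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 2, 2, 0.9

def step(state, action):
    """Deterministic toy dynamics: action 1 in state 0 pays reward 1 and stays put."""
    if state == 0 and action == 1:
        return 1.0, 0
    return 0.0, 1 - state

# The offline dataset: transitions logged under a uniform behaviour policy.
dataset = []
for _ in range(200):
    s = int(rng.integers(n_states))
    a = int(rng.integers(n_actions))
    r, s_next = step(s, a)
    dataset.append((s, a, r, s_next))

# Fitted Q-iteration: repeatedly regress Bellman targets r + gamma * max_a' Q
# onto each (state, action) cell, using only the logged data.
Q = np.zeros((n_states, n_actions))
for _ in range(100):
    total = np.zeros_like(Q)
    count = np.zeros_like(Q)
    for s, a, r, s_next in dataset:
        total[s, a] += r + gamma * Q[s_next].max()
        count[s, a] += 1
    Q = total / np.maximum(count, 1)

policy = Q.argmax(axis=1)  # greedy policy derived from the learned Q-function
print(policy, Q)
```

In this toy MDP the optimal policy takes action 1 in state 0 (value 1/(1-0.9) = 10); the sketch recovers it from logged data alone, which is the offline setting the proposal targets with far scarcer healthcare data.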
With such limited data, it is critical to develop RL algorithms that are statistically efficient. The proposed research will devise "value enhancement" methods that are generally applicable to state-of-the-art RL algorithms to improve their statistical efficiency. For a given initial policy computed by existing algorithms, we aim to output a new policy whose expected return converges at a faster rate, achieving the desired "value enhancement" property. Second, many offline datasets are created by aggregating many heterogeneous data sources. This is typically the case in healthcare, where the data trajectories collected from different patients might not share a common distribution function. We will study existing transfer learning methods in RL and, based on our expertise in statistics, develop new approaches designed for healthcare applications. Another question the proposed research will consider is off-policy evaluation (OPE). OPE aims to learn a target policy's expected return (value) from a pre-collected dataset generated by a different policy. It is critical in applications such as healthcare and automated driving, where new policies need to be evaluated offline before online validation. A common assumption made in most existing work is that of no unmeasured confounding. However, this assumption is not testable from the data, and it can be violated in observational datasets generated from healthcare applications. Moreover, given the limited sample size, many offline applications will benefit from a confidence interval (CI) that quantifies the uncertainty of the value estimator. The proposed research is concerned with constructing a CI for a target policy's value in the presence of latent confounders. In addition, in a variety of applications, the outcome distribution is skewed and heavy-tailed, and criteria such as quantiles are more sensible than the mean.
We will develop methodologies to learn the quantile curve of the return under a target policy and construct its associated confidence band.
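The OPE, confidence-interval, and quantile ideas can be sketched in a minimal, hypothetical single-stage (bandit-style) example; the data-generating setup and all names are illustrative assumptions, and the proposal's sequential, possibly confounded setting is considerably more general:

```python
import numpy as np

# Hypothetical sketch of off-policy evaluation: an importance-sampling (IS)
# value estimate of a target policy, a percentile-bootstrap confidence
# interval, and a weighted quantile of the return distribution.
rng = np.random.default_rng(1)
n = 5000

# Logged data from a uniform behaviour policy over actions {0, 1};
# the reward is N(1, 1) under action 1 and N(0, 1) under action 0.
actions = rng.integers(0, 2, size=n)
rewards = rng.normal(loc=np.where(actions == 1, 1.0, 0.0), scale=1.0)

# Target policy picks action 1 with probability 0.9; IS weights reweight
# the logged rewards as if generated by the target policy (true value: 0.9).
weights = np.where(actions == 1, 0.9, 0.1) / 0.5
value_hat = float(np.mean(weights * rewards))

# Percentile bootstrap CI for the value, resampling the weighted rewards.
wr = weights * rewards
boot = np.array([wr[rng.integers(0, n, size=n)].mean() for _ in range(1000)])
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

# Weighted median of the return under the target policy: sort rewards and
# find where the normalised cumulative IS weight first reaches 1/2.
order = np.argsort(rewards)
cum = np.cumsum(weights[order]) / weights.sum()
median_hat = float(rewards[order][np.searchsorted(cum, 0.5)])
print(value_hat, (ci_low, ci_high), median_hat)
```

The weighted-quantile step is the single-stage analogue of the quantile curve mentioned above: instead of the mean return, it estimates a quantile of the return distribution under the target policy, which is more robust when outcomes are skewed or heavy-tailed.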
Project Outcomes
Journal articles (9)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
A Multiagent Reinforcement Learning Framework for Off-Policy Evaluation in Two-Sided Markets
- DOI: 10.1214/22-aoas1700
- Published: 2023-12-01
- Journal:
- Impact factor: 1.8
- Authors: Shi, Chengchun; Wan, Runzhe; Song, Rui
- Corresponding author: Song, Rui
Statistically Efficient Advantage Learning for Offline Reinforcement Learning in Infinite Horizons
- DOI: 10.1080/01621459.2022.2106868
- Published: 2022-02
- Journal:
- Impact factor: 3.7
- Authors: C. Shi; S. Luo; Hongtu Zhu; R. Song
- Corresponding author: C. Shi; S. Luo; Hongtu Zhu; R. Song
Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework
- DOI: 10.1080/01621459.2022.2027776
- Published: 2020-02
- Journal:
- Impact factor: 3.7
- Authors: C. Shi; Xiaoyu Wang; S. Luo; Hongtu Zhu; Jieping Ye; R. Song
- Corresponding author: C. Shi; Xiaoyu Wang; S. Luo; Hongtu Zhu; Jieping Ye; R. Song
Testing Directed Acyclic Graph via Structural, Supervised and Generative Adversarial Learning
- DOI: 10.1080/01621459.2023.2220169
- Published: 2021-06
- Journal:
- Impact factor: 0
- Authors: C. Shi; Yunzhe Zhou; Lexin Li
- Corresponding author: C. Shi; Yunzhe Zhou; Lexin Li
Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process
- DOI: 10.1080/01621459.2022.2110878
- Published: 2022
- Journal:
- Impact factor: 3.7
- Authors: Shi C
- Corresponding author: Shi C
Other Publications by Chengchun Shi
Changes of dissolved organic matter following salinity invasion in different seasons in a nitrogen rich tidal reach
- DOI: 10.1016/j.scitotenv.2023.163251
- Published: 2023
- Journal:
- Impact factor: 9.8
- Authors: Rongrong Xie; Jiabin Qi; Chengchun Shi; Peng Zhang; Rulin Wu; Jiabing Li; Joanna J. Waniek
- Corresponding author: Joanna J. Waniek
Elucidating the links between N2O dynamics and changes in microbial communities following saltwater intrusions
- DOI: 10.1016/j.envres.2023.118021
- Published: 2024-03-15
- Journal:
- Impact factor: 7.7
- Authors: Rongrong Xie; Laichang Lin; Chengchun Shi; Peng Zhang; Peiyuan Rao; Jiabing Li; Dandan Izabel-Shen
- Corresponding author: Dandan Izabel-Shen
Changes in perturbation-correlation moving-window two-dimensional correlation spectroscopy of dissolved organic matter induced by dam regulation in a river-type reservoir
- DOI: 10.1016/j.ecoenv.2025.118464
- Published: 2025-07-15
- Journal:
- Impact factor: 6.1
- Authors: Xiaodan Ma; Yujuan Ma; Jiabin Qi; Jiabing Li; Jin Chen; Jihui Liu; Lili Chen; Chengchun Shi; Rongrong Xie
- Corresponding author: Rongrong Xie
Optimized SVR model for predicting dissolved oxygen levels using wavelet denoising and variable reduction: Taking the Minjiang River estuary as an example
- DOI: 10.1016/j.ecoinf.2025.103007
- Published: 2025-05-01
- Journal:
- Impact factor: 7.3
- Authors: Peng Zhang; Xinyang Liu; Huiru Zhang; Chengchun Shi; Gangfu Song; Lei Tang; Ruihua Li
- Corresponding author: Ruihua Li
simplexreg: An R Package for Regression Analysis of Proportional Data Using the Simplex Distribution
- DOI:
- Published: 2016
- Journal:
- Impact factor: 5.8
- Authors: Peng Zhang; Zhenguo Qiu; Chengchun Shi
- Corresponding author: Chengchun Shi
Similar NSFC Grants
Computational Methods for Analyzing Toponome Data
- Grant number: 60601030
- Year approved: 2006
- Amount: ¥170K
- Project type: Young Scientists Fund
Similar Overseas Grants
Impact of Urban Environmental Factors on Momentary Subjective Wellbeing (SWB) using Smartphone-Based Experience Sampling Methods
- Grant number: 2750689
- Fiscal year: 2025
- Amount: $507.6K
- Project type: Studentship
Developing behavioural methods to assess pain in horses
- Grant number: 2686844
- Fiscal year: 2025
- Amount: $507.6K
- Project type: Studentship
Population genomic methods for modelling bacterial pathogen evolution
- Grant number: DE240100316
- Fiscal year: 2024
- Amount: $507.6K
- Project type: Discovery Early Career Researcher Award
Development and Translation Mass Spectrometry Methods to Determine BioMarkers for Parkinson's Disease and Comorbidities
- Grant number: 2907463
- Fiscal year: 2024
- Amount: $507.6K
- Project type: Studentship
Non invasive methods to accelerate the development of injectable therapeutic depots
- Grant number: EP/Z532976/1
- Fiscal year: 2024
- Amount: $507.6K
- Project type: Research Grant
Spectral embedding methods and subsequent inference tasks on dynamic multiplex graphs
- Grant number: EP/Y002113/1
- Fiscal year: 2024
- Amount: $507.6K
- Project type: Research Grant
CAREER: Nonlinear Dynamics of Exciton-Polarons in Two-Dimensional Metal Halides Probed by Quantum-Optical Methods
- Grant number: 2338663
- Fiscal year: 2024
- Amount: $507.6K
- Project type: Continuing Grant
Conference: North American High Order Methods Con (NAHOMCon)
- Grant number: 2333724
- Fiscal year: 2024
- Amount: $507.6K
- Project type: Standard Grant
REU Site: Computational Methods with applications in Materials Science
- Grant number: 2348712
- Fiscal year: 2024
- Amount: $507.6K
- Project type: Standard Grant
CAREER: New methods in curve counting
- Grant number: 2422291
- Fiscal year: 2024
- Amount: $507.6K
- Project type: Continuing Grant