Statistical Methods in Offline Reinforcement Learning
Basic Information
- Grant number: EP/W014971/1
- Principal investigator:
- Amount: $507.6K
- Host institution:
- Host institution country: United Kingdom
- Project type: Research Grant
- Fiscal year: 2022
- Funding country: United Kingdom
- Duration: 2022 to (no data)
- Project status: Ongoing
- Source:
- Keywords:
Project Abstract
Reinforcement learning (RL) is concerned with how intelligent agents take actions in a given environment to learn an optimal policy that maximises the cumulative reward they receive. It has arguably been one of the most vibrant research frontiers in machine learning over the last few years. According to Google Scholar, over 40K scientific articles containing the phrase "reinforcement learning" were published in 2020. Over 100 papers on RL were accepted for presentation at ICML 2020 (a premier machine learning conference), accounting for more than 10% of all accepted papers. Significant progress has been made in solving challenging problems across various domains using RL, including games, robotics, healthcare, bidding and automated driving. Nevertheless, statistics as a field, as opposed to computer science, has only recently begun to engage with RL in depth and in breadth. The proposed research will develop statistical learning methodologies to address several key issues in offline RL. Our objective is to propose RL algorithms that utilise previously collected data, without additional online data collection. The proposed research is primarily motivated by applications in healthcare. Most existing state-of-the-art RL algorithms were designed for online settings (e.g., video games); how well they generalise to healthcare applications remains unknown. We also remark that our solutions will be transferable to other fields (e.g., robotics). A fundamental question the proposed research will consider is offline policy optimisation, where the objective is to learn an optimal policy that maximises the long-term outcome based on an offline dataset. Solving this question faces at least two major challenges. First, in contrast to online settings where data are easy to collect or simulate, the number of observations in many offline applications (e.g., healthcare) is limited.
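As a purely illustrative sketch (not the proposal's method), offline policy optimisation can be demonstrated with fitted Q-iteration on a fixed batch of logged transitions; the two-state/two-action MDP, dataset size, and all names below are hypothetical assumptions:

```python
import numpy as np

# Toy illustration of offline policy optimisation: fitted Q-iteration on a
# fixed batch of logged transitions, with no further environment interaction.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 2, 2, 0.9

def step(state, action):
    """Deterministic toy dynamics: action 1 in state 0 pays reward 1 and stays put."""
    if state == 0 and action == 1:
        return 1.0, 0
    return 0.0, 1 - state

# The offline dataset: transitions logged under a uniform behaviour policy.
dataset = []
for _ in range(200):
    s = int(rng.integers(n_states))
    a = int(rng.integers(n_actions))
    r, s_next = step(s, a)
    dataset.append((s, a, r, s_next))

# Fitted Q-iteration: repeatedly regress Bellman targets r + gamma * max_a' Q
# onto each (state, action) cell, using only the logged data.
Q = np.zeros((n_states, n_actions))
for _ in range(100):
    total = np.zeros_like(Q)
    count = np.zeros_like(Q)
    for s, a, r, s_next in dataset:
        total[s, a] += r + gamma * Q[s_next].max()
        count[s, a] += 1
    Q = total / np.maximum(count, 1)

policy = Q.argmax(axis=1)  # greedy policy derived from the learned Q-function
print(policy, Q)
```

In this toy MDP the optimal policy takes action 1 in state 0 (value 1/(1-0.9) = 10); the sketch recovers it from logged data alone, which is the offline setting the proposal targets with far scarcer healthcare data.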
With such limited data, it is critical to develop RL algorithms that are statistically efficient. The proposed research will devise "value enhancement" methods that are generally applicable to state-of-the-art RL algorithms to improve their statistical efficiency. For a given initial policy computed by existing algorithms, we aim to output a new policy whose expected return converges at a faster rate, achieving the desired "value enhancement" property. Second, many offline datasets are created by aggregating many heterogeneous data sources. This is typically the case in healthcare, where the data trajectories collected from different patients might not share a common distribution function. We will study existing transfer learning methods in RL and, based on our expertise in statistics, develop new approaches designed for healthcare applications. Another question the proposed research will consider is off-policy evaluation (OPE). OPE aims to learn a target policy's expected return (value) from a pre-collected dataset generated by a different policy. It is critical in applications such as healthcare and automated driving, where new policies need to be evaluated offline before online validation. A common assumption made in most existing work is that of no unmeasured confounding. However, this assumption is not testable from the data, and it can be violated in observational datasets generated from healthcare applications. Moreover, given the limited sample size, many offline applications will benefit from a confidence interval (CI) that quantifies the uncertainty of the value estimator. The proposed research is concerned with constructing a CI for a target policy's value in the presence of latent confounders. In addition, in a variety of applications, the outcome distribution is skewed and heavy-tailed, and criteria such as quantiles are more sensible than the mean.
We will develop methodologies to learn the quantile curve of the return under a target policy and construct its associated confidence band.
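The OPE, confidence-interval, and quantile ideas can be sketched in a minimal, hypothetical single-stage (bandit-style) example; the data-generating setup and all names are illustrative assumptions, and the proposal's sequential, possibly confounded setting is considerably more general:

```python
import numpy as np

# Hypothetical sketch of off-policy evaluation: an importance-sampling (IS)
# value estimate of a target policy, a percentile-bootstrap confidence
# interval, and a weighted quantile of the return distribution.
rng = np.random.default_rng(1)
n = 5000

# Logged data from a uniform behaviour policy over actions {0, 1};
# the reward is N(1, 1) under action 1 and N(0, 1) under action 0.
actions = rng.integers(0, 2, size=n)
rewards = rng.normal(loc=np.where(actions == 1, 1.0, 0.0), scale=1.0)

# Target policy picks action 1 with probability 0.9; IS weights reweight
# the logged rewards as if generated by the target policy (true value: 0.9).
weights = np.where(actions == 1, 0.9, 0.1) / 0.5
value_hat = float(np.mean(weights * rewards))

# Percentile bootstrap CI for the value, resampling the weighted rewards.
wr = weights * rewards
boot = np.array([wr[rng.integers(0, n, size=n)].mean() for _ in range(1000)])
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

# Weighted median of the return under the target policy: sort rewards and
# find where the normalised cumulative IS weight first reaches 1/2.
order = np.argsort(rewards)
cum = np.cumsum(weights[order]) / weights.sum()
median_hat = float(rewards[order][np.searchsorted(cum, 0.5)])
print(value_hat, (ci_low, ci_high), median_hat)
```

The weighted-quantile step is the single-stage analogue of the quantile curve mentioned above: instead of the mean return, it estimates a quantile of the return distribution under the target policy, which is more robust when outcomes are skewed or heavy-tailed.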
Project Outcomes
Journal articles (9)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
A Multiagent Reinforcement Learning Framework for Off-Policy Evaluation in Two-Sided Markets
- DOI: 10.1214/22-aoas1700
- Published: 2023-12-01
- Journal:
- Impact factor: 1.8
- Authors: Shi, Chengchun; Wan, Runzhe; Song, Rui
- Corresponding author: Song, Rui
Statistically Efficient Advantage Learning for Offline Reinforcement Learning in Infinite Horizons
- DOI: 10.1080/01621459.2022.2106868
- Published: 2022-02
- Journal:
- Impact factor: 3.7
- Authors: C. Shi; S. Luo; Hongtu Zhu; R. Song
- Corresponding author: C. Shi; S. Luo; Hongtu Zhu; R. Song
Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework
- DOI: 10.1080/01621459.2022.2027776
- Published: 2020-02
- Journal:
- Impact factor: 3.7
- Authors: C. Shi; Xiaoyu Wang; S. Luo; Hongtu Zhu; Jieping Ye; R. Song
- Corresponding author: C. Shi; Xiaoyu Wang; S. Luo; Hongtu Zhu; Jieping Ye; R. Song
Testing Directed Acyclic Graph via Structural, Supervised and Generative Adversarial Learning
- DOI: 10.1080/01621459.2023.2220169
- Published: 2021-06
- Journal:
- Impact factor: 0
- Authors: C. Shi; Yunzhe Zhou; Lexin Li
- Corresponding author: C. Shi; Yunzhe Zhou; Lexin Li
Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process
- DOI: 10.1080/01621459.2022.2110878
- Published: 2022
- Journal:
- Impact factor: 3.7
- Authors: Shi C
- Corresponding author: Shi C
Other Publications by Chengchun Shi
Changes of dissolved organic matter following salinity invasion in different seasons in a nitrogen rich tidal reach
- DOI: 10.1016/j.scitotenv.2023.163251
- Published: 2023
- Journal:
- Impact factor: 9.8
- Authors: Rongrong Xie; Jiabin Qi; Chengchun Shi; Peng Zhang; Rulin Wu; Jiabing Li; Joanna J. Waniek
- Corresponding author: Joanna J. Waniek
Elucidating the links between N2O dynamics and changes in microbial communities following saltwater intrusions
- DOI: 10.1016/j.envres.2023.118021
- Published: 2024-03-15
- Journal:
- Impact factor: 7.7
- Authors: Rongrong Xie; Laichang Lin; Chengchun Shi; Peng Zhang; Peiyuan Rao; Jiabing Li; Dandan Izabel-Shen
- Corresponding author: Dandan Izabel-Shen
Changes in perturbation-correlation moving-window two-dimensional correlation spectroscopy of dissolved organic matter induced by dam regulation in a river-type reservoir
- DOI: 10.1016/j.ecoenv.2025.118464
- Published: 2025-07-15
- Journal:
- Impact factor: 6.1
- Authors: Xiaodan Ma; Yujuan Ma; Jiabin Qi; Jiabing Li; Jin Chen; Jihui Liu; Lili Chen; Chengchun Shi; Rongrong Xie
- Corresponding author: Rongrong Xie
Optimized SVR model for predicting dissolved oxygen levels using wavelet denoising and variable reduction: Taking the Minjiang River estuary as an example
- DOI: 10.1016/j.ecoinf.2025.103007
- Published: 2025-05-01
- Journal:
- Impact factor: 7.3
- Authors: Peng Zhang; Xinyang Liu; Huiru Zhang; Chengchun Shi; Gangfu Song; Lei Tang; Ruihua Li
- Corresponding author: Ruihua Li
simplexreg: An R Package for Regression Analysis of Proportional Data Using the Simplex Distribution
- DOI:
- Published: 2016
- Journal:
- Impact factor: 5.8
- Authors: Peng Zhang; Zhenguo Qiu; Chengchun Shi
- Corresponding author: Chengchun Shi
Similar NSFC Grants
Computational Methods for Analyzing Toponome Data
- Grant number: 60601030
- Year approved: 2006
- Amount: ¥170K
- Project type: Young Scientists Fund
Similar Overseas Grants
Impact of Urban Environmental Factors on Momentary Subjective Wellbeing (SWB) using Smartphone-Based Experience Sampling Methods
- Grant number: 2750689
- Fiscal year: 2025
- Amount: $507.6K
- Project type: Studentship
Developing behavioural methods to assess pain in horses
- Grant number: 2686844
- Fiscal year: 2025
- Amount: $507.6K
- Project type: Studentship
Population genomic methods for modelling bacterial pathogen evolution
- Grant number: DE240100316
- Fiscal year: 2024
- Amount: $507.6K
- Project type: Discovery Early Career Researcher Award
Development and Translation Mass Spectrometry Methods to Determine BioMarkers for Parkinson's Disease and Comorbidities
- Grant number: 2907463
- Fiscal year: 2024
- Amount: $507.6K
- Project type: Studentship
Non invasive methods to accelerate the development of injectable therapeutic depots
- Grant number: EP/Z532976/1
- Fiscal year: 2024
- Amount: $507.6K
- Project type: Research Grant
Spectral embedding methods and subsequent inference tasks on dynamic multiplex graphs
- Grant number: EP/Y002113/1
- Fiscal year: 2024
- Amount: $507.6K
- Project type: Research Grant
CAREER: Nonlinear Dynamics of Exciton-Polarons in Two-Dimensional Metal Halides Probed by Quantum-Optical Methods
- Grant number: 2338663
- Fiscal year: 2024
- Amount: $507.6K
- Project type: Continuing Grant
Conference: North American High Order Methods Con (NAHOMCon)
- Grant number: 2333724
- Fiscal year: 2024
- Amount: $507.6K
- Project type: Standard Grant
REU Site: Computational Methods with applications in Materials Science
- Grant number: 2348712
- Fiscal year: 2024
- Amount: $507.6K
- Project type: Standard Grant
CAREER: New methods in curve counting
- Grant number: 2422291
- Fiscal year: 2024
- Amount: $507.6K
- Project type: Continuing Grant