New Algorithms for Markov Decision Processes and Reinforcement Learning

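Basic information

Award number:
2208163
Principal investigator:
Lexing Ying
Amount:
$400,000
Host institution:
Stanford University
Host institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2022
Funding country:
United States
Project period:
2022-09-01 to 2025-08-31
Project status:
Not yet concluded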

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2208163&HistoricalAwards=false
Keywords:
New Algorithms Markov Decision Processes

Project abstract

Markov decision processes and reinforcement learning have had significant recent success in applications, ranging from outperforming humans in Atari games to AlphaFold overshadowing competing methods in predicting protein folding. This success results from several fundamental developments, including deep neural networks providing a powerful mechanism for representing high-dimensional functions, unprecedented computing power provided by graphics processing units and tensor processing units, and the development of novel algorithms for both prediction and control. However, there are still many challenges in applying these recent techniques to mission-critical applications in health, social and economic planning, and defense. This project aims to develop and analyze novel algorithms for Markov decision processes and reinforcement learning with the intention of making these approaches more broadly applicable. Educational impacts include postdoctoral and graduate student training, as well as undergraduate course development centered around machine learning.

This project involves the development of a unified framework for Markov decision processes based on linear programming, where the primal, dual, and primal-dual problems are studied for both the regularized and non-regularized cases. Existing algorithms based on Markov decision processes will then be connected to this unified framework. For the tabular setting, a quasi-Newton type policy gradient algorithm will be developed for general entropic regularizers. For the primal-dual problem, a rapidly converging gradient ascent-descent algorithm based on a strictly convexified formulation with a non-standard preconditioning metric will be developed. The nonlinear approximation setting will be addressed by variational actor-critic algorithms that are stable and converge at least to a local minimum. Finally, to address the double sampling issue, new algorithms based on the borrowing-from-the-future idea will be developed to significantly reduce the bias.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
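The abstract does not define the double sampling issue. As it is usually described in the reinforcement learning literature, squaring a single sampled temporal-difference error overestimates the squared Bellman residual by the variance of the sampled target, and an unbiased estimate needs two independent next-state samples from the same state. The following is a minimal sketch of that bias only, under this reading. All numbers, names, and the two-outcome toy transition are hypothetical, and it does not implement the borrowing-from-the-future algorithms the project proposes.

```python
# Toy setup (all values hypothetical): one state s with value estimate v_s and
# two equally likely next states. Expectations are computed exactly by
# enumerating the outcomes, so no random sampling is involved.
GAMMA = 0.9  # assumed discount factor


def td_errors(v_s, reward, next_values, gamma=GAMMA):
    """TD error r + gamma * V(s') - V(s) for each possible next state."""
    return [reward + gamma * v_next - v_s for v_next in next_values]


def mean(xs):
    return sum(xs) / len(xs)


def true_squared_residual(deltas):
    """(E[delta])^2: the squared Bellman residual one wants to minimize."""
    return mean(deltas) ** 2


def single_sample_estimate(deltas):
    """E[delta^2]: the average result of squaring one sampled TD error."""
    return mean([d * d for d in deltas])


def double_sample_estimate(deltas):
    """E[delta * delta'] for two independent draws of the next state."""
    return mean([d1 * d2 for d1 in deltas for d2 in deltas])


deltas = td_errors(v_s=1.0, reward=0.0, next_values=[0.0, 2.0])
true_value = true_squared_residual(deltas)
single = single_sample_estimate(deltas)
double = double_sample_estimate(deltas)
bias = single - true_value  # equals the variance of the TD error

print(f"true residual: {true_value:.4f}")
print(f"single-sample: {single:.4f} (bias {bias:.4f})")
print(f"double-sample: {double:.4f}")
```

In this toy case the single-sample estimate is far larger than the true squared residual, while the product of two independent TD errors matches it. In practice a second independent transition from the same state is often unavailable, which is why the bias is treated as a problem to be reduced.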
