Research on Continuous-Space Reinforcement Learning Methods Based on Approximate Multi-Step Models

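Basic Information

Approval number: 61702055
Project category: Young Scientists Fund
Funding: CNY 250,000 (25.0万)
Principal investigator: 钟珊
Host institution: Changshu Institute of Technology (常熟理工学院)
Discipline: F06. Artificial Intelligence
Completion year: 2020
Approval year: 2017
Project status: Completed
Duration: 2018-01-01 to 2020-12-31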

Project participants: 龚声蓉; 王朝晖; 董瑞志; 姚宇峰; 董虎胜; 李永刚; 燕然; 戴兴华
Keywords: model learning; linear approximation; planning; reinforcement learning; continuous space

Project Abstract

Approximate reinforcement learning methods generalize well and save computational resources, which makes them especially suitable for finding optimal policies in continuous spaces. However, their low sample efficiency and slow convergence limit their use in real-time problems. Approximate reinforcement learning based on model learning can speed up convergence through model learning and planning, thereby improving both sample efficiency and convergence rate, and it has become one of the research hotspots in reinforcement learning.

This project focuses on the main problems of existing model-learning-based methods: low planning efficiency, slow policy convergence, and poor real-time performance. To address them, we propose a reinforcement learning method for continuous spaces based on an approximate multi-step model and a new policy update rule. The primary innovations are: 1) to improve planning efficiency, an approximate multi-step model is constructed and used for planning; the value function error introduced by planning with the approximate model is derived, and the error formula is analyzed to guide parameter settings and improve stability; 2) a new policy update rule based on the advantage function is designed so that the policy converges quickly; 3) an approximate reinforcement learning algorithm based on the approximate multi-step model and the new policy update rule is constructed, and its convergence is analyzed theoretically; 4) building on the proposed algorithm, a parallel approximate reinforcement learning framework is constructed and applied to a practical building energy-saving problem.
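
The idea of planning with an approximate multi-step model can be illustrated with a minimal sketch. It is not the project's algorithm: it assumes a fixed policy, linear features, and a linear one-step model in the style of linear Dyna, and it builds the n-step model simply by composing the one-step model. All function names (`fit_one_step_model`, `compose_multi_step_model`, `plan_with_model`) and the toy three-state chain are hypothetical.

```python
import numpy as np


def fit_one_step_model(phi, phi_next, rewards, ridge=1e-6):
    """Ridge least-squares fit of phi_next ~ F @ phi and reward ~ b @ phi."""
    d = phi.shape[1]
    gram = phi.T @ phi + ridge * np.eye(d)
    F = np.linalg.solve(gram, phi.T @ phi_next).T
    b = np.linalg.solve(gram, phi.T @ rewards)
    return F, b


def compose_multi_step_model(F, b, gamma, n):
    """Compose the one-step model n times (an assumption of this sketch).

    Returns F_n = F^n and b_n = sum_{k<n} gamma^k (F^k)^T b, so that the
    discounted n-step reward from features phi is approximately b_n @ phi.
    """
    d = F.shape[0]
    F_n = np.eye(d)
    b_n = np.zeros(d)
    for k in range(n):
        b_n += (gamma ** k) * (F_n.T @ b)  # F_n equals F^k at this point
        F_n = F @ F_n
    return F_n, b_n


def plan_with_model(theta, F_n, b_n, gamma, n, start_features,
                    alpha=0.1, sweeps=200):
    """Update a linear value function V(s) = theta @ phi(s) using only the model."""
    theta = theta.copy()
    for _ in range(sweeps):
        for phi in start_features:
            # n-step backup target predicted by the approximate model
            target = b_n @ phi + (gamma ** n) * (theta @ (F_n @ phi))
            theta += alpha * (target - theta @ phi) * phi
    return theta


# Toy chain 0 -> 1 -> 2 -> 2 with one-hot features; reward 1 on leaving state 1.
features = np.eye(3)
next_features = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
rewards = np.array([0.0, 1.0, 0.0])

F, b = fit_one_step_model(features, next_features, rewards)
F2, b2 = compose_multi_step_model(F, b, gamma=0.9, n=2)
theta = plan_with_model(np.zeros(3), F2, b2, gamma=0.9, n=2,
                        start_features=features)
print(theta)
```

A single backup through the two-step model propagates reward information two transitions at a time, which is the intuition behind the planning-efficiency claim. With a learned model that is only approximate, composing it also compounds its error, which is why the abstract pairs the multi-step model with an analysis of the resulting value function error.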

Final Summary

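Reinforcement learning methods based on approximate models can make full use of sample data to speed up the search for an optimal policy, and they are especially suitable for continuous spaces. However, model accuracy is hard to guarantee, and planning with the model may fail to reach the optimal solution. To address these problems, this project proposed a series of continuous-space reinforcement learning methods that approximate single-step and multi-step models and use model planning to accelerate convergence. The main innovations are: 1) an approximate multi-step model is built from individual samples and sample trajectories, and the single-step and multi-step models are used jointly in planning to improve planning efficiency; an approximate reinforcement learning algorithm based on the approximate multi-step model and a policy update rule is constructed, and its convergence is analyzed theoretically; 2) a policy learning mechanism based on model acceleration and experience replay is established, and a policy update rule based on the advantage function is designed to achieve fast policy convergence; 3) by partitioning the state space and the action space into segments, a two-layer piecewise model is built that describes continuous state and action spaces more precisely and yields a more accurate model; 4) to better capture the uncertainty arising in the model, a model based on Gaussian functions is established, together with a method for solving its parameters, so that model uncertainty can be characterized; 5) to further improve sample utilization, the least-squares algorithm replaces the temporal-difference algorithm within the Dyna framework to solve for the parameters of the value function, the policy, and the model, and eligibility traces are added to speed up the overall algorithm; 6) an end-to-end deep network model for autonomous driving is designed, which combines historical decision data with the current perception image to learn a mapping from perception data to decision behavior; 7) building on the proposed algorithms, a parallel approximate reinforcement learning framework is constructed and applied to cleaning-robot, autonomous-driving, inverted-pendulum, and pole-balancing problems, as well as to a practical building energy-saving problem.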
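
Item 5 above, least squares with eligibility traces in place of temporal-difference updates, can be illustrated with a minimal sketch of textbook LSTD(λ) for policy evaluation. This is an assumption about the general technique, not the project's implementation: it covers only the value function (the summary also mentions policy and model parameters), and the function name `lstd_lambda`, the ridge term, and the toy two-state episode are hypothetical.

```python
import numpy as np


def lstd_lambda(episodes, n_features, gamma=0.9, lam=0.5, ridge=1e-6):
    """Solve A @ theta = b, with A and b accumulated along eligibility traces.

    Each episode is a list of (phi, reward, phi_next) tuples; phi_next is
    the zero vector when the next state is terminal.
    """
    A = ridge * np.eye(n_features)  # small ridge keeps A invertible
    b = np.zeros(n_features)
    for episode in episodes:
        z = np.zeros(n_features)  # eligibility trace, reset per episode
        for phi, reward, phi_next in episode:
            z = gamma * lam * z + phi
            A += np.outer(z, phi - gamma * phi_next)
            b += z * reward
    return np.linalg.solve(A, b)


# Toy episode: state 0 -> state 1 -> terminal, reward 1 on the last step.
e0, e1 = np.eye(2)
terminal = np.zeros(2)
episode = [(e0, 0.0, e1), (e1, 1.0, terminal)]

theta = lstd_lambda([episode], n_features=2)
print(theta)
```

Unlike an incremental temporal-difference update, the least-squares solve uses every stored transition at once and needs no step size, which is the sample-utilization argument made in the summary. The cost is a linear solve in the number of features.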