Scaling Genetic Programming to Complex Reinforcement Learning Tasks

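Basic information

Grant number:
RGPIN-2020-04438
Principal investigator:
Heywood, Malcolm
Amount:
approx. $21,100
Host institution:
Dalhousie University
Host institution country:
Canada
Program type:
Discovery Grants Program - Individual
Fiscal year:
2022
Funding country:
Canada
Start and end dates:
2022-01-01 to 2023-12-31
Project status:
Completed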

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=750269
Keywords:
Scaling Genetic Programming Complex Reinforcement

Project abstract

Reinforcement learning (RL) is a class of tasks in which an agent interacts with an environment to maximize its long-term reward. Deep learning has recently made considerable progress on RL under high-dimensional state and action spaces. As a result, raw sensor data such as video can be used directly, rather than first having to develop a suite of appropriate input features. Many applications have benefited from this development, from algorithms that play Go and chess better than humans to new levels of human-competitive performance in robot control tasks. However, one drawback of such approaches is that they generally produce complex black-box solutions that require hardware support to deploy, even after training.

We recently proposed an alternative approach for scaling RL to high-dimensional state spaces using genetic programming. In this approach, teams of programs self-organize into graphs, called Tangled Program Graphs (TPG). Our initial benchmarking on high-dimensional RL tasks demonstrates that solutions of equivalent quality can be discovered, but with multiple orders of magnitude lower complexity. The proposed research program will greatly expand the TPG approach to efficiently discover solutions to non-reactive RL tasks requiring multiple simultaneous actions per time step.

The long-term research program is organized around three objectives:
1) Support for the automatic identification of behavioural subgraphs: this provides the basis for task transfer, accelerated training and increased transparency of machine learning solutions.
2) Develop multiple concurrent memory models: this is the basis for scaling TPG to a wide cross-section of non-reactive RL tasks. Without it, it would not be possible to scale to partially observable problems, a class of tasks of widespread impact.
3) Support for describing actions as multi-dimensional spaces: this means that decisions involving multiple real-valued and discrete actions per state can be made simultaneously, a capability that potentially also appears in many applications.

Successful completion of this research program will result in a TPG framework whose solution quality complements that of deep learning. Unlike deep learning, however, TPG constructs solutions by explicitly discovering mechanisms for decomposing the decision-making task. This means that solutions are lightweight, executing in real time without any form of hardware support. The simplicity of solutions will also support insights into attribute support and solution transparency, which is particularly important when attempting to gain knowledge from solutions after training. Success in the proposed research program would demonstrate new models for addressing open-ended questions regarding the application and deployment of RL agents to navigation, motor control and strategic decision making in real-time, partially observable environments.
