Scalable Reinforcement Learning Methods for Learning in Real-Time with Robots

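Basic Information

Grant number:
RGPIN-2021-02690
Principal investigator:
Mahmood, Ashique
Amount:
$17,500
Host institution:
University of Alberta
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2022
Funding country:
Canada
Period:
2022-01-01 to 2023-12-31
Project status:
Completed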

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=750510
Keywords:
Scalable Reinforcement Learning Methods Real

Project Abstract

Reinforcement learning brings the promise of continually adaptive systems for numerous tasks that humans do well but find physically laborious, such as housekeeping, warehouse fulfillment, and delivery services. Such tasks require a common-sense understanding, on the agent's part, of a dynamically changing physical environment, which is difficult to enumerate and build into a system through hand-engineering. The proposed program aims at developing real-time learning robotic systems that interact with the physical world and adapt in real time. Some of the most promising approaches in reinforcement learning for robotics are based on learning from human-provided demonstration data and simulators. However, approaches reliant on human intervention are neither scalable nor sufficient for developing robotic systems that can adapt their performance in real time to new or changing environments. Our proposed program complements the existing approaches by developing scalable and automatic mechanisms for continually learning robotic systems. All advanced deep reinforcement learning methods for control use expensive learning mechanisms such as those based on experience replay buffers. While such expensive learning mechanisms are more appropriate for training offline or in the cloud, we propose a lightweight onboard learning system that adapts and reacts quickly to changes in real time. Our proposed onboard learning system will be composed of computationally inexpensive and stable policy and representation learning algorithms. We consider the policy to be only the last semi-linear layer of the network, for which gradient updates can be made more stably without using replay buffers. In addition, the onboard system will perform representation learning only through random perturbation of a small portion of the hidden nodes.
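
The onboard learner described above can be illustrated with a small sketch. The abstract does not specify the algorithm, so everything here is an assumption made for illustration: a softmax over a linear map stands in for the "last semi-linear layer", the update is a plain one-sample policy-gradient step with a running-average baseline, and the perturbation adds Gaussian noise to the incoming weights of a few hidden units. All function names, the toy two-action task, and the hyperparameters are hypothetical.

```python
import math
import random


def hidden_features(w_hidden, obs):
    """Hidden layer (the representation): tanh of a linear map of the observation."""
    return [math.tanh(sum(w * x for w, x in zip(row, obs))) for row in w_hidden]


def policy_probs(w_out, h):
    """Last semi-linear layer (assumed form): a linear map followed by a softmax."""
    logits = [sum(w * hi for w, hi in zip(row, h)) for row in w_out]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def online_policy_update(w_out, h, action, reward, baseline, lr):
    """One policy-gradient step on the last layer, using only the current sample.

    No replay buffer is involved: the sample is used once and discarded.
    """
    probs = policy_probs(w_out, h)
    advantage = reward - baseline
    for a, row in enumerate(w_out):
        # Gradient of log pi(action) with respect to logit a.
        grad_logit = (1.0 if a == action else 0.0) - probs[a]
        for j in range(len(row)):
            row[j] += lr * advantage * grad_logit * h[j]


def perturb_hidden(w_hidden, fraction, scale, rng):
    """Add Gaussian noise to the incoming weights of a small random subset of hidden units."""
    k = max(1, int(fraction * len(w_hidden)))
    chosen = rng.sample(range(len(w_hidden)), k)
    for i in chosen:
        w_hidden[i] = [w + rng.gauss(0.0, scale) for w in w_hidden[i]]
    return chosen


def run_demo(steps=300, seed=0):
    """Toy two-action task (hypothetical): action 1 pays reward 1, action 0 pays 0."""
    rng = random.Random(seed)
    obs = [1.0, 0.5]
    w_hidden = [[rng.gauss(0.0, 1.0) for _ in obs] for _ in range(8)]
    w_out = [[0.0] * 8 for _ in range(2)]
    baseline = 0.0
    for t in range(steps):
        h = hidden_features(w_hidden, obs)
        probs = policy_probs(w_out, h)
        action = 0 if rng.random() < probs[0] else 1
        reward = float(action)
        online_policy_update(w_out, h, action, reward, baseline, lr=0.1)
        baseline += 0.1 * (reward - baseline)  # running-average reward baseline
        if (t + 1) % 50 == 0:
            # Occasionally perturb one of the eight hidden units.
            perturb_hidden(w_hidden, fraction=0.125, scale=0.05, rng=rng)
    return policy_probs(w_out, hidden_features(w_hidden, obs))
```

Calling `run_demo()` returns the final action probabilities for the toy task. Per step, the update costs one pass over the last layer's weights, which is the kind of budget the abstract has in mind for onboard learning; whether this particular update matches the proposed method is not stated in the source.
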
We investigate whether such a lightweight learning system, in conjunction with a more expensive replay-based learning system, performs better than replay-based learning alone. The proposed program also aims at developing efficient and stable policy and representation learning methods. We develop a theoretical framework that enriches our understanding of how to create new and efficient policy learning methods in a directed way. For representation learning, we extend an existing strategy for representation search, called generate-and-test, to reinforcement learning. We develop a general generate-and-test mechanism in which the utility of features is defined solely in terms of the loss function, making it applicable to any loss function and neural architecture. Computationally inexpensive learning mechanisms are essential for making reinforcement learning systems more accessible and applicable to robotics. The lightweight onboard system of the proposed program will allow graduate students, entrepreneurs, and enthusiasts around the world to build continually learning robots more easily, relieving humans of numerous laborious tasks.
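
A loss-based generate-and-test loop of the kind mentioned above might look like the following sketch. The abstract does not give the utility definition, so this assumes one plausible choice: the utility of a feature is the increase in loss when that feature is zeroed out. A supervised squared-error loss with a linear readout is used purely for brevity; in the proposed setting the loss would be a reinforcement learning loss. All names are hypothetical.

```python
import math
import random


def hidden_features(w_hidden, x):
    """Hidden features: tanh of a linear map of the input."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]


def mean_loss(w_hidden, v, data, ablate=None):
    """Mean squared error of the linear readout; optionally zero out one feature."""
    total = 0.0
    for x, y in data:
        h = hidden_features(w_hidden, x)
        if ablate is not None:
            h[ablate] = 0.0
        pred = sum(vi * hi for vi, hi in zip(v, h))
        total += (pred - y) ** 2
    return total / len(data)


def feature_utilities(w_hidden, v, data):
    """Utility of feature i (assumed definition): loss increase when i is ablated."""
    base = mean_loss(w_hidden, v, data)
    return [mean_loss(w_hidden, v, data, ablate=i) - base for i in range(len(w_hidden))]


def generate_and_test(w_hidden, v, data, n_replace, rng):
    """Test: rank features by utility. Generate: replace the least useful ones."""
    utils = feature_utilities(w_hidden, v, data)
    worst = sorted(range(len(w_hidden)), key=lambda i: utils[i])[:n_replace]
    for i in worst:
        # Draw fresh random incoming weights for the replaced feature.
        w_hidden[i] = [rng.gauss(0.0, 1.0) for _ in w_hidden[i]]
        # Zero the outgoing weight so the new feature does not disturb the output.
        v[i] = 0.0
    return worst
```

Because the utility is computed only from calls to `mean_loss`, swapping in a different loss or a different feature layer leaves `feature_utilities` and `generate_and_test` unchanged, which is the property the abstract emphasizes. Ablating each feature in turn costs one extra loss evaluation per feature, so a practical onboard version would likely need a cheaper utility estimate.
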
