
Efficient Robotic Reinforcement Learning via Off-Policy and Meta-Learning

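Basic information

Grant number:
2285275
Principal investigator:
Amount:
--
Host institution:
University of Oxford
Host institution country:
United Kingdom
Project category:
Studentship
Fiscal year:
2019
Funding country:
United Kingdom
Start and end dates:
2019 to (no data)
Project status:
Completed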

Source:
https://gtr.ukri.org/projects?ref=studentship-2285275
Keywords:
Efficient Robotic Reinforcement Learning via

Project abstract

This project falls within the EPSRC Artificial Intelligence and Robotics research areas.

Research Context

Deep reinforcement learning methods are increasingly studied in robotics because they promise more flexible control policies with less manual engineering overhead. In contrast to traditional robotics methods, where highly specialized experts specify the control policy for each task separately, learning algorithms can acquire general behaviors from their own experience, much as many biological organisms do. Deep learning models have been shown to generalize well when trained on diverse datasets, and the key to their success lies in their ability to learn millions of parameters from large amounts of training data. One of the major limitations of real-world robotic learning is that we cannot afford to collect large enough datasets for "ImageNet-scale" generalization within a single experiment.

Potential Impact

While robots are becoming increasingly affordable to average consumers, the set of tasks they can carry out is limited by the difficulty of designing robust control policies. Robots could reduce the human burden in many everyday tasks, such as cooking at home, elderly care in assisted-living communities, surgery in hospitals, or rescue operations in dangerous disaster zones. Algorithms that can learn and generalize efficiently are crucial to bringing useful low-cost robots to wider audiences.

Objectives and Research Methodology

For reinforcement learning algorithms to evolve into practical methods for complex real-world tasks, we must design novel algorithms that get around the issue of data scarcity. One possible way is to better leverage existing historical data. Towards this goal, we propose to investigate how to (1) better utilize off-policy data, that is, data collected outside of the specific robot experiment, and (2) meta-learn policies that can adapt to new tasks quickly.

There is an abundance of previously collected robotic data available, which already provides a large and diverse body of experience for learning robotic methods. With the ability to incorporate this experience into reinforcement learning, we can get policies to truly generalize across different objects, environments, scenes, and possibly even across different robots. For example, we could use the RoboNet dataset to improve the training of single-task or multi-task reinforcement learning. We would define one or more tasks manually and relabel all the RoboNet data with these task rewards. We would then run an off-policy reinforcement learning algorithm, such as Soft Actor-Critic, on the large set of RoboNet data and a modest amount of new data for each task. This should allow policies to generalize and learn faster.

Meta-reinforcement learning algorithms allow agents to rapidly adapt to new tasks by exploiting the structural similarities of previously collected experiences. Existing meta-learning algorithms operate mainly in a setting where all the experience is accessible to the learner in a single batch. In the real world, however, agents typically encounter tasks sequentially, which is why we should extend the current meta-learning formulation to support such streaming experience. Another interesting direction would be to formulate versions of meta-reinforcement learning in which the MDPs do not necessarily share the same state and action spaces. This would require developing new model architectures that can read in heterogeneous state spaces and output heterogeneous actions.
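
The idea of relabeling previously collected data with a manually defined task reward, then training on that data together with a modest amount of new data, could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `Transition`, `relabel`, `sample_mixed_batch`, `reach_goal_reward`, and the `fresh_fraction` parameter are all hypothetical names, the data layout is assumed, nothing here uses the RoboNet loading code, and the Soft Actor-Critic update itself is omitted.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class Transition:
    """One logged step; this layout is an assumption, not a dataset format."""
    state: Vector
    action: Vector
    reward: float
    next_state: Vector
    done: bool


RewardFn = Callable[[Vector, Vector, Vector], float]


def relabel(transitions: Sequence[Transition], reward_fn: RewardFn) -> List[Transition]:
    """Return copies of the transitions with rewards recomputed for one task."""
    return [
        replace(t, reward=reward_fn(t.state, t.action, t.next_state))
        for t in transitions
    ]


def sample_mixed_batch(
    prior: Sequence[Transition],
    fresh: Sequence[Transition],
    batch_size: int,
    fresh_fraction: float,
    rng: random.Random,
) -> List[Transition]:
    """Draw a batch (with replacement) mixing relabeled prior data and new data."""
    n_fresh = round(batch_size * fresh_fraction) if fresh else 0
    n_prior = batch_size - n_fresh
    batch = rng.choices(fresh, k=n_fresh) if n_fresh else []
    if n_prior:
        batch += rng.choices(prior, k=n_prior)
    return batch


def reach_goal_reward(state: Vector, action: Vector, next_state: Vector) -> float:
    """Hypothetical task reward: negative distance of the first coordinate to 2.0."""
    return -abs(next_state[0] - 2.0)
```

Each batch returned by `sample_mixed_batch` would then be passed to the update step of whichever off-policy algorithm is used. Fixing the share of new data per batch is one simple design choice among several; it keeps the small task-specific dataset from being drowned out by the much larger prior dataset.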
