Scaling Unsupervised Environment Design

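Basic information

Grant number:
2888076
Principal investigator:
Amount:
--
Host institution:
University of Oxford
Host institution country:
United Kingdom
Project category:
Studentship
Fiscal year:
2023
Funding country:
United Kingdom
Start and end dates:
2023 to (no data)
Project status:
Not concluded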

Source:
https://gtr.ukri.org/projects?ref=studentship-2888076
Keywords:
Scaling Unsupervised Environment Design

Project abstract

Reinforcement learning (RL) is a subfield of machine learning in which an agent (e.g. an autonomous vehicle) learns by acting in an environment (e.g. a real road or a simulation of one). Despite great progress on complex games (Atari, Go, StarCraft), RL has not yet been successfully applied to many real-world problems. The root cause is the inability of RL agents to generalise to unseen scenarios. Specifically, an RL agent trained in simulation does not transfer well when deployed in the real world, because of the inevitable inaccuracies of the simulation. (Training an agent directly in the real world is often impractical, given the large volume of training data needed and the potential dangers.)

Recent pioneering work has demonstrated significant empirical benefits to generalisation from training a teacher that learns to propose high-quality scenarios (e.g. road layouts) for the agent to train on, mirroring results from supervised learning that show the importance of data quality for generalisation. A limitation of this work is that the teacher has to learn from a sparse and noisy signal. This results in low sample efficiency and requires large computational resources, so the approach has only been applied successfully to very simple problems.

To reduce signal noise, I have proposed methods that encourage the teacher to maintain a diverse set of scenarios, using metrics for approximated surprise, ease of discrimination and distance in a learned latent space. Furthermore, I propose a novel data augmentation method in which scenarios are decomposed into a set of 'sub-scenarios', expanding the training data at minimal computational cost. Finally, the current state-of-the-art method trains the teacher by applying random perturbations. I suggest a method for targeted perturbations that continually approximates the agent's regret (the difference between how well it did at the task and how well an optimal agent would have done) and applies perturbations where this is lowest.

All these techniques aim to improve the efficiency of the overall process, reducing the resources needed and opening this powerful technique up to more complex domains, to the benefit of real-world applications such as autonomous driving. It should be noted that, while I use autonomous driving as a running example, the methods being developed will be generalisable to any RL problem and will be evaluated over a diverse range of environments. This project falls within the EPSRC Artificial intelligence technologies research area.
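The regret-based targeting described in the abstract can be illustrated with a small sketch. This is not the project's method, which the abstract does not specify. It assumes, purely for illustration, that a scenario is a dictionary of named regions with one numeric parameter each, and that the unknown optimal return is approximated by the best return observed so far in each region. The names `estimate_regret`, `pick_perturbation_target` and `perturb` are hypothetical.

```python
import random
from statistics import mean


def estimate_regret(agent_returns, best_known_return):
    """Approximate regret: proxy for the optimal return minus the mean agent return.

    The true optimal return is unknown, so `best_known_return` is an assumed
    stand-in (for example, the best return observed so far).
    """
    return max(best_known_return - mean(agent_returns), 0.0)


def pick_perturbation_target(returns_by_region, best_by_region):
    """Return the region with the lowest estimated regret, plus all estimates."""
    regrets = {
        region: estimate_regret(returns, best_by_region[region])
        for region, returns in returns_by_region.items()
    }
    target = min(regrets, key=regrets.get)
    return target, regrets


def perturb(scenario, region, rng, scale=0.1):
    """Return a copy of the scenario with one region's parameter jittered."""
    perturbed = dict(scenario)  # leave the original scenario untouched
    perturbed[region] = scenario[region] + rng.uniform(-scale, scale)
    return perturbed


if __name__ == "__main__":
    # Hypothetical road layout: one difficulty parameter per region.
    scenario = {"junction": 0.8, "straight": 0.2, "bend": 0.5}
    returns_by_region = {
        "junction": [0.2, 0.4],
        "straight": [0.9, 1.0],
        "bend": [0.5, 0.7],
    }
    best_by_region = {"junction": 1.0, "straight": 1.0, "bend": 1.0}

    target, regrets = pick_perturbation_target(returns_by_region, best_by_region)
    new_scenario = perturb(scenario, target, random.Random(0))
    print(target, regrets, new_scenario)
```

Compared with perturbing a region chosen uniformly at random, this spends each edit on the part of the scenario selected by the regret estimate, so the quality of the targeting depends entirely on how good that estimate is.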
