
RI: Small: Combining Reinforcement Learning and Deep Learning Methods to Address High-Dimensional Perception, Partial Observability and Delayed Reward


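Basic Information

Award number:
1526059
Principal investigator:
Satinder Baveja
Amount:
approx. $499,900
Host institution:
Regents of the University of Michigan - Ann Arbor
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2015
Funding country:
United States
Start and end dates:
2015-09-01 to 2020-08-31
Project status:
Completed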

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1526059&HistoricalAwards=false
Keywords:
RI Small Combining Reinforcement Learning

Project Abstract

Consider the problem faced by a machine agent that has to interact with some dynamical environment to achieve some goals. Concretely, imagine an agent engaged in a virtual competition as a human would be. It can see the screen, which is composed of many moving objects. At any time, it can choose one of a dozen or so actions. Its action controls one of the objects on the screen, but it often is not clear which one. Every so often an evaluation of the competition is given. At some point the competition ends. How should such an agent choose actions, or more importantly, how can we build agents that can learn to compete, i.e., achieve high scores, through trial and error? In this project, methods will be developed and evaluated to build such agents. The above problem is an instance of what is called a reinforcement learning (RL) problem. Such problems abound in sequential decision-making settings. Applications in industry include factory optimization, robotics, and chronic disease management (to list but three diverse domains of interest). Like many of these RL problems, Atari games (used as a testbed here to evaluate learning strategies) have three characteristics of interest to this project. First, they generate high-dimensional images, and so the agent faces a difficult perception problem. Second, they often have deeply delayed rewards; i.e., actions have long-term consequences. For example, losing a resource may not cost anything at the moment of loss, but could lead to very high losses much later when that resource is critically necessary. Third, they have deep partial observability; i.e., to compete effectively one often has to remember the distant past. For example, a location encountered far back in the past may become valuable much later because a critical resource becomes available at that time, and the agent would have to find its way back to that location to use the resource.
It is proposed to address these three challenges respectively with new neural network architectures for predicting the consequences of actions, new methods for intrinsically motivating agents even when reward is delayed, and new recurrent neural network architectures to remember the past effectively. Success of the proposed work is expected to significantly expand the scope of application of reinforcement learning. Finally, Atari games will be used instead of, say, factory optimization as an evaluation domain because they are readily available. They will be used to draw the interest of high-school students and under-represented undergraduate students to the complex ideas underlying the proposed work; their fun visualizations will allow them to be integrated into teaching in the PIs' classes; and the variety of games, which differ in their degree of difficulty along the three challenge dimensions, will allow more effective control of the evaluations.
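
The agent-environment interaction described in the abstract can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in rather than the project's method: `ToyDelayedRewardEnv` is an invented one-dimensional corridor (not an Atari game), the two-action set replaces the "dozen or so actions", and the policies are fixed rules rather than learned ones. It only shows the loop of observing, acting, and receiving a reward that arrives late, at the end of the episode.

```python
import random


class ToyDelayedRewardEnv:
    """Hypothetical toy corridor: the only reward comes when the goal is reached."""

    def __init__(self, length=5, max_steps=20):
        self.length = length        # distance from the start to the goal
        self.max_steps = max_steps  # the episode ends after this many steps
        self.pos = 0
        self.steps = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.pos = 0
        self.steps = 0
        return self.pos

    def step(self, action):
        """Apply an action (0 = left, 1 = right); return (observation, reward, done)."""
        move = 1 if action == 1 else -1
        self.pos = max(0, self.pos + move)
        self.steps += 1
        reached_goal = self.pos >= self.length
        done = reached_goal or self.steps >= self.max_steps
        # Delayed reward: zero on every step except the one that reaches the goal.
        reward = 1.0 if reached_goal else 0.0
        return self.pos, reward, done


def run_episode(env, policy):
    """Run one episode of trial-and-error interaction and return the total reward."""
    obs = env.reset()
    total = 0.0
    done = False
    while not done:
        action = policy(obs)
        obs, reward, done = env.step(action)
        total += reward
    return total


def always_right(obs):
    """Fixed policy that always moves toward the goal."""
    return 1


def always_left(obs):
    """Fixed policy that never makes progress."""
    return 0


def make_random_policy(seed=0):
    """Return a policy that picks actions uniformly at random (seeded)."""
    rng = random.Random(seed)
    return lambda obs: rng.choice([0, 1])


if __name__ == "__main__":
    env = ToyDelayedRewardEnv()
    print("always_right:", run_episode(env, always_right))
    print("always_left:", run_episode(env, always_left))
    print("random:", run_episode(env, make_random_policy()))
```

A learning agent would replace the fixed policies with one that is updated from the observed rewards; the sketch deliberately leaves that part out, since the abstract does not specify the learning algorithm.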
