Developing inquisitive, model-based agents for reinforcement learning
Basic Information
- Grant number: RGPIN-2019-06079
- Principal investigator:
- Amount: $20,400
- Host institution:
- Host institution country: Canada
- Program: Discovery Grants Program - Individual
- Fiscal year: 2020
- Funding country: Canada
- Duration: 2020-01-01 to 2021-12-31
- Status: Completed
- Source:
- Keywords:
Project Summary
Natural agents, like animals, learn from a lifetime of experience. Most artificial learning systems do not. Newborns begin life with a frenzy of learning: attempting to master their muscle twitches and make sense of their visual inputs. This knowledge is continuously reused and refined throughout life. Our current Artificial Intelligence (AI) systems are well suited to problems with a clear cause-and-effect relationship between the system's decisions and the utility of those decisions. Swimming into a shark will cause a loss of life; shooting an alien ship will increase the score. However, in problems where the consequences of a decision are significantly delayed, this mapping is much harder to learn. The most challenging and largely unsolved AI benchmark problems feature such delayed consequences. It is common practice for state-of-the-art systems to train for the equivalent of 30 days on each Atari game, and still perform well below human level in games that feature delayed consequences.
One way to deal with the problem of delayed consequences is for the AI to construct its own understanding of how the world works, usually called a model of the world. A model encodes the regularities of the world. For example, a model might encode: (1) when I am lined up with a shark and I decide to fire a torpedo, the shark will disappear, and (2) if I am standing on a platform and I decide to jump down, I will end up on the ground. Given access to a model of this form, an AI can mentally simulate the future situations that would result from behaving in particular ways, without actually interacting with the world. Just as a human can imagine where they might end up if they took a new path down to the river, an AI with a model can imagine the outcome of an alternative course of action without physically taking it, avoiding unnecessary exploration unless it decides that exploration is valuable. Model-based mental simulation can dramatically improve the efficiency of learning.
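The idea of learning a model and then "mentally" rolling it forward can be sketched in a few lines. This is an illustrative, Dyna-style tabular sketch, not the grant's actual method; the three-state chain environment and all names here are hypothetical.

```python
# Minimal sketch of model-based "mental simulation" (assumptions: a tiny
# deterministic 3-state chain; a tabular model that stores the last
# observed outcome of each (state, action) pair, Dyna-style).

class TabularModel:
    """Remembers the last observed outcome of each (state, action) pair."""
    def __init__(self):
        self.transitions = {}  # (state, action) -> (reward, next_state)

    def update(self, state, action, reward, next_state):
        self.transitions[(state, action)] = (reward, next_state)

    def simulate(self, state, action):
        # "Imagine" the outcome without touching the real world.
        return self.transitions.get((state, action))

def env_step(s, a):
    """Toy chain: action +1/-1 moves along states 0..2; state 2 pays +1."""
    s2 = max(0, min(2, s + a))
    return (1.0 if s2 == 2 else 0.0), s2

# Learn the model from a handful of real interactions.
model = TabularModel()
s = 0
for a in [1, 1, -1, 1]:
    r, s2 = env_step(s, a)
    model.update(s, a, r, s2)
    s = s2

# Now evaluate a candidate action sequence purely in imagination.
imagined_return, sim_state = 0.0, 0
for a in [1, 1]:
    r, sim_state = model.simulate(sim_state, a)
    imagined_return += r

print(imagined_return)  # → 1.0: the model predicts reaching the goal
```

Once trained, the imagined rollout costs no real-world interaction, which is the efficiency gain the paragraph above describes.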
The remaining question is how the system decides how best to make use of mental simulation. People often decide to try things they have never done before. We choose to engage in activities that are mentally and physically challenging, but not beyond our abilities. Humans are motivated by novelty, curiosity, and knowledge seeking, and bored by things we already know about. Combining this idea with a model could allow an AI to simulate different ways of behaving, preferring those that reduce uncertainty and yield new knowledge. With a model, the AI can generate its own internal feedback to focus its mental simulations. The objective of this research program is twofold: (1) to design new approaches for representing and learning models of the world, and (2) to integrate mechanisms that guide mental simulation (planning) and exploration toward uncertainty reduction and knowledge acquisition.
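One simple way this internal feedback is often realized is a count-based curiosity bonus: state-action pairs tried less often receive a larger intrinsic reward, drawing simulated rollouts toward the unknown. The sketch below is illustrative only (the function names and the "platform"/"jump" example are hypothetical), not the program's proposed mechanism.

```python
import math

# Illustrative count-based curiosity bonus: novel (state, action) pairs get
# a large bonus that decays as 1/sqrt(visits + 1). During planning, this
# bonus can be added to simulated rewards so imagined behavior prefers
# actions that would reduce uncertainty.

visit_counts = {}  # (state, action) -> number of real visits

def record_visit(state, action):
    visit_counts[(state, action)] = visit_counts.get((state, action), 0) + 1

def curiosity_bonus(state, action, scale=1.0):
    n = visit_counts.get((state, action), 0)
    return scale / math.sqrt(n + 1)  # high for novel pairs, decays with visits

# A well-practiced action loses its novelty appeal...
for _ in range(99):
    record_visit("platform", "jump")

familiar = curiosity_bonus("platform", "jump")  # 1/sqrt(100) = 0.1
novel = curiosity_bonus("platform", "fire")     # 1/sqrt(1)   = 1.0

# ...so a planner ranking imagined actions by (reward + bonus) favors the
# never-tried action whenever the predicted external rewards are equal.
print(novel > familiar)  # → True
```

The design choice here mirrors the boredom/novelty contrast in the paragraph above: the bonus is generated entirely from the agent's own statistics, with no external supervision.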
Project Outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other Publications by White, Adam
Multi-timescale nexting in a reinforcement learning robot
- DOI: 10.1177/1059712313511648
- Published: 2014-04-01
- Journal:
- Impact factor: 1.6
- Authors: Modayil, Joseph; White, Adam; Sutton, Richard S.
- Corresponding author: Sutton, Richard S.
Questioning Anglocentrism in plural policing studies: Private security regulation in Belgium and the United Kingdom
- DOI: 10.1177/14773708211014853
- Published: 2021-05-12
- Journal:
- Impact factor: 1.9
- Authors: Leloup, Pieter; White, Adam
- Corresponding author: White, Adam
A Qualitative Exploration of Parents' Perceptions of Risk in Youth Contact Rugby.
- DOI: 10.3390/bs12120510
- Published: 2022-12-14
- Journal:
- Impact factor: 2.6
- Authors: Anderson, Eric; White, Adam; Hardwicke, Jack
- Corresponding author: Hardwicke, Jack
From eye-blinks to state construction: Diagnostic benchmarks for online representation learning.
- DOI: 10.1177/10597123221085039
- Published: 2023-03
- Journal:
- Impact factor: 1.6
- Authors: Rafiee, Banafsheh; Abbas, Zaheer; Ghiassian, Sina; Kumaraswamy, Raksha; Sutton, Richard S.; Ludvig, Elliot A.; White, Adam
- Corresponding author: White, Adam
Teachers' stories: physical education teachers' constructions and experiences of masculinity within secondary school physical education
- DOI: 10.1080/13573322.2015.1112779
- Published: 2017-01-01
- Journal:
- Impact factor: 2.9
- Authors: White, Adam; Hobson, Michael
- Corresponding author: Hobson, Michael
Other Grants by White, Adam
Developing inquisitive, model-based agents for reinforcement learning
- Grant number: RGPIN-2019-06079
- Fiscal year: 2022
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Developing inquisitive, model-based agents for reinforcement learning
- Grant number: RGPIN-2019-06079
- Fiscal year: 2021
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Developing inquisitive, model-based agents for reinforcement learning
- Grant number: RGPIN-2019-06079
- Fiscal year: 2019
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Developing inquisitive, model-based agents for reinforcement learning
- Grant number: DGECR-2019-00479
- Fiscal year: 2019
- Amount: $20,400
- Program: Discovery Launch Supplement
Leveraging spectrally encoded beads for multiplexed nucleic acid detection
- Grant number: 503082-2017
- Fiscal year: 2018
- Amount: $20,400
- Program: Postdoctoral Fellowships
Leveraging spectrally encoded beads for multiplexed nucleic acid detection
- Grant number: 503082-2017
- Fiscal year: 2017
- Amount: $20,400
- Program: Postdoctoral Fellowships
Particle Size Analysis in Marine Sediments
- Grant number: 516368-2017
- Fiscal year: 2017
- Amount: $20,400
- Program: University Undergraduate Student Research Awards
Particle Size Analysis in Marine Sediments
- Grant number: 505971-2016
- Fiscal year: 2016
- Amount: $20,400
- Program: University Undergraduate Student Research Awards
Single cell gene expression analysis by microfluidic digital PCR
- Grant number: 427647-2012
- Fiscal year: 2013
- Amount: $20,400
- Program: Alexander Graham Bell Canada Graduate Scholarships - Doctoral
Single cell gene expression analysis by microfluidic digital PCR
- Grant number: 427647-2012
- Fiscal year: 2012
- Amount: $20,400
- Program: Alexander Graham Bell Canada Graduate Scholarships - Doctoral
Similar International Grants
Developing inquisitive, model-based agents for reinforcement learning
- Grant number: RGPIN-2019-06079
- Fiscal year: 2022
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Developing inquisitive, model-based agents for reinforcement learning
- Grant number: RGPIN-2019-06079
- Fiscal year: 2021
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Model Theory and proof theory of probabilistic logic in propositional and modal team semantics
- Grant number: 19F19797
- Fiscal year: 2019
- Amount: $20,400
- Program: Grant-in-Aid for JSPS Fellows
Developing inquisitive, model-based agents for reinforcement learning
- Grant number: RGPIN-2019-06079
- Fiscal year: 2019
- Amount: $20,400
- Program: Discovery Grants Program - Individual
Proof-Theoretic Study of Doxastic and Epistemic Updates via Questions
- Grant number: 19K12113
- Fiscal year: 2019
- Amount: $20,400
- Program: Grant-in-Aid for Scientific Research (C)
Developing inquisitive, model-based agents for reinforcement learning
- Grant number: DGECR-2019-00479
- Fiscal year: 2019
- Amount: $20,400
- Program: Discovery Launch Supplement
Proof-theoretic study of multi-agent interaction via many-dimensional and many-sorted logics
- Grant number: 15K21025
- Fiscal year: 2015
- Amount: $20,400
- Program: Grant-in-Aid for Young Scientists (B)
Empirical studies on development of inquisitive learning program in the field of astronomy to promote science literacy
- Grant number: 15K04407
- Fiscal year: 2015
- Amount: $20,400
- Program: Grant-in-Aid for Scientific Research (C)
Optics and Photonics Training for Inquisitive eXperimentalists (OPTIX)
- Grant number: 1505919
- Fiscal year: 2015
- Amount: $20,400
- Program: Continuing Grant
The semantics and syntax of modal indefinites in Japanese
- Grant number: 26370450
- Fiscal year: 2014
- Amount: $20,400
- Program: Grant-in-Aid for Scientific Research (C)