Proto-Value Functions: A Unified Framework for Learning Task-Specific Behaviors and Task-Independent Representations
Basic Information
- Award Number: 0534999
- Principal Investigator:
- Amount: $443,600
- Host Institution:
- Host Institution Country: United States
- Grant Type: Continuing Grant
- Fiscal Year: 2006
- Funding Country: United States
- Project Period: 2006-01-01 to 2009-12-31
- Project Status: Completed
- Source:
- Keywords:
Project Abstract
This project addresses a longstanding puzzle in artificial intelligence (AI): how can agents transform their temporal experience into multiscale, task-independent representations that effectively guide long-term task-specific behavior? The project will investigate a nonparametric framework combining task-independent learning with task-specific learning. Algorithmically, the framework comprises four phases. First, agents learn a discrete manifold representation of a given environment, which can be viewed as a topological graph whose vertices are the states reachable through single-step or multi-step actions. Next, the graph is analyzed using spectral clustering techniques to reveal "bottlenecks," symmetries, and other geometric invariants. In the third phase, an orthonormal set of task-independent basis functions called proto-value functions is extracted from the environment's topology: these basis functions capture large-scale geometric invariants that all value functions on the state space must respect. In the final phase, proto-value functions are combined with rewards to approximate task-specific value functions. The proposed framework unifies two previously disparate lines of research in AI: the learning of behavior using value functions, pioneered by Arthur Samuel, and the learning of representations based on global state-space analysis, pioneered by Saul Amarel. The theoretical basis for the framework draws upon links between discrete and continuous mathematics: Riemannian manifolds and the spectral theory of graphs; elliptic differential equations and abstract harmonic analysis on graphs. Specifically, the Hilbert space of smooth functions on a Riemannian manifold has a discrete spectrum given by the eigenfunctions of the Laplace-Beltrami operator.
The applications of this theory to Markov decision processes will be explored, in particular the ability of Laplacian eigenfunctions, or proto-value functions, both to capture large-scale geometric structure and to approximate task-specific value functions. A novel class of algorithms termed Representation Policy Iteration (RPI), which interleaves representation learning and behavior learning, will be investigated. The research thus also addresses a longstanding question left unresolved by much previous work on approximation methods for solving large Markov decision processes: how can basis functions be generated automatically? The research will investigate the scalability of the proposed framework to larger problems, including both discrete factored state spaces and continuous state spaces. The testbeds include simulated discrete and continuous benchmark problems, simulated and real robot testbeds, and an information-extraction task of maintaining the Reinforcement Learning Repository (RLR), the world's largest collection of documents and data relating to reinforcement learning. Broader impacts of this project include algorithmic and theoretical insights leading to a unified approach to learning behavior and representation, as well as applications to real-world problems such as humanoid robotics and web-repository maintenance. Additionally, this project will give valuable research experience to women graduate students and to undergraduate students from local four-year colleges.
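The four-phase pipeline described in the abstract can be illustrated with a minimal sketch. This is not code from the project itself; it assumes a simple 1-D chain of states as the environment graph and a synthetic target value function, with all names illustrative. It builds the state-space graph, forms the combinatorial graph Laplacian, takes its smoothest eigenvectors as proto-value functions, and then fits a task-specific value function by least squares in that basis.

```python
import numpy as np

# Phase 1: represent the environment as a graph. Here, a chain of n states
# where each state connects to its single-step neighbors (illustrative).
n = 20
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0   # symmetric adjacency (edge weights)

# Phases 2-3: spectral analysis of the graph. Form the combinatorial
# Laplacian L = D - W; its low-order eigenvectors vary smoothly over the
# graph and serve as task-independent proto-value functions.
D = np.diag(W.sum(axis=1))
L = D - W
eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
k = 5
pvf = eigvecs[:, :k]                   # basis matrix (n states x k functions)

# Phase 4: combine the basis with reward information. As a stand-in for a
# learned value function, use a synthetic target: discounted value of
# reaching the rightmost state of the chain.
gamma = 0.95
V_true = gamma ** (n - 1 - np.arange(n))

# Least-squares projection of the target value function onto the PVF basis.
w, *_ = np.linalg.lstsq(pvf, V_true, rcond=None)
V_hat = pvf @ w

print("max approximation error:", np.abs(V_hat - V_true).max())
```

The design choice this illustrates is the one the abstract emphasizes: the basis (`pvf`) depends only on the graph's topology, not on any reward, so the same basis can be reused across tasks that share the state space; only the final least-squares step is task-specific.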
Project Outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other Publications by Sridhar Mahadevan
Privacy Aware Experiments without Cookies
- DOI:
- Publication year: 2022
- Journal:
- Impact factor: 0
- Authors: Shiv Shankar; Ritwik Sinha; Saayan Mitra; Viswanathan Swaminathan; Sridhar Mahadevan; Moumita Sinha
- Corresponding author: Moumita Sinha
Categoroids: Universal Conditional Independence
- DOI:
- Publication year: 2022
- Journal:
- Impact factor: 0
- Authors: Sridhar Mahadevan
- Corresponding author: Sridhar Mahadevan
Reconfigurable adaptable micro-robot
- DOI: 10.1109/icsmc.1999.816634
- Publication year: 1999
- Journal:
- Impact factor: 0
- Authors: R. Tummala; Ranjan Mukherjee; D. M. Aslam; Ning Xi; Sridhar Mahadevan; J. Weng
- Corresponding author: J. Weng
Quantifying Prior Determination Knowledge Using the PAC Learning Model
- DOI: 10.1023/a:1022605018507
- Publication year: 1994-10-01
- Journal:
- Impact factor: 2.900
- Authors: Sridhar Mahadevan; Prasad Tadepalli
- Corresponding author: Prasad Tadepalli
Other Grants by Sridhar Mahadevan
Collaborative Research: Transfer Learning for Chemical Analyses from Laser-Induced Spectroscopy
- Award Number: 1307179
- Fiscal Year: 2013
- Amount: $443,600
- Grant Type: Standard Grant
RI: Small: Reinforcement Learning by Mirror Descent
- Award Number: 1216467
- Fiscal Year: 2012
- Amount: $443,600
- Grant Type: Standard Grant
NeTS Small: Analysis and Design of Best-Effort Content-Caching Networks
- Award Number: 1117764
- Fiscal Year: 2011
- Amount: $443,600
- Grant Type: Standard Grant
Manifold Alignment of High-Dimensional Data Sets
- Award Number: 1025120
- Fiscal Year: 2010
- Amount: $443,600
- Grant Type: Standard Grant
RI-Medium: Collaborative Research: Learning Multiscale Representations using Harmonic Analysis on Graphs
- Award Number: 0803288
- Fiscal Year: 2008
- Amount: $443,600
- Grant Type: Standard Grant
Scaling Reinforcement Learning by Adaptive Task Selection and Linear Solution Merging
- Award Number: 9896122
- Fiscal Year: 1997
- Amount: $443,600
- Grant Type: Continuing Grant
Scaling Reinforcement Learning by Adaptive Task Selection and Linear Solution Merging
- Award Number: 9501852
- Fiscal Year: 1995
- Amount: $443,600
- Grant Type: Continuing Grant
Support for a Workshop on Reinforcement Learning
- Award Number: 9529108
- Fiscal Year: 1995
- Amount: $443,600
- Grant Type: Standard Grant
Similar NSFC Grants
Research on Value-at-Risk Forecasting Models Based on Quantile Dependence Between Time Series
- Award Number: 71903144
- Year Approved: 2019
- Amount: ¥170,000
- Grant Type: Young Scientists Fund
Similar Overseas Grants
A study on value distribution properties of meromorphic functions generated by a wide variety of series and an investigation into their possible algebraic analogues
- Award Number: 22K03335
- Fiscal Year: 2022
- Amount: $443,600
- Grant Type: Grant-in-Aid for Scientific Research (C)
The intermediate value theorem for functions on the Levi-Civita field
- Award Number: 574661-2022
- Fiscal Year: 2022
- Amount: $443,600
- Grant Type: University Undergraduate Student Research Awards
Value-distribution theory of zeta and multiple zeta functions
- Award Number: 22K03267
- Fiscal Year: 2022
- Amount: $443,600
- Grant Type: Grant-in-Aid for Scientific Research (C)
New developments in the anticyclotomic Iwasawa theory and special value formulas on L-functions
- Award Number: 22H00096
- Fiscal Year: 2022
- Amount: $443,600
- Grant Type: Grant-in-Aid for Scientific Research (A)
Value distribution of L-functions in the critical strip
- Award Number: 565498-2021
- Fiscal Year: 2021
- Amount: $443,600
- Grant Type: Alexander Graham Bell Canada Graduate Scholarships - Master's
Approximation theory for two-level value functions with applications
- Award Number: EP/V049038/1
- Fiscal Year: 2021
- Amount: $443,600
- Grant Type: Research Grant
Value-Distribution of Logarithmic Derivatives of L-functions
- Award Number: 551817-2020
- Fiscal Year: 2020
- Amount: $443,600
- Grant Type: University Undergraduate Student Research Awards
Research on the global structure of solutions and their stability for nonlocal boundary value problems by using elliptic functions
- Award Number: 19K03593
- Fiscal Year: 2019
- Amount: $443,600
- Grant Type: Grant-in-Aid for Scientific Research (C)
On the density functions related to the value-distributions of zeta-functions
- Award Number: 19J12037
- Fiscal Year: 2019
- Amount: $443,600
- Grant Type: Grant-in-Aid for JSPS Fellows
Value-Distribution of Dirichlet L-functions
- Award Number: 538455-2019
- Fiscal Year: 2019
- Amount: $443,600
- Grant Type: University Undergraduate Student Research Awards