Unifying Recent Advances in Deep Learning with Decision-theoretic Planning for Learned MDPs and POMDPs

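Basic Information

Grant number:
RGPIN-2022-04377
Principal investigator:
Sanner, Scott
Amount:
$40,100 (approx.)
Host institution:
University of Toronto
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2022
Funding country:
Canada
Project period:
2022-01-01 to 2023-12-31
Project status:
Completed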

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=759690
Keywords:
Unifying Recent Advances Deep Learning

Project Abstract

In many complex sequential decision-making problems ranging from urban traffic management to interactive conversational recommender systems, it is difficult (if not impossible) for humans to specify complete and accurate models of these domains. However, the ubiquity of modern sensing ranging from traffic cameras to speech-enabled home devices allows us to collect large quantities of data from these complex systems to learn predictively accurate models for purposes of optimal decision-theoretic planning. To this end, there have been two lines of research investigating planning in learned models: data-driven planning (DDP) and offline model-based reinforcement learning (MBRL). While DDP and offline MBRL have made important progress, they have not fully exploited recent advances in planning or deep learning that consider epistemic uncertainty (i.e., confidence in what a model knows), partial observability, and value-awareness (i.e., what is minimally relevant for decision-making) that are critical for many real-world applications. We will address these deficiencies in the following research themes:

Theme 1 -- Planning with Deep Bayesian Models of Epistemic Uncertainty in MDPs: To improve planning and model-learning w.r.t. epistemic uncertainty, we will (a) investigate the efficacy of recent advances in Bayesian deep learning for MDP model acquisition, (b) develop robust end-to-end planning methods for these Bayesian deep models, and (c) learn heuristics to further improve planning efficiency.

Theme 2 -- Planning in Deep-learned Partially Observed MDPs (POMDPs): To date, partial observability has received little direct attention in DDP and offline MBRL. To address this deficiency, we will (a) investigate transformers for learning accurate POMDP models, (b) investigate novel methods for deep Bayesian belief updating with complex observations (e.g., images or text), and (c) leverage (a) and (b) in novel end-to-end deep-learned POMDP planning techniques.

Theme 3 -- Value-awareness in Learned MDPs and POMDPs: When learning from rich, but incomplete observational data, it is critical for both computational and sample efficiency to learn what is minimally relevant for predicting reward. To address value-awareness for Themes 1 and 2, we will investigate planning in (a) value-aware deep MDP models with epistemic uncertainty and (b) value-aware deep POMDP models.

The research will be grounded in two key ongoing applied research projects: (i) MDPs for urban traffic signal control in collaboration with the University of Toronto Intelligent Transportation Systems Centre and (ii) POMDPs for interactive conversational recommender systems. While the proposed research will fundamentally contribute to the unification of recent advances in deep learning with decision-theoretic planning in learned MDPs and POMDPs for a variety of potential domains, these specific applications will serve as practical testbeds to validate this research program.

Project Outputs

Journal articles (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Other publications by Sanner, Scott

Evaluation of Machine Learning Algorithms for Predicting Readmission After Acute Myocardial Infarction Using Routinely Collected Clinical Data

DOI:
10.1016/j.cjca.2019.10.023
Published:
2020-06-01
Journal:
CANADIAN JOURNAL OF CARDIOLOGY
Impact factor:
6.2
Authors:
Gupta, Shagun; Ko, Dennis T.; Sanner, Scott
Corresponding author:
Sanner, Scott

Online continual learning in image classification: An empirical survey

DOI:
10.1016/j.neucom.2021.10.021
Published:
2021-11-05
Journal:
NEUROCOMPUTING
Impact factor:
6
Authors:
Mai, Zheda; Li, Ruiwen; Sanner, Scott
Corresponding author:
Sanner, Scott

Relevance- and interface-driven clustering for visual information retrieval

DOI:
10.1016/j.is.2020.101592
Published:
2020-12-01
Journal:
INFORMATION SYSTEMS
Impact factor:
3.7
Authors:
Bouadjenek, Mohamed Reda; Sanner, Scott; Du, Yihao
Corresponding author:
Du, Yihao

A longitudinal study of topic classification on Twitter.

DOI:
10.7717/peerj-cs.991
Published:
2022
Journal:
PEERJ COMPUTER SCIENCE
Impact factor:
3.8
Authors:
Bouadjenek, Mohamed Reda; Sanner, Scott; Iman, Zahra; Xie, Lexing; Shi, Daniel Xiaoliang
Corresponding author:
Shi, Daniel Xiaoliang