PPoSS: Planning: Inflight Analytics to Control Large-Scale Heterogeneous Systems
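Basic information
- Award number: 2029049
- Principal investigator:
- Amount: $250,000
- Host institution:
- Host institution country: United States
- Award type: Standard Grant
- Fiscal year: 2020
- Funding country: United States
- Period: 2020-10-01 to 2024-09-30
- Status: Completed
- Source:
- Keywords:
Project abstract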
The goal of this project is to fundamentally reinvent the design of the system, from hardware to application, using fast, novel inflight analytics to control and optimize large-scale heterogeneous computer systems to meet the performance and resiliency requirements of emerging applications such as data mining, artificial intelligence, and individualized medicine. Towards that goal, advanced machine-learning (ML) methods along with domain knowledge will be employed to support real-time system-state estimation and decision-making, including resource management, congestion/failure detection and mitigation, preemptive intrusion detection, and configuration management. Innovations across the system stack will be needed to achieve optimal results by taking full advantage of contextual information collected from multiple layers of the system and adapting rapidly to the deployment environment, workloads, and application requirements. ML-driven inflight analytics methods, developed in this effort, will be demonstrated on a heterogeneous “rack-scale” computing system, with the ultimate future objective of scaling up the framework to a warehouse-scale computing system.
The project will be organized around the following research activities.
(i) Work with noisy and incomplete telemetry data (e.g., hardware telemetry, OS-level logs, and application-level traces) available from monitors across the system stack to perform system-state estimation (e.g., resource utilization). Telemetry data are often noisy and inconsistent in terms of semantics, modalities, and time granularities, making systems only partially observable. Bayesian deep-learning models will be developed to accurately capture system states and cope with data noise and incompleteness.
(ii) Design models and algorithms for practical inflight analytics that make decisions (e.g., on scheduling or failure mitigation) based on the estimated system state to enhance system performance, reliability, and security. Such a framework will consist of an ensemble of interdependent ML models based on partially observable Markov decision processes (POMDPs) augmented with domain knowledge (e.g., interconnect topology) and trained in real time.
(iii) Synthesize hardware accelerators for fast, low-cost inflight analytics. Toward that end, a compiler and a runtime framework will be developed that take high-level declarative probabilistic programs (i.e., the POMDPs), automatically compile them onto accelerators, and plan their execution across heterogeneous hardware (FPGAs, ASICs, and CPUs/GPUs).
(iv) Assess the trustworthiness of inflight analytics. For that, a trust-assessment framework will be created to evaluate resiliency to failures and attacks due to residual imperfections of heterogeneous components, input uncertainty, and the use of stochastic ML algorithms.
While in the planning stage, this project will focus on the design of inflight analytics in the context of rack-scale systems. The methods and algorithms developed will be useful in helping smaller-scale sites with limited resources manage their systems more efficiently. Students involved in this project will have a rare opportunity to participate in the design of heterogeneous ML-driven systems with broad applicability. The integration of ML methods and algorithms into real systems can be attractive to a diverse range of individuals, including underrepresented minority students. The goal is to raise awareness of scientific and engineering challenges in the design and deployment of next-generation computing systems to support emerging applications.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
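Research activities (i) and (ii) in the abstract both rest on maintaining a probability distribution over hidden system states from noisy telemetry, which is the belief update at the core of any POMDP. The abstract does not describe the project's actual models, so the following is only a minimal sketch of a discrete belief update under stated assumptions: the two states ("healthy", "congested"), the single "no-op" action, the two latency observations, and all probabilities are hypothetical values chosen for illustration.

```python
def belief_update(belief, transition, observation_model, action, observation):
    """One discrete Bayes-filter step, as used for POMDP belief tracking.

    belief[s]                         -- current P(state = s)
    transition[a][s][s_next]          -- P(s_next | s, a)
    observation_model[a][s_next][z]   -- P(z | s_next, a)
    """
    n = len(belief)
    # Prediction: push the current belief through the transition model.
    predicted = [
        sum(belief[s] * transition[action][s][s_next] for s in range(n))
        for s_next in range(n)
    ]
    # Correction: weight each predicted state by the observation likelihood.
    unnormalized = [
        predicted[s_next] * observation_model[action][s_next][observation]
        for s_next in range(n)
    ]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [p / total for p in unnormalized]


# Hypothetical toy model: states 0 = healthy, 1 = congested; action 0 = no-op;
# observations 0 = low latency, 1 = high latency.
TRANSITION = [[[0.9, 0.1],
               [0.2, 0.8]]]
OBSERVATION_MODEL = [[[0.8, 0.2],
                      [0.3, 0.7]]]

prior = [0.5, 0.5]
posterior = belief_update(prior, TRANSITION, OBSERVATION_MODEL, action=0, observation=1)
print(posterior)  # belief in "congested" rises to roughly 0.74
```

A controller of the kind described in activity (ii) would then choose an action, such as rescheduling or failure mitigation, from this posterior rather than from a single raw telemetry reading.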
Project outcomes
Journal articles (15)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Draco: Architectural and Operating System Support for System Call Security
- DOI: 10.1109/micro50266.2020.00017
- Published: 2020-10
- Journal:
- Impact factor: 0
- Authors: Dimitrios Skarlatos; Qingrong Chen; Jianyan Chen; Tianyin Xu; J. Torrellas
- Corresponding authors: Dimitrios Skarlatos; Qingrong Chen; Jianyan Chen; Tianyin Xu; J. Torrellas
Reinforcement learning for resource management in multi-tenant serverless platforms
- DOI: 10.1145/3517207.3526971
- Published: 2022-04
- Journal:
- Impact factor: 0
- Authors: Haoran Qiu; Weichao Mao; Archit Patke; Chen Wang; H. Franke; Z. Kalbarczyk; T. Başar; R. Iyer
- Corresponding authors: Haoran Qiu; Weichao Mao; Archit Patke; Chen Wang; H. Franke; Z. Kalbarczyk; T. Başar; R. Iyer
Delay sensitivity-driven congestion mitigation for HPC systems
- DOI: 10.1145/3447818.3460362
- Published: 2021
- Journal:
- Impact factor: 0
- Authors: Patke, Archit; Jha, Saurabh; Qiu, Haoran; Brandt, Jim; Gentile, Ann; Greenseid, Joe; Kalbarczyk, Zbigniew; Iyer, Ravishankar K.
- Corresponding author: Iyer, Ravishankar K.
Reasoning about modern datacenter infrastructures using partial histories
- DOI: 10.1145/3458336.3465276
- Published: 2021-06
- Journal:
- Impact factor: 0
- Authors: Xudong Sun; L. Suresh; Aishwarya Ganesan; Ramnatthan Alagappan; Michael Gasch; Lilia Tang; Tianyin Xu
- Corresponding authors: Xudong Sun; L. Suresh; Aishwarya Ganesan; Ramnatthan Alagappan; Michael Gasch; Lilia Tang; Tianyin Xu
SIMPPO: a scalable and incremental online learning framework for serverless resource management
- DOI: 10.1145/3542929.3563475
- Published: 2022-11
- Journal:
- Impact factor: 0
- Authors: Haoran Qiu; Weichao Mao; Archit Patke; Chen Wang; H. Franke; Z. Kalbarczyk; T. Başar; R. Iyer
- Corresponding authors: Haoran Qiu; Weichao Mao; Archit Patke; Chen Wang; H. Franke; Z. Kalbarczyk; T. Başar; R. Iyer
Other publications by Ravishankar Iyer
717 MICROBIOTA IN STOOL ARE SUPERIOR TO SALIVA IN DIFFERENTIATING CIRRHOSIS AND HEPATIC ENCEPHALOPATHY USING ARTIFICIAL INTELLIGENCE APPROACHES
- DOI: 10.1016/s0016-5085(21)02616-0
- Published: 2021-05-01
- Journal:
- Impact factor:
- Authors: Krishnakant Saboo; Nikita Petrakov; Andrew Fagan; Masoumeh Sikaroodi; Chathur Acharya; Sara Mcgeorge; Patrick M. Gillevet; Ravishankar Iyer; Jasmohan S. Bajaj
- Corresponding author: Jasmohan S. Bajaj
458 ARTIFICIAL INTELLIGENCE TECHNIQUES DEMONSTRATE BETTER PREDICTION FOR 90-DAY READMISSION AND DEATH IN WOMEN THAN MEN WITH CIRRHOSIS.
- DOI: 10.1016/s0016-5085(20)33859-2
- Published: 2020-05-01
- Journal:
- Impact factor:
- Authors: Krishnakant Saboo; Chang Hu; K. Rajender Reddy; Jacqueline G. O'Leary; Puneeta Tandon; Florence Wong; Guadalupe Garcia-Tsao; Patrick S. Kamath; Jennifer C. Lai; Scott W. Biggins; Michael B. Fallon; Paul J. Thuluvath; Ram Subramanian; Benedict Maliakkal; Hugo E. Vargas; Leroy Thacker; Ravishankar Iyer; Jasmohan S. Bajaj
- Corresponding author: Jasmohan S. Bajaj
Other grants held by Ravishankar Iyer
CPS: Breakthrough: Towards Resiliency in Cyber-physical Systems for Robot-assisted Surgery
- Award number: 1545069
- Fiscal year: 2016
- Amount: $250,000
- Award type: Standard Grant
I/UCRC: Phase I: Computing and Genomics-An Essential Partnership for Biology Breakthroughs (CCBGM)
- Award number: 1624790
- Fiscal year: 2016
- Amount: $250,000
- Award type: Continuing Grant
CI: NEW: Collaborative Research: Computer System Failure Data Repository to Enable Data-Driven Dependability
- Award number: 1513051
- Fiscal year: 2015
- Amount: $250,000
- Award type: Standard Grant
I/UCRC Planning Grant: Computing and Genomics - An Essential Partnership for Biology Breakthroughs
- Award number: 1439719
- Fiscal year: 2014
- Amount: $250,000
- Award type: Standard Grant
Adaptive Software Implemented Fault-Tolerance for Networked Systems
- Award number: 9902026
- Fiscal year: 1999
- Amount: $250,000
- Award type: Standard Grant
Academic Research Infrastructure: Acquisition of Research Equipment for High-Speed Computing and Networking Initiative
- Award number: 9601631
- Fiscal year: 1996
- Amount: $250,000
- Award type: Standard Grant
Engineering Research Equipment Grant: Investigation of LISP Machine Architecture Reliability and Performance
- Award number: 8604893
- Fiscal year: 1986
- Amount: $250,000
- Award type: Standard Grant
Similar overseas grants
Planning Grant: Developing capacity to attract diverse students to the geosciences: A public relations framework
- Award number: 2326816
- Fiscal year: 2024
- Amount: $250,000
- Award type: Standard Grant
Planning: Advancing Discovery on a Sustainable National Research Enterprise
- Award number: 2412406
- Fiscal year: 2024
- Amount: $250,000
- Award type: Standard Grant
Planning: Artificial Intelligence Assisted High-Performance Parallel Computing for Power System Optimization
- Award number: 2414141
- Fiscal year: 2024
- Amount: $250,000
- Award type: Standard Grant
Planning: FIRE-PLAN: Exploring fire as medicine to revitalize cultural burning in the Upper Midwest
- Award number: 2349282
- Fiscal year: 2024
- Amount: $250,000
- Award type: Standard Grant
CC* Planning: Strengthening Central Michigan University's Cyberinfrastructure
- Award number: 2345749
- Fiscal year: 2024
- Amount: $250,000
- Award type: Standard Grant
Planning: FIRE-PLAN: Building Wildland Fire Science Capacity in Alaska Through The University of Alaska Fairbanks Rural Campuses
- Award number: 2333423
- Fiscal year: 2024
- Amount: $250,000
- Award type: Standard Grant
Collaborative Research: Planning: FIRE-PLAN: High-Spatiotemporal-Resolution Sensing and Digital Twin to Advance Wildland Fire Science
- Award number: 2335568
- Fiscal year: 2024
- Amount: $250,000
- Award type: Standard Grant
Collaborative Research: Planning: FIRE-PLAN: High-Spatiotemporal-Resolution Sensing and Digital Twin to Advance Wildland Fire Science
- Award number: 2335569
- Fiscal year: 2024
- Amount: $250,000
- Award type: Standard Grant
CAREER: Statistical Power Analysis and Optimal Sample Size Planning for Longitudinal Studies in STEM Education
- Award number: 2339353
- Fiscal year: 2024
- Amount: $250,000
- Award type: Continuing Grant
HoloSurge: Multimodal 3D Holographic tool and real-time Guidance System with point-of-care diagnostics for surgical planning and interventions on liver and pancreatic cancers
- Award number: 10103131
- Fiscal year: 2024
- Amount: $250,000
- Award type: EU-Funded