Collaborative Research: SHF: Small: Learning Fault Tolerance at Scale
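Basic information
- Award number: 2135309
- Principal investigator:
- Amount: $300,000
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2022
- Funding country: United States
- Project period: 2022-01-01 to 2024-12-31
- Project status: Completed
- Source:
- Keywords:
Project abstract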
In computer-aided design and analysis of engineered systems such as automobiles or semiconductor chips, computational models are simulated on high-performance computers to characterize and evaluate key attributes. The sheer scale of such high-performance computing systems, e.g., over 20 billion transistors in Summit (one of the world's fastest supercomputers), increases the likelihood of transient hardware faults from events such as cosmic radiation or processor-chip voltage fluctuations. The likelihood and negative impact of such errors are further increased because these simulations are typically long-running, and the corruption of a single data field or variable may require weeks to months of recomputation before critical decisions can be made. This project will develop automated approaches that make such applications tolerant of hardware faults. These applications are widely used across multiple industrial sectors and also serve to increase the predictive power of climate or weather models that aid critical decision making. Traditional fault-tolerant schemes are either application-specific, requiring significant programmer effort to redesign or customize large-scale software, or application-agnostic, where all or most data are redundantly stored periodically to allow for recovery, which limits scalability because of significant memory and processing overheads. This project seeks to address these limitations by providing a theoretical foundation for a new class of fault-tolerant schemes suitable for the broad array of applications based on iterative numerical simulations that evolve over time on discretized spatial domains.
This project is based on the premise that in such physics-based applications, the rate of change of the solution vector components across time steps (iterations) and spatial domains is a key metric for automatically identifying the critical computational variables, monitoring their evolution, and dynamically selecting the type of safeguarding techniques that should be applied. The investigators will pursue three key directions: (i) characterizing the intrinsic resiliency of the application by developing resiliency gradient metrics, (ii) developing and testing fault-tolerance schemes that adapt the level and type of protection to the resiliency gradient with the goal of reducing computational overheads and increasing scalability, and (iii) constructing an automatic online decision-based learning framework for adaptively selecting fault-tolerance methods in relation to the system's ability to use approximate computing and co-scheduling techniques. The investigators will also work closely with application and runtime system developers to seek broader use of this fault tolerance framework, develop specialized undergraduate and graduate curriculum for student training, and offer research experiences to high school students. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
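The premise about rate of change can be illustrated with a small sketch. It is not the investigators' method: the function names, the `fraction` parameter, and the rule that the fastest-changing components are the ones to protect are all assumptions made here for illustration, since the abstract does not define the resiliency gradient metric.

```python
import math
from typing import List, Sequence


def rate_of_change(prev: Sequence[float], curr: Sequence[float]) -> List[float]:
    """Absolute per-component change between two consecutive iterates."""
    if len(prev) != len(curr):
        raise ValueError("iterates must have the same length")
    return [abs(c - p) for p, c in zip(prev, curr)]


def select_protected(prev: Sequence[float],
                     curr: Sequence[float],
                     fraction: float = 0.25) -> List[int]:
    """Return indices of the components chosen for protection.

    Hypothetical rule (an assumption, not taken from the abstract): protect
    the `fraction` of components whose values changed most in the last step.
    """
    changes = rate_of_change(prev, curr)
    k = max(1, math.ceil(fraction * len(changes)))
    # Rank component indices from largest to smallest change.
    ranked = sorted(range(len(changes)), key=lambda i: changes[i], reverse=True)
    return sorted(ranked[:k])


# Two consecutive iterates of a four-component solution vector.
x_prev = [1.0, 2.0, 3.0, 4.0]
x_curr = [1.0, 2.5, 3.1, 2.0]
print(select_protected(x_prev, x_curr, fraction=0.5))  # → [1, 3]
```

In a real solver the selection would be recomputed as the iteration proceeds, so the protected set can shrink as components settle, which is where the savings over protecting all data would come from.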
Project outcomes
Journal articles (2)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Online Scheduling of Moldable Task Graphs under Common Speedup Models
- DOI: 10.1145/3545008.3545049
- Publication date: 2022-08
- Journal:
- Impact factor: 0
- Authors: A. Benoit; L. Perotin; Y. Robert; Hongyang Sun
- Corresponding authors: A. Benoit; L. Perotin; Y. Robert; Hongyang Sun
Dynamic Selective Protection of Sparse Iterative Solvers via ML Prediction of Soft Error Impacts
- DOI: 10.1145/3624062.3624117
- Publication date: 2023
- Journal:
- Impact factor: 0
- Authors: Chen, Zizhao; Verrecchia, Thomas; Sun, Hongyang; Booth, Joshua; Raghavan, Padma
- Corresponding author: Raghavan, Padma
Other publications by Padma Raghavan
Multi-resource scheduling of moldable workflows
- DOI:
- Publication date: 2023
- Journal:
- Impact factor: 0
- Authors: L. Perotin; Sandhya Kandaswamy; Hongyang Sun; Padma Raghavan
- Corresponding author: Padma Raghavan
Other grants by Padma Raghavan
NSF I-Corps Hub (Track 1): Mid-South Region
- Award number: 2229521
- Fiscal year: 2023
- Amount: $300,000
- Project type: Cooperative Agreement
SHF: Small: Embedded Graph Software-Hardware Models and Maps for Scalable Sparse Computations
- Award number: 1719674
- Fiscal year: 2016
- Amount: $300,000
- Project type: Standard Grant
SHF: Small: Embedded Graph Software-Hardware Models and Maps for Scalable Sparse Computations
- Award number: 1319448
- Fiscal year: 2013
- Amount: $300,000
- Project type: Standard Grant
DC: Small: Adaptive Sparse Data Mining On Multicores
- Award number: 1017882
- Fiscal year: 2010
- Amount: $300,000
- Project type: Standard Grant
Toward a Linear Time Sparse Solver with Locality-Enhanced Scalable Parallelism
- Award number: 0830679
- Fiscal year: 2008
- Amount: $300,000
- Project type: Standard Grant
MRI: Acquistion of A Scalable Instrument for Discovery through Computing
- Award number: 0821527
- Fiscal year: 2008
- Amount: $300,000
- Project type: Standard Grant
CSR-SMA: Toward Model-Driven Multilevel Analysis and Optimization of Multicomponent Computer Systems
- Award number: 0720749
- Fiscal year: 2007
- Amount: $300,000
- Project type: Continuing Grant
Adaptive Software for Extreme-Scale Scientific Computing: Co-Managing Quality-Performance-Power Tradeoffs
- Award number: 0444345
- Fiscal year: 2004
- Amount: $300,000
- Project type: Standard Grant
Grant to Support Activities at the Eleventh SIAM Conference on Parallel Processing for Scientific Computing
- Award number: 0340869
- Fiscal year: 2003
- Amount: $300,000
- Project type: Standard Grant
Robust Limited Memory Hybrid Sparse Solvers
- Award number: 0102537
- Fiscal year: 2001
- Amount: $300,000
- Project type: Continuing Grant
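Similar NSFC (National Natural Science Foundation of China) grants
Research on Quantum Field Theory without a Lagrangian Description
- Award number: 24ZR1403900
- Approval year: 2024
- Amount: CNY 0
- Project type: Provincial/municipal-level project
Cell Research
- Award number: 31224802
- Approval year: 2012
- Amount: CNY 240,000
- Project type: Special fund project
Cell Research
- Award number: 31024804
- Approval year: 2010
- Amount: CNY 240,000
- Project type: Special fund project
Cell Research (细胞研究)
- Award number: 30824808
- Approval year: 2008
- Amount: CNY 240,000
- Project type: Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
- Award number: 10774081
- Approval year: 2007
- Amount: CNY 450,000
- Project type: General Program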
Similar overseas grants
Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
- Award number: 2403134
- Fiscal year: 2024
- Amount: $300,000
- Project type: Standard Grant
Collaborative Research: SHF: Small: LEGAS: Learning Evolving Graphs At Scale
- Award number: 2331302
- Fiscal year: 2024
- Amount: $300,000
- Project type: Standard Grant
Collaborative Research: SHF: Small: LEGAS: Learning Evolving Graphs At Scale
- Award number: 2331301
- Fiscal year: 2024
- Amount: $300,000
- Project type: Standard Grant
Collaborative Research: SHF: Small: Efficient and Scalable Privacy-Preserving Neural Network Inference based on Ciphertext-Ciphertext Fully Homomorphic Encryption
- Award number: 2412357
- Fiscal year: 2024
- Amount: $300,000
- Project type: Standard Grant
Collaborative Research: SHF: Medium: Enabling Graphics Processing Unit Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
- Award number: 2402804
- Fiscal year: 2024
- Amount: $300,000
- Project type: Standard Grant
Collaborative Research: SHF: Medium: Tiny Chiplets for Big AI: A Reconfigurable-On-Package System
- Award number: 2403408
- Fiscal year: 2024
- Amount: $300,000
- Project type: Standard Grant
Collaborative Research: SHF: Medium: Toward Understandability and Interpretability for Neural Language Models of Source Code
- Award number: 2423813
- Fiscal year: 2024
- Amount: $300,000
- Project type: Standard Grant
Collaborative Research: SHF: Medium: Enabling GPU Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
- Award number: 2402806
- Fiscal year: 2024
- Amount: $300,000
- Project type: Standard Grant
Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
- Award number: 2403135
- Fiscal year: 2024
- Amount: $300,000
- Project type: Standard Grant
Collaborative Research: SHF: Medium: Tiny Chiplets for Big AI: A Reconfigurable-On-Package System
- Award number: 2403409
- Fiscal year: 2024
- Amount: $300,000
- Project type: Standard Grant