Collaborative Research: SHF: Small: Learning Fault Tolerance at Scale
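Basic information
- Award number: 2135310
- Principal investigator:
- Amount: $199,600
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2022
- Funding country: United States
- Period: 2022-01-01 to 2024-12-31
- Project status: Completed
- Source:
- Keywords:
Project abstract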
In computer-aided design and analysis of engineered systems such as automobiles or semiconductor chips, computational models are simulated on high-performance computers to characterize and evaluate key attributes. The sheer scale of such high-performance computing systems, e.g., over 20 billion transistors in Summit (one of the world's fastest supercomputers), increases the likelihood of transient hardware faults from events such as cosmic radiation or processor-chip voltage fluctuations. The likelihood of such errors and their negative impact increase further because such simulations are typically long-running, and the corruption of a single data field or variable may require weeks to months of recomputation before critical decisions can be made. This project will develop automated approaches that make such applications tolerant of hardware faults. These applications are widely used not only across multiple industrial sectors but also to increase the predictive power of climate or weather models in support of critical decision making. Traditional fault-tolerant schemes can be either application-specific, requiring significant programmer effort to redesign or customize large-scale software, or application-agnostic, where all or most data are redundantly stored periodically to allow for recovery, which limits scalability because of the significant memory and processing overheads. This project seeks to address these limitations by providing a theoretical foundation for a new class of fault-tolerant schemes that are suitable for the broad array of applications based on iterative numerical simulations that evolve over time on discretized spatial domains.
This project is based on the premise that in such physics-based applications, the rate of change of the solution vector components across time steps (iterations) and spatial domains is a key metric for automatically identifying the critical computational variables, monitoring their evolution, and dynamically selecting the type of safeguarding techniques that should be applied. The investigators will pursue three key directions: (i) characterizing the intrinsic resiliency of the application by developing resiliency gradient metrics, (ii) developing and testing fault-tolerance schemes that adapt the level and type of protection to the resiliency gradient with the goal of reducing computational overheads and increasing scalability, and (iii) constructing an automatic online decision-based learning framework for adaptively selecting fault-tolerance methods in relation to the system's ability to use approximate computing and co-scheduling techniques. The investigators will also work closely with application and runtime system developers to seek broader use of this fault tolerance framework, develop specialized undergraduate and graduate curriculum for student training, and offer research experiences to high school students. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
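The premise above, that the rate of change of solution vector components can drive the choice of safeguarding technique, can be illustrated with a small sketch. This is not the investigators' method: the function names, the three protection labels, the thresholds, and the choice to give faster-changing components stronger protection are all assumptions made for illustration, since the abstract specifies none of them.

```python
from typing import List


def rate_of_change(prev: List[float], curr: List[float], dt: float) -> List[float]:
    """Per-component absolute rate of change between two consecutive time steps."""
    return [abs(c - p) / dt for p, c in zip(prev, curr)]


def select_protection(rates: List[float], low: float, high: float) -> List[str]:
    """Map each component's rate of change to a hypothetical protection level.

    Assumption (not from the abstract): faster-changing components get
    stronger, costlier protection. The labels are illustrative only.
    """
    levels = []
    for r in rates:
        if r >= high:
            levels.append("replicate")  # strongest protection in this sketch
        elif r >= low:
            levels.append("checksum")   # cheaper detection-only protection
        else:
            levels.append("none")       # slowly changing: left unprotected
    return levels


# Two consecutive solution vectors from a hypothetical time-stepping solver.
prev_step = [1.0, 2.0, 3.0]
curr_step = [1.0, 2.5, 7.0]

rates = rate_of_change(prev_step, curr_step, dt=0.5)
print(rates)                                         # [0.0, 1.0, 8.0]
print(select_protection(rates, low=0.5, high=4.0))   # ['none', 'checksum', 'replicate']
```

In an adaptive scheme of the kind the abstract describes, such a selection would presumably be re-evaluated as the simulation evolves rather than fixed once; how often, and with what thresholds, is left open here.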
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Joshua Booth
Nerve fibre organisation in the human optic nerve and chiasm: what do we really know?
- DOI: 10.1038/s41433-024-03137-7
- Publication date: 2024-06-07
- Journal:
- Impact factor: 3.200
- Authors: Pratap R. Pawar; Joshua Booth; Andrew Neely; Gawn McIlwaine; Christian J. Lueck
- Corresponding author: Christian J. Lueck
Other grants by Joshua Booth
CAREER: Fast, Energy Efficient Irregular Kernels via Neural Acceleration
- Award number: 2044633
- Fiscal year: 2021
- Funding amount: $199,600
- Project type: Continuing Grant
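Similar NSFC grants
Research on Quantum Field Theory without a Lagrangian Description
- Award number: 24ZR1403900
- Approval year: 2024
- Funding amount: CNY 0
- Project type: Provincial or municipal project
Cell Research
- Award number: 31224802
- Approval year: 2012
- Funding amount: CNY 240,000
- Project type: Special fund project
Cell Research
- Award number: 31024804
- Approval year: 2010
- Funding amount: CNY 240,000
- Project type: Special fund project
Cell Research
- Award number: 30824808
- Approval year: 2008
- Funding amount: CNY 240,000
- Project type: Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
- Award number: 10774081
- Approval year: 2007
- Funding amount: CNY 450,000
- Project type: General Program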
Similar overseas grants
Collaborative Research: SHF: Small: LEGAS: Learning Evolving Graphs At Scale
- Award number: 2331302
- Fiscal year: 2024
- Funding amount: $199,600
- Project type: Standard Grant
Collaborative Research: SHF: Small: LEGAS: Learning Evolving Graphs At Scale
- Award number: 2331301
- Fiscal year: 2024
- Funding amount: $199,600
- Project type: Standard Grant
Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
- Award number: 2403134
- Fiscal year: 2024
- Funding amount: $199,600
- Project type: Standard Grant
Collaborative Research: SHF: Small: Efficient and Scalable Privacy-Preserving Neural Network Inference based on Ciphertext-Ciphertext Fully Homomorphic Encryption
- Award number: 2412357
- Fiscal year: 2024
- Funding amount: $199,600
- Project type: Standard Grant
Collaborative Research: SHF: Medium: Enabling Graphics Processing Unit Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
- Award number: 2402804
- Fiscal year: 2024
- Funding amount: $199,600
- Project type: Standard Grant
Collaborative Research: SHF: Medium: Tiny Chiplets for Big AI: A Reconfigurable-On-Package System
- Award number: 2403408
- Fiscal year: 2024
- Funding amount: $199,600
- Project type: Standard Grant
Collaborative Research: SHF: Medium: Toward Understandability and Interpretability for Neural Language Models of Source Code
- Award number: 2423813
- Fiscal year: 2024
- Funding amount: $199,600
- Project type: Standard Grant
Collaborative Research: SHF: Medium: Enabling GPU Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
- Award number: 2402806
- Fiscal year: 2024
- Funding amount: $199,600
- Project type: Standard Grant
Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
- Award number: 2403135
- Fiscal year: 2024
- Funding amount: $199,600
- Project type: Standard Grant
Collaborative Research: SHF: Medium: Tiny Chiplets for Big AI: A Reconfigurable-On-Package System
- Award number: 2403409
- Fiscal year: 2024
- Funding amount: $199,600
- Project type: Standard Grant