Collaborative Research: SHF: Small: Learning Fault Tolerance at Scale

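Basic information

Award number:
2135310
Principal investigator:
Joshua Booth
Amount:
approx. $199,600
Host institution:
University of Alabama in Huntsville
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2022
Funding country:
United States
Start and end dates:
2022-01-01 to 2024-12-31
Project status:
Completed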

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2135310&HistoricalAwards=false
Keywords:
Collaborative Research SHF Small Learning

Project abstract

In computer-aided design and analysis of engineered systems such as automobiles or semiconductor chips, computational models are simulated on high-performance computers to characterize and evaluate key attributes. The sheer scale of such high-performance computing systems, e.g., over 20 billion transistors in Summit (one of the world's fastest supercomputers), increases the likelihood of transient hardware faults from events such as cosmic radiation or processor-chip voltage fluctuations. The likelihood and negative impact of such errors are further increased because these simulations are typically long-running, and the corruption of a single data field or variable may require weeks to months of recomputation before critical decisions can be made. This project will develop automated approaches that provide tolerance to hardware faults for such applications, which are widely used not only across multiple industrial sectors but also to increase the predictive power of climate or weather models to aid critical decision making. Traditional fault-tolerant schemes can be either application-specific, requiring significant programmer effort to redesign or customize large-scale software, or application-agnostic, where all or most data are redundantly stored periodically to allow for recovery, which limits scalability because of the significant memory and processing overheads. This project seeks to address these limitations by providing a theoretical foundation for a new class of fault-tolerant schemes suitable for the broad array of applications based on iterative numerical simulations that evolve over time on discretized spatial domains.
This project is based on the premise that in such physics-based applications, the rate of change of the solution vector components across time steps (iterations) and spatial domains is a key metric for automatically identifying the critical computational variables, monitoring their evolution, and dynamically selecting the type of safeguarding techniques that should be applied. The investigators will pursue three key directions: (i) characterizing the intrinsic resiliency of the application by developing resiliency gradient metrics, (ii) developing and testing fault-tolerance schemes that adapt the level and type of protection to the resiliency gradient with the goal of reducing computational overheads and increasing scalability, and (iii) constructing an automatic online decision-based learning framework for adaptively selecting fault-tolerance methods in relation to the system's ability to use approximate computing and co-scheduling techniques. The investigators will also work closely with application and runtime system developers to seek broader use of this fault-tolerance framework, develop specialized undergraduate and graduate curricula for student training, and offer research experiences to high school students. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
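The premise above, that the per-component rate of change across iterations can guide which variables to safeguard, can be illustrated with a minimal sketch. The abstract does not define the resiliency gradient metrics, so everything here is an assumption made for illustration: the 1D heat-diffusion toy problem, the hypothetical helper names `step`, `change_rate`, `select_protected`, and `simulate`, and the policy of treating the fastest-changing components as the critical ones. It is not the investigators' method.

```python
def step(u, alpha=0.1):
    """One explicit finite-difference step of 1D heat diffusion (fixed ends)."""
    nxt = u[:]
    for i in range(1, len(u) - 1):
        nxt[i] = u[i] + alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return nxt


def change_rate(prev, curr):
    """Per-component absolute change between two consecutive iterations."""
    return [abs(c - p) for p, c in zip(prev, curr)]


def select_protected(rates, fraction=0.25):
    """Indices of the fastest-changing components.

    Assumed policy for illustration only: the source does not say whether
    fast- or slow-changing components deserve stronger protection.
    """
    k = max(1, int(len(rates) * fraction))
    ranked = sorted(range(len(rates)), key=lambda i: rates[i], reverse=True)
    return sorted(ranked[:k])


def simulate(n=16, steps=5):
    """Run the toy simulation and re-select the protected set every step."""
    u = [0.0] * n
    u[n // 2] = 1.0  # initial heat spike in the middle of the domain
    protected = []
    for _ in range(steps):
        nxt = step(u)
        protected = select_protected(change_rate(u, nxt))
        u = nxt
    return u, protected


if __name__ == "__main__":
    final_state, protected = simulate()
    print("components selected for protection:", protected)
```

In this sketch only the selected subset would be checkpointed or replicated at each step, rather than the full state, which is the kind of overhead reduction the abstract is aiming at. A real scheme would also need to account for variation across spatial domains and for the cost of each safeguarding technique.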