SHF: Medium: Collaborative Research: Toward Extreme Scale Fault-Tolerance: Exploration Methods, Comparative Studies and Decision Processes
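Basic information
- Award number: 1563744
- Principal investigator:
- Amount: approx. $530,000
- Host institution:
- Host institution country: United States
- Award type: Standard Grant
- Fiscal year: 2016
- Funding country: United States
- Start and end dates: 2016-08-01 to 2021-07-31
- Project status: Completed
- Source:
- Keywords:
Project abstract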
Current high-performance computing (HPC) research targets computer systems with exaflop (10^18, or a quintillion, floating point operations per second) capabilities. Such computational power will enable new, important discoveries across all basic science domains. Application resilience to computer faults and failures is a major challenge to the realization of extreme scale computing systems. This project, Simulation and Modeling for Understanding Resilience and Faults at Scale (SMURFS), addresses this challenge by developing methods to improve our predictive understanding of the complex interactions amongst a given application, a given real or hypothetical hardware and software system environment, and a given fault-tolerance strategy at extreme scale. Specifically, SMURFS develops:
1. New simulation and modeling capabilities for studying application resilience at scale;
2. Capabilities to execute a comprehensive set of comparative fault-tolerance studies; and
3. Effective prescriptions to guide application developers, hardware architects and system designers to realize efficient, resilient extreme scale capabilities.
SMURFS explores the impact of faults and failures, fault mitigation strategies and emerging technologies by providing new analytical and component models for predicting fault-tolerant application behavior at scale. The Iron simulation framework integrates these models for validation and comprehensive performance studies over a wide range of representative applications, application proxies, fault-tolerance protocols and hardware configurations. These studies inform a rule-based system for prescribing best fault-tolerance practices and configurations for new candidate applications and scenarios.
SMURFS renders (1) new simulation and analytical models that predict application performance at scale; (2) detailed understandings of how application features interplay with different fault-tolerance strategies and hardware technologies; (3) new knowledge about application behavior at scale; and (4) valuable insight and prescriptions for designing, developing and deploying future extreme scale HPC systems.
More broadly, artifacts like the Iron framework and the public suite of application traces will be valuable to the HPC research, engineering, development, procurement and administrative communities. Researchers can use these artifacts for their own research that can impact the HPC exploration and design space. For example, this framework can be instrumental in the co-design of cohesive extreme scale applications, software environments and hardware platforms. Additionally, Iron-based research can inform and improve scientific computing practices, accelerating the rate of scientific discovery. Finally, Iron will be useful as an instructional device to teach about HPC issues both in classroom and tutorial contexts and in other programs that engage diverse populations of middle, high school and college students in New Mexico and Tennessee.
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Thomas Herault
An evaluation of User-Level Failure Mitigation support in MPI
- DOI: 10.1007/s00607-013-0331-3
- Published: 2013-05-29
- Journal:
- Impact factor: 2.800
- Authors: Wesley Bland; Aurelien Bouteiller; Thomas Herault; Joshua Hursey; George Bosilca; Jack J. Dongarra
- Corresponding author: Jack J. Dongarra
Physicians, pharmacists and take-home naloxone: What practices? The SINFONI study
- DOI: 10.1016/j.therap.2024.07.001
- Published: 2024-11-01
- Journal:
- Impact factor:
- Authors: Mélanie Duval; Aurélie Aquizerate; Emmanuelle Jaulin; Morgane Rousselet; Emmanuelle Kuhn; Alain Guilleminot; Isabelle Nicolleau; Solen Pele; Thomas Herault; Pascal Artarit; Eleni Soulidou-Jacques; Edouard-Jules Laforgue; Caroline Victorri-Vigneau
- Corresponding author: Caroline Victorri-Vigneau
A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?
- DOI: 10.1016/j.future.2024.07.022
- Published: 2024-12-01
- Journal:
- Impact factor:
- Authors: Leonardo Bautista-Gomez; Anne Benoit; Sheng Di; Thomas Herault; Yves Robert; Hongyang Sun
- Corresponding author: Hongyang Sun
Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI Protocols
- DOI: 10.1016/j.future.2007.02.002
- Published: 2008-01-01
- Journal:
- Impact factor:
- Authors: Darius Buntinas; Camille Coti; Thomas Herault; Pierre Lemarinier; Laurence Pilard; Ala Rezmerita; Eric Rodriguez; Franck Cappello
- Corresponding author: Franck Cappello
Similar overseas grants
Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
- Award number: 2403134
- Fiscal year: 2024
- Funding amount: approx. $530,000
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Enabling Graphics Processing Unit Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
- Award number: 2402804
- Fiscal year: 2024
- Funding amount: approx. $530,000
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Tiny Chiplets for Big AI: A Reconfigurable-On-Package System
- Award number: 2403408
- Fiscal year: 2024
- Funding amount: approx. $530,000
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Toward Understandability and Interpretability for Neural Language Models of Source Code
- Award number: 2423813
- Fiscal year: 2024
- Funding amount: approx. $530,000
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Enabling GPU Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
- Award number: 2402806
- Fiscal year: 2024
- Funding amount: approx. $530,000
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
- Award number: 2403135
- Fiscal year: 2024
- Funding amount: approx. $530,000
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Tiny Chiplets for Big AI: A Reconfigurable-On-Package System
- Award number: 2403409
- Fiscal year: 2024
- Funding amount: approx. $530,000
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Enabling GPU Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
- Award number: 2402805
- Fiscal year: 2024
- Funding amount: approx. $530,000
- Award type: Standard Grant
Collaborative Research: SHF: Medium: High-Performance, Verified Accelerator Programming
- Award number: 2313024
- Fiscal year: 2023
- Funding amount: approx. $530,000
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Verifying Deep Neural Networks with Spintronic Probabilistic Computers
- Award number: 2311295
- Fiscal year: 2023
- Funding amount: approx. $530,000
- Award type: Continuing Grant