SPX: Collaborative Research: Cross-layer Application-Aware Resilience at Extreme Scale (CAARES)
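Basic Information
- Award number: 1725692
- Principal investigator:
- Amount: $266,100
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2017
- Funding country: United States
- Duration: 2017-08-15 to 2020-07-31
- Project status: Completed
- Source:
- Keywords:
Project Abstract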
The increasing demands of science and engineering applications push the limits of current large-scale systems, which are expected to achieve exascale (10^18 FLOPS) performance early in the next decade. One of the lesser-studied challenges at extreme scales is the reliability of the computing system itself, primarily due to the very large number of cores and components utilized and to the sharp decrease in the Mean Time Between Failures on such systems (on the order of tens of minutes). This project departs from the traditional single-component fault management model and explores how multiple software libraries (and application components) used in the context of a single parallel application can interact to provide the holistic fault management support necessary for parallel applications targeting capability computing. This exploration will not be limited to software developed using a single parallel programming paradigm, but will be extended to encompass the more challenging case where multiple programming paradigms can be combined to achieve a common goal, to simulate a set of large-scale scientific applications in use today. The goal of this project is to depart from the current siloed resilience mechanisms and propose cross-layer composition solutions that can fundamentally address these resilience challenges at extreme scales.
More specifically, this proposal will address the following research challenges: (1) development of a theoretical foundation for a deeper understanding of the challenges and opportunities arising from combining different resilience models and methodologies; (2) design of a flexible programming abstraction that allows different resilience models and mechanisms to be combined and to cooperate in addressing resilience in a more holistic manner; and (3) development of the basic, programming-paradigm-independent constructs necessary to implement cross-layer and domain-specific approaches to resilience and to understand the related performance/quality trade-offs. The proposed approach will be validated by exposing these generic abstractions in two different programming paradigms (MPI and OpenSHMEM), creating and developing specialized concepts for each of them. This will enable the assessment of the validity of the concepts and of the corresponding overheads imposed by the different software layers, using a few software frameworks and applications.
Project Outcomes
Journal articles (5)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Do Moldable Applications Perform Better on Failure-Prone HPC Platforms?
- DOI: 10.1007/978-3-030-10549-5_61
- Published: 2018-08
- Journal:
- Impact factor: 0
- Authors: Valentin Le Fèvre; G. Bosilca; A. Bouteiller; T. Hérault; A. Hori; Y. Robert; J. Dongarra
- Corresponding authors: Valentin Le Fèvre; G. Bosilca; A. Bouteiller; T. Hérault; A. Hori; Y. Robert; J. Dongarra
Overhead of Using Spare Nodes
- DOI: 10.1177/1094342020901885
- Published: 2020
- Journal:
- Impact factor: 0
- Authors: Hori, A.; Yoshinaga, K.; Herault, T.; Bouteiller, A.; Bosilca, G.; Ishikawa, Y.
- Corresponding author: Ishikawa, Y.
Local rollback for resilient MPI applications with application-level checkpointing and message logging
- DOI: 10.1016/j.future.2018.09.041
- Published: 2019-02
- Journal:
- Impact factor: 0
- Authors: Nuria Losada; G. Bosilca; A. Bouteiller; P. González; María J. Martín
- Corresponding authors: Nuria Losada; G. Bosilca; A. Bouteiller; P. González; María J. Martín
Asynchronous Receiver-Driven Replay for Local Rollback of MPI Applications
- DOI: 10.1109/ftxs49593.2019.00006
- Published: 2019-11
- Journal:
- Impact factor: 0
- Authors: Nuria Losada; A. Bouteiller; G. Bosilca
- Corresponding authors: Nuria Losada; A. Bouteiller; G. Bosilca
Fault tolerance of MPI applications in exascale systems: The ULFM solution
- DOI: 10.1016/j.future.2020.01.026
- Published: 2020-05
- Journal:
- Impact factor: 0
- Authors: Nuria Losada; P. González; María J. Martín; G. Bosilca; A. Bouteiller; K. Teranishi
- Corresponding authors: Nuria Losada; P. González; María J. Martín; G. Bosilca; A. Bouteiller; K. Teranishi
Other publications by George Bosilca
An evaluation of User-Level Failure Mitigation support in MPI
- DOI: 10.1007/s00607-013-0331-3
- Published: 2013-05-29
- Journal:
- Impact factor: 2.800
- Authors: Wesley Bland; Aurelien Bouteiller; Thomas Herault; Joshua Hursey; George Bosilca; Jack J. Dongarra
- Corresponding author: Jack J. Dongarra
Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors
- DOI:
- Published: 2023
- Journal:
- Impact factor: 2.7
- Authors: Sameer Deshmukh; Rio Yokota; George Bosilca
- Corresponding author: George Bosilca
Kernel-assisted and topology-aware MPI collective communications on multicore/many-core platforms
- DOI: 10.1016/j.jpdc.2013.01.015
- Published: 2013-07-01
- Journal:
- Impact factor:
- Authors: Teng Ma; George Bosilca; Aurelien Bouteiller; Jack J. Dongarra
- Corresponding author: Jack J. Dongarra
Self-healing network for scalable fault-tolerant runtime environments
- DOI: 10.1016/j.future.2009.04.001
- Published: 2010-03-01
- Journal:
- Impact factor:
- Authors: Thara Angskun; Graham Fagg; George Bosilca; Jelena Pješivac-Grbović; Jack Dongarra
- Corresponding author: Jack Dongarra
Other grants by George Bosilca
Collaborative Research: Frameworks: Production quality Ecosystem for Programming and Executing eXtreme-scale Applications (EPEXA)
- Award number: 1931384
- Fiscal year: 2019
- Award amount: $266,100
- Project type: Standard Grant
OAC Core: Small: Collaborative Research: Scalable Run-Time for Highly Parallel, Heterogeneous Systems
- Award number: 1909015
- Fiscal year: 2019
- Award amount: $266,100
- Project type: Standard Grant
Collaborative Research: SI2-SSI: EVOLVE: Enhancing the Open MPI Software for Next Generation Architectures and Applications
- Award number: 1664142
- Fiscal year: 2017
- Award amount: $266,100
- Project type: Standard Grant
Collaborative Research: SI2-SSI:Task-Based Environment for Scientific Simulation at Extreme Scale (TESSE)
- Award number: 1450300
- Fiscal year: 2015
- Award amount: $266,100
- Project type: Standard Grant
SI2-SSE: Collaborative Research: ADAPT: Next Generation Message Passing Interface (MPI) Library - Open MPI
- Award number: 1339820
- Fiscal year: 2013
- Award amount: $266,100
- Project type: Standard Grant
G8 Initiative: Collaborative Research: ECS: Enabling Climate Simulation at Extreme Scale
- Award number: 1063019
- Fiscal year: 2011
- Award amount: $266,100
- Project type: Continuing Grant
Collaborative: CSR-AES: System Support for Auto-tuning MPI Applications
- Award number: 0720678
- Fiscal year: 2007
- Award amount: $266,100
- Project type: Continuing Grant
Similar overseas grants
SPX: Collaborative Research: Automated Synthesis of Extreme-Scale Computing Systems Using Non-Volatile Memory
- Award number: 2408925
- Fiscal year: 2023
- Award amount: $266,100
- Project type: Standard Grant
SPX: Collaborative Research: Scalable Neural Network Paradigms to Address Variability in Emerging Device based Platforms for Large Scale Neuromorphic Computing
- Award number: 2401544
- Fiscal year: 2023
- Award amount: $266,100
- Project type: Standard Grant
SPX: Collaborative Research: Intelligent Communication Fabrics to Facilitate Extreme Scale Computing
- Award number: 2412182
- Fiscal year: 2023
- Award amount: $266,100
- Project type: Standard Grant
SPX: Collaborative Research: Cross-stack Memory Optimizations for Boosting I/O Performance of Deep Learning HPC Applications
- Award number: 2318628
- Fiscal year: 2022
- Award amount: $266,100
- Project type: Standard Grant
SPX: Collaborative Research: NG4S: A Next-generation Geo-distributed Scalable Stateful Stream Processing System
- Award number: 2202859
- Fiscal year: 2022
- Award amount: $266,100
- Project type: Standard Grant
SPX: Collaborative Research: FASTLEAP: FPGA based compact Deep Learning Platform
- Award number: 2333009
- Fiscal year: 2022
- Award amount: $266,100
- Project type: Standard Grant
SPX: Collaborative Research: Memory Fabric: Data Management for Large-scale Hybrid Memory Systems
- Award number: 2132049
- Fiscal year: 2021
- Award amount: $266,100
- Project type: Standard Grant
SPX: Collaborative Research: Automated Synthesis of Extreme-Scale Computing Systems Using Non-Volatile Memory
- Award number: 2113307
- Fiscal year: 2020
- Award amount: $266,100
- Project type: Standard Grant
SPX: Collaborative Research: FASTLEAP: FPGA based compact Deep Learning Platform
- Award number: 1919117
- Fiscal year: 2019
- Award amount: $266,100
- Project type: Standard Grant
SPX: Collaborative Research: Intelligent Communication Fabrics to Facilitate Extreme Scale Computing
- Award number: 1918987
- Fiscal year: 2019
- Award amount: $266,100
- Project type: Standard Grant