SPX: Collaborative Research: Cross-layer Application-Aware Resilience at Extreme Scale (CAARES)

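Basic Information

  • Award number:
    1725499
  • Principal investigator:
  • Amount:
    $266,700
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Standard Grant
  • Fiscal year:
    2017
  • Funding country:
    United States
  • Project period:
    2017-08-15 to 2021-07-31
  • Project status:
    Completed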

Project Abstract

The increasing demands of science and engineering applications push the limits of current large-scale systems, which are expected to achieve exascale (10^18 FLOPS) performance early in the next decade. One of the less-studied challenges at extreme scales is the reliability of the computing system itself, primarily due to the very large number of cores and components utilized and to the sharp decrease in the Mean Time Between Failures on such systems (on the order of tens of minutes). This project departs from the traditional single-component fault management model and explores how multiple software libraries (and application components) used in the context of a single parallel application can interact to provide the holistic fault management support necessary for parallel applications targeting capability computing. This exploration will not be limited to software developed using a single parallel programming paradigm, but will be extended to encompass the more challenging case where multiple programming paradigms can be combined to achieve a common goal, to simulate a set of large-scale scientific applications in use today. The goal of this project is to depart from the current siloed resilience mechanisms and propose cross-layer composition solutions that can fundamentally address these resilience challenges at extreme scales.
More specifically, this proposal will address the following research challenges: (1) development of a theoretical foundation for a deeper understanding of the challenges and opportunities arising from combining different resilience models and methodologies; (2) design of a flexible programming abstraction that allows different resilience models and mechanisms to be combined so that they cooperate and address resilience in a more holistic manner; and (3) development of the basic, programming-paradigm-independent constructs necessary to implement cross-layer and domain-specific approaches to support resilience and to understand the related performance/quality trade-offs. The proposed approach will be validated by exposing these generic abstractions in two different programming paradigms (MPI and OpenSHMEM), and by creating and developing specialized concepts for each of these paradigms. This will enable the assessment of the validity of the concepts and of the corresponding overheads imposed by the different software layers, using a few software frameworks and applications.

Project Outcomes

Journal articles (2)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
OpenSHMEM Checker - A Clang Based Static Checker for OpenSHMEM
Checkpointing OpenSHMEM Programs Using Compiler Analysis

Other publications by Barbara Chapman

Maximizing Parallelism and GPU Utilization For Direct GPU Compilation Through Ensemble Execution
  • DOI:
  • Publication date:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Shilei Tian;Barbara Chapman;Johannes Doerfert
  • Corresponding author:
    Johannes Doerfert
Performance Evaluation of a Multi-Zone Application in Different OpenMP Approaches
  • DOI:
    10.1007/s10766-008-0074-5
  • Publication date:
    2008-04-29
  • Journal:
  • Impact factor:
    0.900
  • Authors:
    Haoqiang Jin;Barbara Chapman;Lei Huang;Dieter an Mey;Thomas Reichstein
  • Corresponding author:
    Thomas Reichstein
Feasibility Study of Interventions to Reduce Medication Omissions Without Documentation: Recall and Check Study
  • DOI:
    10.1097/ncq.0000000000000229
  • Publication date:
    2017
  • Journal:
  • Impact factor:
    1.2
  • Authors:
    Maree Johnson;P. Sanchez;Catherine Zheng;Barbara Chapman
  • Corresponding author:
    Barbara Chapman
Comparison of human and chimpanzee ζ1 globin genes
  • DOI:
    10.1007/bf02115686
  • Publication date:
    1985-12-01
  • Journal:
  • Impact factor:
    1.800
  • Authors:
    Cary Willard;Elsie Wong;John F. Hess;Che-Kun James Shen;Barbara Chapman;Allan C. Wilson;Carl W. Schmid
  • Corresponding author:
    Carl W. Schmid
Experiences Developing the OpenUH Compiler and Runtime Infrastructure

Other grants by Barbara Chapman

Collaborative Research: SHF: MEDIUM: Smart Integrated Tuning of Parallel Code for Multicore and Manycore Systems
  • Award number:
    2211983
  • Fiscal year:
    2022
  • Award amount:
    $266,700
  • Award type:
    Continuing Grant
SHF:Small:Performance Portable Parallel Programming on Extremely Heterogeneous Systems
  • Award number:
    2113996
  • Fiscal year:
    2021
  • Award amount:
    $266,700
  • Award type:
    Standard Grant
Increasing Student Participation in Fifth PGAS Conference (PGAS11)
  • Award number:
    1158635
  • Fiscal year:
    2011
  • Award amount:
    $266,700
  • Award type:
    Standard Grant
SHF:Small: Portable High-Level Programming Model for Heterogeneous Computing Based on OpenMP
  • Award number:
    0917285
  • Fiscal year:
    2009
  • Award amount:
    $266,700
  • Award type:
    Standard Grant
Collaborative Research: Extreme OpenMP: A Programming Model for Productive High End Computing
  • Award number:
    0833201
  • Fiscal year:
    2008
  • Award amount:
    $266,700
  • Award type:
    Standard Grant
Scalable Performance and Power-Aware Hybrid Compilation System for Multicores
  • Award number:
    0702775
  • Fiscal year:
    2007
  • Award amount:
    $266,700
  • Award type:
    Standard Grant
CRI: Planning A Research Compiler Infrastructure Based on Open64
  • Award number:
    0708797
  • Fiscal year:
    2007
  • Award amount:
    $266,700
  • Award type:
    Standard Grant
Collaborative Research: Performance Toolset for Dynamic Optimization of High-End Hybrid Applications
  • Award number:
    0444468
  • Fiscal year:
    2004
  • Award amount:
    $266,700
  • Award type:
    Standard Grant
POWRE: Structure and Function of an Apoptosis Domain in the 75 kDa Neurotropin Receptor
  • Award number:
    0227160
  • Fiscal year:
    2002
  • Award amount:
    $266,700
  • Award type:
    Standard Grant
POWRE: Structure and Function of an Apoptosis Domain in the 75 kDa Neurotropin Receptor
  • Award number:
    9805771
  • Fiscal year:
    1998
  • Award amount:
    $266,700
  • Award type:
    Standard Grant

Similar overseas grants

SPX: Collaborative Research: Automated Synthesis of Extreme-Scale Computing Systems Using Non-Volatile Memory
  • Award number:
    2408925
  • Fiscal year:
    2023
  • Award amount:
    $266,700
  • Award type:
    Standard Grant
SPX: Collaborative Research: Scalable Neural Network Paradigms to Address Variability in Emerging Device based Platforms for Large Scale Neuromorphic Computing
  • Award number:
    2401544
  • Fiscal year:
    2023
  • Award amount:
    $266,700
  • Award type:
    Standard Grant
SPX: Collaborative Research: Intelligent Communication Fabrics to Facilitate Extreme Scale Computing
  • Award number:
    2412182
  • Fiscal year:
    2023
  • Award amount:
    $266,700
  • Award type:
    Standard Grant
SPX: Collaborative Research: Cross-stack Memory Optimizations for Boosting I/O Performance of Deep Learning HPC Applications
  • Award number:
    2318628
  • Fiscal year:
    2022
  • Award amount:
    $266,700
  • Award type:
    Standard Grant
SPX: Collaborative Research: FASTLEAP: FPGA based compact Deep Learning Platform
  • Award number:
    2333009
  • Fiscal year:
    2022
  • Award amount:
    $266,700
  • Award type:
    Standard Grant
SPX: Collaborative Research: NG4S: A Next-generation Geo-distributed Scalable Stateful Stream Processing System
  • Award number:
    2202859
  • Fiscal year:
    2022
  • Award amount:
    $266,700
  • Award type:
    Standard Grant
SPX: Collaborative Research: Memory Fabric: Data Management for Large-scale Hybrid Memory Systems
  • Award number:
    2132049
  • Fiscal year:
    2021
  • Award amount:
    $266,700
  • Award type:
    Standard Grant
SPX: Collaborative Research: Automated Synthesis of Extreme-Scale Computing Systems Using Non-Volatile Memory
  • Award number:
    2113307
  • Fiscal year:
    2020
  • Award amount:
    $266,700
  • Award type:
    Standard Grant
SPX: Collaborative Research: FASTLEAP: FPGA based compact Deep Learning Platform
  • Award number:
    1919117
  • Fiscal year:
    2019
  • Award amount:
    $266,700
  • Award type:
    Standard Grant
SPX: Collaborative Research: Intelligent Communication Fabrics to Facilitate Extreme Scale Computing
  • Award number:
    1918987
  • Fiscal year:
    2019
  • Award amount:
    $266,700
  • Award type:
    Standard Grant