SPX: Collaborative Research: Cross-layer Application-Aware Resilience at Extreme Scale (CAARES)

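Basic information

  • Award number:
    1725649
  • Principal investigator:
  • Amount:
    $267,200
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2017
  • Funding country:
    United States
  • Duration:
    2017-08-15 to 2020-07-31
  • Project status:
    Completed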

Project abstract

The increasing demands of science and engineering applications push the limits of current large-scale systems, which are expected to achieve exascale (10^18 FLOPS) performance early in the next decade. One of the less-studied challenges at extreme scales is the reliability of the computing system itself, primarily due to the very large number of cores and components used and to the sharp decrease of the Mean Time Between Failures on such systems (on the order of tens of minutes). This project departs from the traditional single-component fault management model and explores how multiple software libraries (and application components) used within a single parallel application can interact to provide the holistic fault management support necessary for parallel applications targeting capability computing. This exploration will not be limited to software developed using a single parallel programming paradigm, but will be extended to encompass the more challenging case where multiple programming paradigms can be combined to achieve a common goal, to simulate a set of large-scale scientific applications in use today. The goal of this project is to depart from the current siloed resilience mechanisms and propose cross-layer composition solutions that can fundamentally address these resilience challenges at extreme scales.
More specifically, this proposal will address the following research challenges: (1) development of a theoretical foundation for a deeper understanding of the challenges and opportunities arising from combining different resilience models and methodologies; (2) design of a flexible programming abstraction that allows different resilience models and mechanisms to be combined so that they cooperate and address resilience in a more holistic manner; and (3) development of the basic, programming-paradigm-independent constructs necessary to implement cross-layer and domain-specific approaches to support resilience and to understand the related performance/quality trade-offs. The proposed approach will be validated by exposing these generic abstractions in two different programming paradigms (MPI and OpenSHMEM) and by creating and developing specialized concepts for each of these paradigms. This will enable an assessment of the validity of the concepts and of the corresponding overheads imposed by the different software layers, using a few software frameworks and applications.

Project outcomes

Journal articles (3)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Scalable Crash Consistency for Staging-based In-situ Scientific Workflows
Scalable Data Resilience for In-memory Data Staging

Other publications by Ivan Rodero

Grid broker selection strategies using aggregated resource information
  • DOI:
    10.1016/j.future.2009.07.009
  • Publication date:
    2010-01-01
  • Journal:
  • Impact factor:
  • Authors:
    Ivan Rodero; Francesc Guim; Julita Corbalan; Liana Fong; S. Masoud Sadjadi
  • Corresponding author:
    S. Masoud Sadjadi
In-situ feature-based objects tracking for data-intensive scientific and enterprise analytics workflows

Other grants by Ivan Rodero

CIF21 DIBBs: EI: Virtual Data Collaboratory: A Regional Cyberinfrastructure for Collaborative Data Intensive Science
  • Award number:
    2220826
  • Fiscal year:
    2021
  • Award amount:
    $267,200
  • Project type:
    Standard Grant
Collaborative Research: Framework: Data: NSCI: HDR: GeoSCIFramework: Scalable Real-Time Streaming Analytics and Machine Learning for Geoscience and Hazards Research
  • Award number:
    2219975
  • Fiscal year:
    2021
  • Award amount:
    $267,200
  • Project type:
    Standard Grant
Collaborative Research: Framework: Data: NSCI: HDR: GeoSCIFramework: Scalable Real-Time Streaming Analytics and Machine Learning for Geoscience and Hazards Research
  • Award number:
    1835692
  • Fiscal year:
    2019
  • Award amount:
    $267,200
  • Project type:
    Standard Grant
NSF Large Facilities Cyberinfrastructure Workshop
  • Award number:
    1742969
  • Fiscal year:
    2017
  • Award amount:
    $267,200
  • Project type:
    Standard Grant
EAGER: Online Processing of Data in Large Facilities using National Advanced CyberInfrastructure
  • Award number:
    1745246
  • Fiscal year:
    2017
  • Award amount:
    $267,200
  • Project type:
    Standard Grant
CIF21 DIBBs: EI: Virtual Data Collaboratory: A Regional Cyberinfrastructure for Collaborative Data Intensive Science
  • Award number:
    1640834
  • Fiscal year:
    2016
  • Award amount:
    $267,200
  • Project type:
    Standard Grant
BIGDATA: Collaborative Research: IA: F: Fractured Subsurface Characterization using High Performance Computing and Guided by Big Data
  • Award number:
    1546145
  • Fiscal year:
    2016
  • Award amount:
    $267,200
  • Project type:
    Standard Grant
CRII: CI: Exploring Advanced Cyber-Infrastructure Co-Design for Big Data Analytics
  • Award number:
    1464317
  • Fiscal year:
    2015
  • Award amount:
    $267,200
  • Project type:
    Standard Grant

Similar overseas grants

SPX: Collaborative Research: Automated Synthesis of Extreme-Scale Computing Systems Using Non-Volatile Memory
  • Award number:
    2408925
  • Fiscal year:
    2023
  • Award amount:
    $267,200
  • Project type:
    Standard Grant
SPX: Collaborative Research: Scalable Neural Network Paradigms to Address Variability in Emerging Device based Platforms for Large Scale Neuromorphic Computing
  • Award number:
    2401544
  • Fiscal year:
    2023
  • Award amount:
    $267,200
  • Project type:
    Standard Grant
SPX: Collaborative Research: Intelligent Communication Fabrics to Facilitate Extreme Scale Computing
  • Award number:
    2412182
  • Fiscal year:
    2023
  • Award amount:
    $267,200
  • Project type:
    Standard Grant
SPX: Collaborative Research: Cross-stack Memory Optimizations for Boosting I/O Performance of Deep Learning HPC Applications
  • Award number:
    2318628
  • Fiscal year:
    2022
  • Award amount:
    $267,200
  • Project type:
    Standard Grant
SPX: Collaborative Research: NG4S: A Next-generation Geo-distributed Scalable Stateful Stream Processing System
  • Award number:
    2202859
  • Fiscal year:
    2022
  • Award amount:
    $267,200
  • Project type:
    Standard Grant
SPX: Collaborative Research: FASTLEAP: FPGA based compact Deep Learning Platform
  • Award number:
    2333009
  • Fiscal year:
    2022
  • Award amount:
    $267,200
  • Project type:
    Standard Grant
SPX: Collaborative Research: Memory Fabric: Data Management for Large-scale Hybrid Memory Systems
  • Award number:
    2132049
  • Fiscal year:
    2021
  • Award amount:
    $267,200
  • Project type:
    Standard Grant
SPX: Collaborative Research: Automated Synthesis of Extreme-Scale Computing Systems Using Non-Volatile Memory
  • Award number:
    2113307
  • Fiscal year:
    2020
  • Award amount:
    $267,200
  • Project type:
    Standard Grant
SPX: Collaborative Research: FASTLEAP: FPGA based compact Deep Learning Platform
  • Award number:
    1919117
  • Fiscal year:
    2019
  • Award amount:
    $267,200
  • Project type:
    Standard Grant
SPX: Collaborative Research: Intelligent Communication Fabrics to Facilitate Extreme Scale Computing
  • Award number:
    1918987
  • Fiscal year:
    2019
  • Award amount:
    $267,200
  • Project type:
    Standard Grant