CAREER: Aon - An Integrative Approach to Petascale Fault Tolerance

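Basic information

  • Award number:
    0952960
  • Principal investigator:
  • Amount:
    $408,800
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Continuing Grant
  • Fiscal year:
    2010
  • Funding country:
    United States
  • Project period:
    2010-03-01 to 2017-02-28
  • Project status:
    Completed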

Project abstract

Advances in computing power over the past two decades have driven successive generations of powerful supercomputers. Petascale systems containing tens of thousands of processors have recently emerged. At this scale, frequent component and software faults cause parallel applications to fail often, forcing users to save critical program data (checkpoint) repeatedly at an unsustainable scale and pace, which wastes resources and triggers additional faults. This has created a crisis: recent experience with petascale systems reveals that increased checkpoint frequency is inducing additional faults, and the excessive overhead required for checkpointing on peta- and exascale systems is reaching theoretical scaling limits. These problems represent a petascale reliability barrier that prevents the effective use of these systems.

To address this problem, the investigator is pursuing a research and education program to improve the reliability and efficiency of high performance computing systems through a comprehensive approach to fault detection, prediction, response, and recovery. The effort involves work on four fronts: i) investigation of new methods for fault detection and prediction; ii) creation of new algorithms, techniques, and tools to avoid faults by proactively responding to potential faults, and to efficiently recover from faults when they occur; iii) creation of a fault injection framework and architecture testbed to assess and validate fault prediction, detection, and proactive and reactive response mechanisms; and iv) development of an education and training program to disseminate fault-aware practices for systems administrators and application developers. The expected results are: more reliable HPC systems and parallel applications; new fault prediction, detection, and response algorithms, software libraries, and tools; and the establishment of a cohort of students and researchers trained to use fault prediction and response technologies.

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants by Thomas Hacker

Building the research innovation workforce: a workshop to identify new insights and directions to advance the research computing community.
  • Award number:
    2036534
  • Fiscal year:
    2020
  • Award amount:
    $408,800
  • Award type:
    Standard Grant
Participant Support for the 2015 NSF CyberBridges Workshop
  • Award number:
    1543630
  • Fiscal year:
    2015
  • Award amount:
    $408,800
  • Award type:
    Standard Grant
Participant Support for the 2013 NSF Cyberbridges Workshop: Developing the Next Generation of Cyberinfrastructure Faculty for Computational- and Data-enabled Science & Engineer
  • Award number:
    1340201
  • Fiscal year:
    2013
  • Award amount:
    $408,800
  • Award type:
    Standard Grant

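Similar NSFC grants

RNA nanoparticle delivery of a novel AON to improve treatment efficacy for USH2A exon 13-associated retinal degeneration, and a study of its mechanism
  • Award number:
  • Approval year:
    2021
  • Award amount:
    CNY 540,000
  • Award type:
    General Program
Synthesis of the neutron capture agent RGD-PEI-AON-(157Gd-DTPA)n and an experimental study of its use in treating anaplastic thyroid cancer
  • Award number:
    LY19H180005
  • Approval year:
    2018
  • Award amount:
    CNY 0
  • Award type:
    Provincial/municipal project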

Similar overseas grants

AON: Sustained observation and study of the rapidly evolving Arctic Ocean environment
  • Award number:
    2314360
  • Fiscal year:
    2023
  • Award amount:
    $408,800
  • Award type:
    Continuing Grant
Collaborative Research: AON: The Arctic Great Rivers Observatory (ArcticGRO)
  • Award number:
    2230812
  • Fiscal year:
    2022
  • Award amount:
    $408,800
  • Award type:
    Continuing Grant
AON Collaborative Research: Continuation of long-term Beaufort Gyre observations in 2020-2024 to enhance understanding of the Arctic's role in climate variability
  • Award number:
    1949881
  • Fiscal year:
    2020
  • Award amount:
    $408,800
  • Award type:
    Continuing Grant
AON Collaborative Research: Continuation of long-term Beaufort Gyre observations in 2020-2024 to enhance understanding of the Arctic's role in climate variability
  • Award number:
    1950077
  • Fiscal year:
    2020
  • Award amount:
    $408,800
  • Award type:
    Continuing Grant
An AON-USArray observing network in Arctic Alaska
  • Award number:
    2024208
  • Fiscal year:
    2020
  • Award amount:
    $408,800
  • Award type:
    Continuing Grant
Collaborative Research: Using the ITEX-AON network to document and understand terrestrial ecosystem change in the Arctic
  • Award number:
    1836861
  • Fiscal year:
    2019
  • Award amount:
    $408,800
  • Award type:
    Continuing Grant
Collaborative Research: AON: The Arctic Great Rivers Observatory (ArcticGRO)
  • Award number:
    1913888
  • Fiscal year:
    2019
  • Award amount:
    $408,800
  • Award type:
    Continuing Grant
Collaborative Research: AON: The Arctic Great Rivers Observatory (ArcticGRO)
  • Award number:
    1914215
  • Fiscal year:
    2019
  • Award amount:
    $408,800
  • Award type:
    Continuing Grant
Collaborative Research: Using the ITEX-AON network to document and understand terrestrial ecosystem change in the Arctic
  • Award number:
    1836873
  • Fiscal year:
    2019
  • Award amount:
    $408,800
  • Award type:
    Continuing Grant
Collaborative Research: AON: The Arctic Great Rivers Observatory (ArcticGRO)
  • Award number:
    1914081
  • Fiscal year:
    2019
  • Award amount:
    $408,800
  • Award type:
    Continuing Grant