CSR-PSCE,SM: Recovery Aware Parallel Computing

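Basic Information

  • Award number:
    0834514
  • Principal investigator:
  • Amount:
    $330,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Continuing Grant
  • Fiscal year:
    2008
  • Funding country:
    United States
  • Duration:
    2008-09-01 to 2013-05-31
  • Project status:
    Completed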

Project Abstract

As the scale and complexity of parallel systems continue to grow, failures are inevitable. For years research focused on pre-failure prediction and tolerance - predicting failures and taking precautionary actions before failure occurrence. Despite progress on failure prediction, unexpected failures occur in practice, especially in modern systems with unprecedented sizes and complexities. Relying on pre-failure prediction and tolerance alone is insufficient for fault management because of the inevitability of failures. Just as failures need to be carefully avoided and managed when they occur, post-failure diagnosis and recovery is of equal importance and has a profound impact on almost every aspect of parallel computing. The goal of this research project is to develop RAPS, a Recovery Aware Parallel computing System for post-failure diagnosis and recovery. The research focuses on how to quickly and effectively resume parallel computing after a failure has occurred. The ultimate goal is to seamlessly integrate post-failure diagnosis and recovery with pre-failure prediction and tolerance as a compound fault management solution for parallel computing. The approach consists of (1) development of new diagnosis mechanisms for fast failure detection and root cause analysis, (2) development of system-wide orchestration for recovery coordination, (3) design of new recovery techniques for quick restoration of parallel applications, and (4) a comprehensive evaluation. The results of this project can significantly improve the productivity of parallel systems. This project also enhances the CS curriculum at IIT and broadens the participation by underrepresented groups.

Project Outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Zhiling Lan

Surrogate Modeling for HPC Application Iteration Times Forecasting with Network Features
  • DOI:
  • Published:
    2024
  • Journal:
  • Impact factor:
    0
  • Authors:
    Xiongxiao Xu;Kevin A. Brown;Tanwi Mallick;Xin Wang;Elkin Cruz;Robert B. Ross;Christopher D. Carothers;Zhiling Lan;Kai Shu
  • Corresponding author:
    Kai Shu
Application power profiling on IBM Blue Gene/Q
  • DOI:
    10.1016/j.parco.2016.05.015
  • Published:
    2016-09-01
  • Journal:
  • Impact factor:
  • Authors:
    Sean Wallace;Zhou Zhou;Venkatram Vishwanath;Susan Coghlan;John Tramm;Zhiling Lan;Michael E. Papka
  • Corresponding author:
    Michael E. Papka

Other grants by Zhiling Lan

SHF:Small:Intelligent Management of Hybrid Workloads for Extreme Scale Computing
  • Award number:
    2413597
  • Fiscal year:
    2023
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
Collaborative Research: PPoSS: Planning: SEEr: A Scalable, Energy Efficient HPC Environment for AI-Enabled Science
  • Award number:
    2119294
  • Fiscal year:
    2021
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
SHF:Small:Intelligent Management of Hybrid Workloads for Extreme Scale Computing
  • Award number:
    2109316
  • Fiscal year:
    2021
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
CSR: Small: IRON: Reducing Workload Interference on Massively Parallel Platforms
  • Award number:
    1717763
  • Fiscal year:
    2017
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
SHF: Small: Collaborative Research: Experimental-based Research on Effective Models of Parallel Application Execution Time, Power, and Resilience
  • Award number:
    1618776
  • Fiscal year:
    2016
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
SHF: CSR: Small: Toward Smart HPC through Active Learning and Intelligent Scheduling
  • Award number:
    1422009
  • Fiscal year:
    2014
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
SHF: CSR: Small: A Cooperative Framework for Topology Awareness on Large-Scale Systems
  • Award number:
    1320125
  • Fiscal year:
    2013
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
Collaborative Research: Towards Petascale Cosmological Simulations
  • Award number:
    0904670
  • Fiscal year:
    2009
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
CSR/AES: Enhancing Application Robustness via Adaptive and Cooperative Methods
  • Award number:
    0720549
  • Fiscal year:
    2007
  • Award amount:
    $330,000
  • Project type:
    Standard Grant

Similar international grants

CSR-PSCE, SM: MPI-PPA: Improving Efficiency of Large-Scale Clusters Through Statistical Performance Prediction
  • Award number:
    0936251
  • Fiscal year:
    2009
  • Award amount:
    $330,000
  • Project type:
    Continuing Grant
Collaborative Research: CSR-PSCE, SM: Adaptive Memory Management in Shared Environments
  • Award number:
    0834323
  • Fiscal year:
    2008
  • Award amount:
    $330,000
  • Project type:
    Continuing Grant
CSR-PSCE,SM: Trade-offs Between Static Power, Performance and Reliability in Future Chip Multiprocessors
  • Award number:
    0834799
  • Fiscal year:
    2008
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
CSR-PSCE,SM: A Holistic Design Approach to Reliability Using 3D Stacked
  • Award number:
    0834798
  • Fiscal year:
    2008
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
CSR-PSCE, SM: Automatic Multithreaded and Transactional Memory Workload Synthesis for Efficient Multi-core Design Space Evaluation
  • Award number:
    0834288
  • Fiscal year:
    2008
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
Collaborative Research: CSR-PSCE, SM: Memory Thermal Management for Multi-Core Systems
  • Award number:
    0834475
  • Fiscal year:
    2008
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
CSR-PSCE, SM: Memory Management Innovations for Next-Generation SMP
  • Award number:
    0834619
  • Fiscal year:
    2008
  • Award amount:
    $330,000
  • Project type:
    Continuing Grant
CSR-PSCE,SM: Compiler-Directed System Optimization of a Highly-Parallel Fine-Grained Chip Multiprocessor
  • Award number:
    0834373
  • Fiscal year:
    2008
  • Award amount:
    $330,000
  • Project type:
    Continuing Grant
Collaborative Research: CSR-PSCE, SM: Memory Thermal Management for Multi-Core Systems
  • Award number:
    0834469
  • Fiscal year:
    2008
  • Award amount:
    $330,000
  • Project type:
    Standard Grant
CSR-PSCE, SM: Recording and Deterministically Replaying Shared-memory Multiprocessor Execution Efficiently
  • Award number:
    0834738
  • Fiscal year:
    2008
  • Award amount:
    $330,000
  • Project type:
    Continuing Grant