Scalable Fault Tolerance for MPI

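Basic Information

  • Award number:
    0330620
  • Principal investigator:
  • Amount:
    --
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2003
  • Funding country:
    United States
  • Duration:
    2003-08-15 to 2008-12-31
  • Project status:
    Completed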

Project Abstract

This project will focus on the middleware layer used to provide message passing capabilities for distributed memory applications. The system will use a component-based architecture to create a layered approach to enabling scalable fault tolerance. The design and implementation of the message passing system will provide sufficient flexibility to accommodate a wide variety of application state management and recovery mechanisms. The proposed message passing system will have the following capabilities:

  • Failure detection;
  • Fault tolerance of the run-time system;
  • Fault tolerance of the message passing layer (MPI);
  • Meaningful semantics for the message passing layer in the presence of faults;
  • Flexible application interface for fault notification; and
  • Support for application state management and recovery.

This work will be realized as part of the component framework of LAM/MPI, an existing high-quality implementation of the Message Passing Interface standard.

This project will have broader impacts in several areas. Availability of a scalable and reliable middleware layer will enable new levels of application scalability. In addition, new capabilities for MPI jobs can be realized, including process migration, non-stop behavior, and gang scheduling. Since the proposed middleware will be realized in the context of MPI, applications written with MPI can immediately benefit from this work. And, since the work will be integrated into a production-quality implementation of MPI, the results can actually be used by a widespread audience. The modular component-based design of our implementation will allow the middleware to be used by a wide variety of state management and application-specific fault tolerance schemes. Finally, important contributions from this work are in the form of interface designs: other implementations of MPI will also be able to adopt these approaches to provide reliability.

Project Outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Andrew Lumsdaine

Multi-scale contrast-based saliency enhancement for salient object detection
  • DOI:
    10.1049/iet-cvi.2013.0118
  • Published:
    2014-06
  • Journal:
  • Impact factor:
    1.7
  • Authors:
    Wenhui Zhou; Teng Song; Lili Lin; Andrew Lumsdaine
  • Corresponding author:
    Andrew Lumsdaine
Cascade residual learning based adaptive feature aggregation for light field super-resolution
  • DOI:
    10.1016/j.patcog.2025.111616
  • Published:
    2025-09-01
  • Journal:
  • Impact factor:
    7.600
  • Authors:
    Hao Zhang; Wenhui Zhou; Lili Lin; Andrew Lumsdaine
  • Corresponding author:
    Andrew Lumsdaine

Other grants by Andrew Lumsdaine

SI2-SSE: GraphPack: Unified Graph Processing with Parallel Boost Graph Library, GraphBLAS, and High-Level Generic Algorithm Interfaces
  • Award number:
    1716828
  • Fiscal year:
    2016
  • Award amount:
    --
  • Project type:
    Standard Grant
SI2-SSE: GraphPack: Unified Graph Processing with Parallel Boost Graph Library, GraphBLAS, and High-Level Generic Algorithm Interfaces
  • Award number:
    1642439
  • Fiscal year:
    2016
  • Award amount:
    --
  • Project type:
    Standard Grant
SHF: Large: Collaborative Research: PXGL: Cyberinfrastructure for Scalable Graph Execution
  • Award number:
    1111888
  • Fiscal year:
    2011
  • Award amount:
    --
  • Project type:
    Continuing Grant
CSR-PSCE, TM: A Declarative Approach to Managing the Complexity of Massively Parallel Programs
  • Award number:
    0834722
  • Fiscal year:
    2008
  • Award amount:
    --
  • Project type:
    Continuing Grant
Collaborative Research: Modular Metaprogramming
  • Award number:
    0702717
  • Fiscal year:
    2007
  • Award amount:
    --
  • Project type:
    Standard Grant
ST-CRTS: Collaborative Research: Lifting Compiler Optimizations via Generic Programming
  • Award number:
    0541335
  • Fiscal year:
    2006
  • Award amount:
    --
  • Project type:
    Standard Grant
High Performance Software Components for Scientific Computing
  • Award number:
    0196531
  • Fiscal year:
    2001
  • Award amount:
    --
  • Project type:
    Standard Grant
NGS: Open Compilation for Self-Optimizing Generic Components
  • Award number:
    0131354
  • Fiscal year:
    2001
  • Award amount:
    --
  • Project type:
    Continuing Grant
High Performance Software Components for Scientific Computing
  • Award number:
    9982205
  • Fiscal year:
    2000
  • Award amount:
    --
  • Project type:
    Standard Grant
CAREER: High-Performance Computing for Computational Science and Engineering
  • Award number:
    9502710
  • Fiscal year:
    1995
  • Award amount:
    --
  • Project type:
    Standard Grant

Similar overseas grants

CAREER: Storage-Aware Fault Tolerance
  • Award number:
    2339784
  • Fiscal year:
    2024
  • Award amount:
    --
  • Project type:
    Continuing Grant
Collaborative Research: CIF: Small: Approximate Coded Computing - Fundamental Limits of Precision, Fault-Tolerance, and Privacy
  • Award number:
    2231706
  • Fiscal year:
    2023
  • Award amount:
    --
  • Project type:
    Standard Grant
Collaborative Research: CIF: Small: Approximate Coded Computing - Fundamental Limits of Precision, Fault-tolerance and Privacy
  • Award number:
    2231707
  • Fiscal year:
    2023
  • Award amount:
    --
  • Project type:
    Standard Grant
Unlocking the potential of Quantum LDPC Codes for low-overhead fault-tolerance
  • Award number:
    EP/Y004620/1
  • Fiscal year:
    2023
  • Award amount:
    --
  • Project type:
    Research Grant
CRII: SaTC: RUI: When Logic Locking Meets Hardware Trojan Mitigation and Fault Tolerance
  • Award number:
    2245247
  • Fiscal year:
    2023
  • Award amount:
    --
  • Project type:
    Standard Grant
Unlocking the potential of Quantum LDPC Codes for low-overhead fault-tolerance
  • Award number:
    EP/Y004507/1
  • Fiscal year:
    2023
  • Award amount:
    --
  • Project type:
    Research Grant
Towards resiliency through health monitoring, diagnosis, prognosis, and fault tolerance in complex and cyber-physical systems with applications to electrified and connected vehicles.
  • Award number:
    RGPIN-2018-04002
  • Fiscal year:
    2022
  • Award amount:
    --
  • Project type:
    Discovery Grants Program - Individual
Improving fault-tolerance mechanisms in distributed data streaming systems
  • Award number:
    575699-2022
  • Fiscal year:
    2022
  • Award amount:
    --
  • Project type:
    Alexander Graham Bell Canada Graduate Scholarships - Master's
Collaborative Research: SHF: Small: Learning Fault Tolerance at Scale
  • Award number:
    2135309
  • Fiscal year:
    2022
  • Award amount:
    --
  • Project type:
    Standard Grant
AI Flight Controllers for Improved Flight Characteristics and Fault Tolerance
  • Award number:
    10030288
  • Fiscal year:
    2022
  • Award amount:
    --
  • Project type:
    BEIS-Funded Programmes