Collaborative Research: ITR/NGS: Deja Vu: Transparent Checkpointing and Migration of Parallel Codes Over Grid Infrastructures

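Basic Information

  • Award number:
    0325182
  • Principal investigator:
  • Amount:
    $260,300
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Continuing Grant
  • Fiscal year:
    2004
  • Funding country:
    United States
  • Project period:
    2004-04-15 to 2009-03-31
  • Project status:
    Completed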

Project Abstract

A daunting challenge is the evolution from today's computational Grid to a true cyberinfrastructure that seamlessly integrates resources ranging from small clusters in academic laboratories to the largest national supercomputing centers and provides ubiquitous access to high performance computing, research instrumentation, data warehouses, and visualization. Realizing this future requires fundamental advances in transparent fault recovery mechanisms that mask the component failures endemic to any large-scale computational resource. While previous generations of supercomputers engineered reliability into the system hardware, today's high performance computing (HPC) environments are built from clusters of commercial off-the-shelf (COTS) components, with no systemic solution for the reliability of the resource as a whole. Engendering stability in ever-growing networked collections of cluster systems requires a software solution that provides reliable access to computing resources through transparent, efficient, and automatic checkpointing and recovery (CPR) mechanisms. This project aims to bring about this future through radically new approaches to longstanding problems in CPR and process migration by building an integrated system called Deja vu.
Deja vu provides (a) a transparent parallel checkpointing and recovery mechanism that recovers from any combination of system failures without any modification to parallel applications, (b) a novel post-compiler analysis system that transparently captures application state, (c) a systems architecture that seamlessly integrates user-initiated and system-initiated checkpoints in a single framework, enabling the effective use of a wide variety of domain-specific knowledge, (d) novel runtime mechanisms for transparent incremental checkpointing that efficiently capture the least amount of state required to maintain global consistency, (e) a novel communications architecture that enables transparent migration of existing MPI/PVM codes without source-code modifications to either the application or the MPI/PVM libraries, (f) recoverable IO subsystems that can be tailored to specific storage environments, and (g) interfaces to and augmentation of the Globus Toolkit to effectively use the CPR and migration capabilities provided by this research. The core CPR and migration facilities of Deja vu will be surrounded by management, security, and scheduling facilities that (a) integrate with local scheduling systems (e.g., OpenPBS) and accounting systems for site-specific accounting and refunding of lost compute cycles and (b) extend the Globus security architecture with fine-grained rights and dynamically created user accounts that allow the fluid resource control available under the Deja vu system to be fully exploited. The design goal of this project is to build not just "point" solutions but an integrated system that will constitute a fundamental component of both large-scale computing facilities and Grid infrastructures. Our research team (VT, PSC, ISR) has considerable experience in the design, development, deployment, and support of complete solutions.

Project Outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Nathan Stone

Mediation of Interleukin‐23 and Tumor Necrosis Factor–Driven Reactive Arthritis by Chlamydia‐Infected Macrophages in SKG Mice
  • DOI:
    10.1002/art.41653
  • Publication year:
    2021
  • Journal:
  • Impact factor:
    13.3
  • Authors:
    X. Romand; Xiao Liu; M. A. Rahman; Z. A. Bhuyan; C. Douillard; R. A. Kedia; Nathan Stone; D. Roest; Zi Huai Chew; A. Cameron; L. Rehaume; Aurélie Bozon; Mohammed Habib; C. Armitage; M. Nguyen; B. Favier; K. Beagley; M. Maurin; P. Gaudin; Ranjeny Thomas; T. Wells; A. Baillet
  • Corresponding author:
    A. Baillet

Similar NSFC (National Natural Science Foundation of China) grants

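Research on Quantum Field Theory without a Lagrangian Description
  • Award number:
    24ZR1403900
  • Approval year:
    2024
  • Funding amount:
    CNY 0
  • Project type:
    Provincial/municipal project
Cell Research
  • Award number:
    31224802
  • Approval year:
    2012
  • Funding amount:
    CNY 240,000
  • Project type:
    Special fund project
Cell Research
  • Award number:
    31024804
  • Approval year:
    2010
  • Funding amount:
    CNY 240,000
  • Project type:
    Special fund project
Cell Research
  • Award number:
    30824808
  • Approval year:
    2008
  • Funding amount:
    CNY 240,000
  • Project type:
    Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
  • Award number:
    10774081
  • Approval year:
    2007
  • Funding amount:
    CNY 450,000
  • Project type:
    General program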

Similar overseas grants

ITR Collaborative Research: Pervasively Secure Infrastructures (PSI): Integrating Smart Sensing, Data Mining, Pervasive Networking, and Community Computing
  • Award number:
    1404694
  • Fiscal year:
    2013
  • Award amount:
    $260,300
  • Award type:
    Continuing Grant
ITR-SCOTUS: A Resource for Collaborative Research in Speech Technology, Linguistics, Decision Processes, and the Law
  • Award number:
    1139735
  • Fiscal year:
    2011
  • Award amount:
    $260,300
  • Award type:
    Continuing Grant
ITR/NGS: Collaborative Research: DDDAS: Data Dynamic Simulation for Disaster Management
  • Award number:
    0963973
  • Fiscal year:
    2009
  • Award amount:
    $260,300
  • Award type:
    Continuing Grant
ITR/NGS: Collaborative Research: DDDAS: Data Dynamic Simulation for Disaster Management
  • Award number:
    1018072
  • Fiscal year:
    2009
  • Award amount:
    $260,300
  • Award type:
    Continuing Grant
ITR Collaborative Research: A Reusable, Extensible, Optimizing Back End
  • Award number:
    0838899
  • Fiscal year:
    2008
  • Award amount:
    $260,300
  • Award type:
    Continuing Grant
ITR Collaborative Research: Pervasively Secure Infrastructures (PSI): Integrating Smart Sensing, Data Mining, Pervasive Networking, and Community Computing
  • Award number:
    0833849
  • Fiscal year:
    2008
  • Award amount:
    $260,300
  • Award type:
    Continuing Grant
ITR/NGS: Collaborative Research: DDDAS: Data Dynamic Simulation for Disaster Management
  • Award number:
    0808419
  • Fiscal year:
    2007
  • Award amount:
    $260,300
  • Award type:
    Continuing Grant
ITR: Collaborative Research - ASE - (sim+dmc): Image-based Biophysical Modeling: Scalable Registration and Inversion Algorithms and Distributed Computing
  • Award number:
    0849301
  • Fiscal year:
    2007
  • Award amount:
    $260,300
  • Award type:
    Continuing Grant
ITR: Collaborative Research: Modeling and Display of Haptic Information for Enhanced Performance of Computer-Integrated Surgery
  • Award number:
    0711040
  • Fiscal year:
    2007
  • Award amount:
    $260,300
  • Award type:
    Standard Grant
Collaborative Research: ITR-(ASE)-(dmc): Overcoming Fractionation Errors in Cancer Treatment Planning
  • Award number:
    0749671
  • Fiscal year:
    2006
  • Award amount:
    $260,300
  • Award type:
    Standard Grant