NSCI: SI2-SSE: An Extensible Model to Support Scalable Checkpoint-Restart for DMTCP Across Multiple Disciplines
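Basic Information
- Award number: 1740218
- Principal investigator:
- Amount: $400,000
- Host institution:
- Host institution country: United States
- Project category: Standard Grant
- Fiscal year: 2018
- Funding country: United States
- Project period: 2018-01-01 to 2021-12-31
- Project status: Completed
- Source:
- Keywords: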
Project Abstract
Checkpointing is a technique that periodically saves the state of a long-running computer program to disk. If a computer crash occurs during the running of the program, one can then restart the program state from a previously saved "checkpoint" file on disk. The goal of this project is to discover, implement and deploy novel techniques for adapting checkpointing so as to provide a more robust capability easily usable across applications supporting the research of a variety of scientific and engineering disciplines. In particular, a problem with the classic (transparent) checkpoint model is that these packages do not model, and hence cannot recreate upon restart, communications between the original program and other external processes or programs. In this project, a virtualization model for commonly used mechanisms for communication will be developed so that on restart, external communications are emulated. Checkpointing is used across academia, industry, and government, particularly by those with long-running high performance computing programs. Thus, the project outcomes have broad applicability and value. The project has the added benefit of educating the next generation of students in valuable and highly transferable system skills. Today, transparent checkpoint-restart is used primarily for fault tolerance, and primarily in closed systems with no external communication. DMTCP is a twelve-year-old open-source checkpointing project. Its currently evolving process virtualization model of checkpointing enables it to support complex applications that interact with external subsystems. The project explores and extends a model of process virtualization in order to adapt checkpoint-restart to multiple, novel applications, and to extend its use across multiple scientific and engineering disciplines.
Example disciplines that will benefit include: supercomputing (and in particular, forging a path toward practical exascale checkpointing); novel strategies for flexible resource managers (batch queues) for computer clusters that adapt to the current workload; and better support for hardware circuit emulators for Electronic Design Automation (EDA). Example challenges include the need to support transparent checkpointing over the newer low-latency networks such as Omni-Path, integration of application-specific checkpointing with transparent DMTCP-style checkpointing, the need to avoid "flooding" back-end storage during checkpointing in high-end clusters, and new types of resource managers that benefit from the flexibility of arbitrarily suspending running jobs through checkpointing. Rather than build ad hoc solutions for each of the above, this work will provide a simple model allowing end users to easily build their own extensions to support checkpointing of the external subsystems. The simple model will be derived by generalizing over solutions to many of the example challenges described above. In addition to fault tolerance, the technology holds advantages for: fast startup (checkpoint after process initialization, in order to restart and skip this phase in future sessions); debugging (e.g. checkpoint every 30 seconds); reproducible bug reports; extended interactive sessions (e.g. checkpoint before dinner and restart the next day); and so on. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
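The basic checkpoint-restart cycle described in the abstract (save state periodically, then resume from the last saved file after a crash) can be illustrated with a minimal sketch. This is application-specific checkpointing written by hand, not DMTCP's transparent mechanism, and it does not use any DMTCP API. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the pickle file format, and the toy workload are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state to path atomically: write a temp file, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk
    os.replace(tmp_path, path)  # atomic rename over any older checkpoint


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint file exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(total_steps, path, checkpoint_every=100, crash_at=None):
    """Toy long-running job: sum the integers 0..total_steps-1.

    The job resumes from the checkpoint at path if one exists.
    crash_at (hypothetical test hook) simulates a failure at that step.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["total"]
```

In this sketch, a run that fails partway loses only the work done since the last checkpoint; calling `run` again with the same path picks up from the saved step. A transparent checkpointer aims to provide the same effect without the application having to define its own `state` or save logic.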
Project Outcomes
Journal articles (13)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Sthread: In-Vivo Model Checking of Multithreaded Programs
- DOI: 10.22152/programming-journal.org/2020/4/13
- Publication year: 2020
- Journal:
- Impact factor: 0
- Authors: Cooperman, Gene; Quinson, Martin
- Corresponding author: Quinson, Martin
Towards a generic multilayer negotiation framework for efficient application provisioning in the cloud
- DOI: 10.1002/cpe.4182
- Publication year: 2018
- Journal:
- Impact factor: 0
- Authors: Omezzine, Aya; Bellamine Ben Saoud, Narjes; Tazi, Said; Cooperman, Gene
- Corresponding author: Cooperman, Gene
MANA-2.0: A Future-Proof Design for Transparent Checkpointing of MPI at Scale
- DOI: 10.1109/scws55283.2021.00019
- Publication year: 2021
- Journal:
- Impact factor: 0
- Authors: Xu, Yao; Zhao, Zhengji; Garg, Rohan; Khetawat, Harsh; Hartman-Baker, Rebecca; Cooperman, Gene
- Corresponding author: Cooperman, Gene
Improving scalability and reliability of MPI-agnostic transparent checkpointing for production workloads at NERSC
- DOI:
- Publication year: 2021
- Journal:
- Impact factor: 0
- Authors: Chouhan, Prashant Singh; Khetawat, Harsh; Resnik, Neil; Jain, Twinkle; Garg, Rohan; Cooperman, Gene; Hartman-Baker, Rebecca; Zhao, Zhengji
- Corresponding author: Zhao, Zhengji
Deploying Checkpoint/Restart for Production Workloads at NERSC
- DOI:
- Publication year: 2020
- Journal:
- Impact factor: 0
- Authors: Zhengji Zhao, Rebecca Hartman-Baker
- Corresponding author: Zhengji Zhao, Rebecca Hartman-Baker
Other publications by Gene Cooperman
Implementation-Oblivious Transparent Checkpoint-Restart for MPI
- DOI: 10.1145/3624062.3624255
- Publication year: 2023
- Journal:
- Impact factor: 0
- Authors: Yao Xu; Leonid Belyaev; Twinkle Jain; Derek Schafer; A. Skjellum; Gene Cooperman
- Corresponding author: Gene Cooperman
Other grants by Gene Cooperman
SI2-SSE: Enhancement and Support of DMTCP for Adaptive, Extensible Checkpoint-Restart
- Award number: 1440788
- Fiscal year: 2014
- Funding amount: $400,000
- Project category: Standard Grant
DMTCP: Checkpoint-Restart on the Desktop
- Award number: 0960978
- Fiscal year: 2010
- Funding amount: $400,000
- Project category: Standard Grant
AF:Small: Computation in Very Large Groups
- Award number: 0916133
- Fiscal year: 2009
- Funding amount: $400,000
- Project category: Standard Grant
MRI: Enabling Research on Terabyte-Scale Datasets
- Award number: 0619616
- Fiscal year: 2006
- Funding amount: $400,000
- Project category: Standard Grant
Scalable Parallel Symbolic Computation for Irregular Problems
- Award number: 0204113
- Fiscal year: 2002
- Funding amount: $400,000
- Project category: Continuing Grant
Parallel Infrastructure for Recognition of Non-Local Patterns from Particle Detectors
- Award number: 9872114
- Fiscal year: 1999
- Funding amount: $400,000
- Project category: Standard Grant
Connections Among Applied Computational Group Theory, Matrix Representations, and Parallel Computations
- Award number: 9732330
- Fiscal year: 1998
- Funding amount: $400,000
- Project category: Standard Grant
MRI: A High-Performance, Low-Cost Testbed for Network-based Research
- Award number: 9871022
- Fiscal year: 1998
- Funding amount: $400,000
- Project category: Standard Grant
U.S.-German Cooperative Research in Computational Algebra and High-Speed Networks
- Award number: 9722439
- Fiscal year: 1997
- Funding amount: $400,000
- Project category: Standard Grant
East Coast Computer Algebra Day, Northeastern University, Boston, MA, May 3, 1997
- Award number: 9707543
- Fiscal year: 1997
- Funding amount: $400,000
- Project category: Standard Grant
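Similar NSFC (National Natural Science Foundation of China) grants
Microstructural evolution and strengthening-toughening mechanisms of Nb-containing phases in combustion-synthesized (Mo,Nb)Si2 materials
- Award number: 51202289
- Approval year: 2012
- Funding amount: CNY 250,000
- Project category: Young Scientists Fund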
Similar overseas grants
SI2-SSE: GenApp - A Transformative Generalized Application Cyberinfrastructure
- Award number: 1912444
- Fiscal year: 2018
- Funding amount: $400,000
- Project category: Standard Grant
SI2-SSE: A parallel computing framework for large-scale real-space and real-time TDDFT excited-states calculations
- Award number: 1739423
- Fiscal year: 2018
- Funding amount: $400,000
- Project category: Standard Grant
Collaborative Research: SI2-SSE: WRENCH: A Simulation Workbench for Scientific Workflow Users, Developers, and Researchers
- Award number: 1642369
- Fiscal year: 2017
- Funding amount: $400,000
- Project category: Standard Grant
SI2-SSE: Entangled Quantum Dynamics in Closed and Open Systems, an Open Source Software Package for Quantum Simulator Development and Exploration of Synthetic Quantum Matter
- Award number: 1740130
- Fiscal year: 2017
- Funding amount: $400,000
- Project category: Standard Grant
SI2-SSE: Highly Efficient and Scalable Software for Coarse-Grained Molecular Dynamics
- Award number: 1740211
- Fiscal year: 2017
- Funding amount: $400,000
- Project category: Standard Grant
SI2-SSE: Collaborative Research: Integrated Tools for DNA Nanostructure Design and Simulation
- Award number: 1740212
- Fiscal year: 2017
- Funding amount: $400,000
- Project category: Standard Grant
Collaborative Research: NSCI: SI2-SSE: Time Stepping and Exchange-Correlation Modules for Massively Parallel Real-Time Time-Dependent DFT
- Award number: 1740219
- Fiscal year: 2017
- Funding amount: $400,000
- Project category: Standard Grant
SI2-SSE: Collaborative Research: Integrated Tools for DNA Nanostructure Design and Simulation
- Award number: 1740282
- Fiscal year: 2017
- Funding amount: $400,000
- Project category: Standard Grant
Collaborative Research: SI2-SSE: An open source multi-physics platform to advance fundamental understanding of plasma physics and enable impactful application of plasma systems
- Award number: 1740300
- Fiscal year: 2017
- Funding amount: $400,000
- Project category: Standard Grant
NSCI SI2-SSE: Multiscale Software for Quantum Simulations of Nanostructured Materials and Devices
- Award number: 1740309
- Fiscal year: 2017
- Funding amount: $400,000
- Project category: Standard Grant