Collaborative Research: Elements: VLCC-States: Versioned Lineage-Driven Checkpointing of Composable States

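Basic Information

  • Award number:
    2411387
  • Principal investigator:
  • Amount:
    $300,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2024
  • Funding country:
    United States
  • Project period:
    2024-10-01 to 2027-09-30
  • Project status:
    Ongoing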

Project Abstract

Checkpointing is a fundamental pattern used by a variety of scientific applications at both small and large computing scales. Widely adopted for resilience purposes by long-running applications (i.e., checkpoint-restart), it has seen an explosion of additional use cases that directly help applications progress faster and reduce time-to-solution even in the absence of failures: adjoint computations (essential in financial modeling, weather prediction, computational fluid dynamics, seismic imaging, and control theory) need to capture a history of checkpoints in a forward pass, which are then revisited in a backward pass. Training artificial intelligence models, increasingly used by scientific applications, often results in trajectories that do not lead to convergence or may lead to undesirable patterns, prompting the need to backtrack to an earlier checkpoint of the learning model to try an alternative. Transfer learning and fine-tuning using a previous checkpoint of a learning model can be used to adapt the training more quickly, avoiding expensive training from scratch. Many other use cases are important in scientific computing: suspend-resume (e.g., to preempt a long-running job in favor of a higher priority job), migration (checkpoint on one machine, restart on another), debugging (replay a problematic code region to reproduce errors without starting from scratch), and reproducibility (checkpoint and compare intermediate data during repeated runs). Despite broad applicability, current state-of-the-art solutions lack the flexibility, performance, and scalability needed to address these scenarios efficiently. The Versioned Lineage-Driven Checkpointing of Composable States (VLCC-States) project aims to fill this gap. It will streamline the development and use of checkpointing patterns for scientific applications, which simplifies and improves the reusability of integration efforts across different communities, improves awareness of the multitude of checkpointing scenarios, reduces development effort and cost, and enables flexible customization to extract the best performance and scalability for the desired application scenario.

VLCC-States provides technical innovation in three areas. First, it introduces composable providers of intermediate states, which hide the complexity of capturing and assembling checkpoints of distributed data structures and their transformations across different modules and programming languages while optimizing their layout to eliminate redundancies, reduce sizes, and improve performance. Second, it provides multi-level co-optimized caching and prefetching techniques, which enable scalable management of the life cycle of checkpoints for interleavings of capture and reuse operations on heterogeneous storage stacks under concurrency. Third, it develops specialized checkpointing tools for large Artificial Intelligence models, with a focus on integration with PyTorch and DeepSpeed, to enable users to transparently take advantage of high-performance and scalable checkpointing using a familiar API. This project will engage partners in industry and national research laboratories to co-design VLCC-States, tune its capabilities, and evaluate its implementation. This project will undertake educational and broadening participation activities to improve community awareness and understanding of challenges in scientific data management.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants by M Mustafa Rafique

Collaborative Research: CNS Core: Medium: HardLambda: A new FaaS Abstraction for Cross-Stack Resource Management in Disaggregated Datacenters
  • Award number:
    2106635
  • Fiscal year:
    2021
  • Amount:
    $300,000
  • Project type:
    Standard Grant

Similar NSFC Grants

Research on Quantum Field Theory without a Lagrangian Description
  • Award number:
    24ZR1403900
  • Approval year:
    2024
  • Amount:
    CNY 0
  • Project type:
    Provincial/municipal project
Cell Research
  • Award number:
    31224802
  • Approval year:
    2012
  • Amount:
    CNY 240,000
  • Project type:
    Special fund project
Cell Research
  • Award number:
    31024804
  • Approval year:
    2010
  • Amount:
    CNY 240,000
  • Project type:
    Special fund project
Cell Research
  • Award number:
    30824808
  • Approval year:
    2008
  • Amount:
    CNY 240,000
  • Project type:
    Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
  • Award number:
    10774081
  • Approval year:
    2007
  • Amount:
    CNY 450,000
  • Project type:
    General Program

Similar Overseas Grants

Collaborative Research: Elements: Linking geochemical proxy records to crustal stratigraphic context via community-interactive cyberinfrastructure
  • Award number:
    2311092
  • Fiscal year:
    2023
  • Amount:
    $300,000
  • Project type:
    Standard Grant
Collaborative Research: Elements: Lattice QCD software for nuclear physics on heterogeneous architectures
  • Award number:
    2311430
  • Fiscal year:
    2023
  • Amount:
    $300,000
  • Project type:
    Standard Grant
Collaborative Research: Elements: ProDM: Developing A Unified Progressive Data Management Library for Exascale Computational Science
  • Award number:
    2311757
  • Fiscal year:
    2023
  • Amount:
    $300,000
  • Project type:
    Standard Grant
Collaborative Research: FuSe: Monolithic 3D Integration (M3D) of 2D Materials-Based CFET Logic Elements towards Advanced Microelectronics
  • Award number:
    2329189
  • Fiscal year:
    2023
  • Amount:
    $300,000
  • Project type:
    Standard Grant
Collaborative Research: Experimental and computational constraints on the isotope fractionation of Mossbauer-inactive elements in mantle minerals
  • Award number:
    2246686
  • Fiscal year:
    2023
  • Amount:
    $300,000
  • Project type:
    Standard Grant
Collaborative Research: Elements: Linking geochemical proxy records to crustal stratigraphic context via community-interactive cyberinfrastructure
  • Award number:
    2311091
  • Fiscal year:
    2023
  • Amount:
    $300,000
  • Project type:
    Standard Grant
Collaborative Research: Elements: Phonon Database Generation, Analysis, and Visualization for Data Driven Materials Discovery
  • Award number:
    2311202
  • Fiscal year:
    2023
  • Amount:
    $300,000
  • Project type:
    Standard Grant
Collaborative Research: Elements: Enabling Particle and Nuclear Physics Discoveries with Neural Deconvolution
  • Award number:
    2311667
  • Fiscal year:
    2023
  • Amount:
    $300,000
  • Project type:
    Standard Grant
Collaborative Research: Experimental and computational constraints on the isotope fractionation of Mossbauer-inactive elements in mantle minerals
  • Award number:
    2246687
  • Fiscal year:
    2023
  • Amount:
    $300,000
  • Project type:
    Standard Grant
Collaborative Research: GEO-CM: The occurrences of the rare earth elements in highly weathered sedimentary rocks, Georgia kaolins.
  • Award number:
    2327660
  • Fiscal year:
    2023
  • Amount:
    $300,000
  • Project type:
    Standard Grant