Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters

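Basic Information

  • Award number:
    2403088
  • Principal investigator:
  • Amount:
    $225,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2024
  • Funding country:
    United States
  • Period:
    2024-10-01 to 2027-09-30
  • Project status:
    Ongoing

Project Abstract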

Machine Learning (ML) and Deep Learning (DL) (more specifically, Deep Neural Network (DNN)) workloads are beginning to dominate the High-Performance Computing (HPC) arena. Today, massive computational resources are required to train even a single state-of-the-art deep learning model (e.g., large language models or LLMs). As the need for training massive DNN models continues and expands from the private sector to NSF-supported scientists and engineers (who are more likely to use shared computing resources), efficient checkpointing is emerging as a critical need. Checkpointing not only helps deal with failures but also provides more scheduling flexibility on shared HPC resources, as a very long-running job can be broken into several shorter ones. The premise of the CropDL project is that efficient and automated application-level checkpoint and restart will be critical to facilitating the use of shared HPC clusters for long-running ML training tasks, drastically increasing the number of researchers that can successfully train large ML models for various applications. This project also contributes to education and diversity in multiple aspects, for example, 1) introducing courses (or course material) to bring attention to ML-related workloads in computer systems undergraduate and graduate education; 2) integrating research tasks from this project with synergistic research programs at universities to increase the participation of women and underrepresented minority groups; and 3) supporting and training PhD students in their research, creating momentum on systems and cyberinfrastructure research related to emerging ML workloads and popularizing integrative research that combines the properties of these workloads with the complexities of modern HPC hardware.

The overarching goal of CropDL is to support application-level checkpoints/restarts of deep learning applications for better resiliency, faster average completion time, and higher resource utilization.
Particularly, several properties of DL workloads (as compared to scientific computations) create distinct sets of opportunities and challenges for checkpointing: 1) limited communication patterns during parallel execution, which can enable efficient coordinated checkpoints, 2) many unique opportunities for compression of checkpoints, and possibly taking uncoordinated checkpoints, and 3) malleable execution, where restarting from a different number of nodes is possible. Based on this observation, the first direction of this project is to exploit the properties of the DNN model(s) to be trained during checkpointing. This includes asynchronous versioned checkpointing for DL applications under a wide variety of parallelism models as well as content-based data reduction (compression and sparsification) techniques to reduce checkpoint volumes. The second direction of research focuses on using current and upcoming HPC systems' resources efficiently while checkpointing. It formulates tasks, data, and I/O requirements from DL applications into DAG representations and develops methods to schedule them. It also supports efficient I/O for deep learning applications with emerging I/O platforms. The last direction is to automate checkpointing through a compilation system based on the computational graph of DL workloads. All these efforts consider a variety of parallelization schemes for DNNs, i.e., data, model, and/or pipelined parallelism.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
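
As a rough illustration of two ideas in the abstract (asynchronous versioned checkpointing, and breaking one long-running job into several shorter ones that resume from a checkpoint), the following minimal Python sketch may help. It is not CropDL's design or API: the class `AsyncVersionedCheckpointer`, the function `train`, the `ckpt_<step>.pkl` file naming, and the toy "weights" list are all hypothetical names chosen for this example, and it uses only the Python standard library rather than any DL framework.

```python
import os
import pickle
import tempfile
import threading


class AsyncVersionedCheckpointer:
    """Hypothetical helper: snapshot state in memory, write it in the background."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)
        self._writer = None

    def save(self, step, state):
        # Serialize synchronously so later mutations of `state` cannot leak
        # into this version; only the file write runs in the background.
        payload = pickle.dumps({"step": step, "state": state})
        self.wait()  # allow at most one write in flight
        self._writer = threading.Thread(target=self._write, args=(step, payload))
        self._writer.start()

    def _write(self, step, payload):
        final_path = os.path.join(self.directory, f"ckpt_{step:08d}.pkl")
        fd, tmp_path = tempfile.mkstemp(dir=self.directory, suffix=".tmp")
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        # Atomic rename: readers only ever see complete versions.
        os.replace(tmp_path, final_path)

    def wait(self):
        if self._writer is not None:
            self._writer.join()
            self._writer = None

    def load_latest(self):
        self.wait()
        versions = sorted(
            name for name in os.listdir(self.directory)
            if name.startswith("ckpt_") and name.endswith(".pkl")
        )
        if not versions:
            return None
        with open(os.path.join(self.directory, versions[-1]), "rb") as f:
            return pickle.load(f)


def train(ckpt, total_steps, max_steps_this_job, every=5):
    """Toy training loop that resumes from the newest checkpoint, if any."""
    restored = ckpt.load_latest()
    step = restored["step"] if restored else 0
    weights = restored["state"] if restored else [0.0, 0.0]
    budget = max_steps_this_job  # stand-in for a scheduler's wall-time limit
    while step < total_steps and budget > 0:
        weights = [w + 1.0 for w in weights]  # stand-in for a real update
        step += 1
        budget -= 1
        if step % every == 0:
            ckpt.save(step, weights)
    ckpt.wait()
    return step, weights
```

Calling `train` twice against the same directory mimics two shorter jobs on a shared cluster: the second call restarts from the last complete version written by the first and redoes only the steps taken after that checkpoint. Writing to a temporary file and renaming it is one common way to keep a partially written checkpoint from being mistaken for a valid one; the abstract's data-reduction and DAG-scheduling directions are not covered by this sketch.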

Project Outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Bin Ren

Grouped Temporal Enhancement Module for Human Action Recognition
A High Performance Sparse Tensor Algebra Compiler in MLIR
Revealing Protein Binding Affinity on Metal Surfaces: An Electrochemistry Approach
  • DOI:
    10.1039/d1cc07098c
  • Published:
    2022
  • Journal:
  • Impact factor:
    4.9
  • Authors:
    Danya Lyu; Pingshi Wang; Shuo Zhang; Guokun Liu; Bin Ren
  • Corresponding author:
    Bin Ren
Classification of 2-step nilpotent Lie algebras of dimension 8 with 3-dimensional center
  • DOI:
  • Published:
  • Journal:
  • Impact factor:
    0
  • Authors:
    Bin Ren; Linsheng Zhu
  • Corresponding author:
    Linsheng Zhu
Development of Weak Signal Recognition and an Extraction Algorithm for Raman Imaging
  • DOI:
  • Published:
    2019
  • Journal:
  • Impact factor:
    7.4
  • Authors:
    Xin Wang; Guokun Liu; Mengxi Xu; Bin Ren; Zhongqun Tian
  • Corresponding author:
    Zhongqun Tian

Other grants of Bin Ren

Collaborative Research: CNS Core: Small: A Compilation System for Mapping Deep Learning Models to Tensorized Instructions (DELITE)
  • Award number:
    2230944
  • Fiscal year:
    2023
  • Amount:
    $225,000
  • Project type:
    Standard Grant
Collaborative Research: SHF: SMALL: Compile-Parallelize-Schedule-Retarget-Repeat (EASER) Paradigm for Dealing with Extreme Heterogeneity
  • Award number:
    2146873
  • Fiscal year:
    2022
  • Amount:
    $225,000
  • Project type:
    Standard Grant
EAGER: Collaborative Research: On the Theoretical Foundation of Recommendation System Evaluation
  • Award number:
    2142681
  • Fiscal year:
    2021
  • Amount:
    $225,000
  • Project type:
    Standard Grant
CAREER: Achieving Real-Time Machine Learning with Sparsification-Compilation Co-design
  • Award number:
    2047516
  • Fiscal year:
    2021
  • Amount:
    $225,000
  • Project type:
    Continuing Grant

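Similar NSFC (National Natural Science Foundation of China) grants

Research on highly integrated microwave/millimeter-wave antennas supporting two-dimensional millimeter-wave beam scanning
  • Award number:
    62371263
  • Year approved:
    2023
  • Amount:
    CNY 520,000
  • Project type:
    General Program
Study of tandem Heck/denitrogenative rearrangement reactions of hydrazones
  • Award number:
    22301211
  • Year approved:
    2023
  • Amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund
Study of synergistic performance regulation and dendrite suppression mechanisms in aqueous zinc-ion batteries
  • Award number:
    52364038
  • Year approved:
    2023
  • Amount:
    CNY 330,000
  • Project type:
    Regional Science Fund
Pathogenic role and mechanism of TSPYL1 mutations in sudden infant death syndrome, studied with a human serotonergic neuron reporter system
  • Award number:
    82371176
  • Year approved:
    2023
  • Amount:
    CNY 490,000
  • Project type:
    General Program
Mechanism of FOXO3 m6A methylation-induced trophoblast senescence in the treatment of spontaneous abortion by the kidney-tonifying method
  • Award number:
    82305286
  • Year approved:
    2023
  • Amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund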

Similar overseas grants

Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
  • Award number:
    2403312
  • Fiscal year:
    2024
  • Amount:
    $225,000
  • Project type:
    Standard Grant
Collaborative Research: OAC CORE: Federated-Learning-Driven Traffic Event Management for Intelligent Transportation Systems
  • Award number:
    2414474
  • Fiscal year:
    2024
  • Amount:
    $225,000
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: Large-Scale Spatial Machine Learning for 3D Surface Topology in Hydrological Applications
  • Award number:
    2414185
  • Fiscal year:
    2024
  • Amount:
    $225,000
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
  • Award number:
    2402947
  • Fiscal year:
    2024
  • Amount:
    $225,000
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
  • Award number:
    2403313
  • Fiscal year:
    2024
  • Amount:
    $225,000
  • Project type:
    Standard Grant