Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
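Basic information
- Award number: 2403090
- Principal investigator:
- Amount: $150,000
- Host institution:
- Host institution country: United States
- Award type: Standard Grant
- Fiscal year: 2024
- Funding country: United States
- Period: 2024-10-01 to 2027-09-30
- Status: Active (not yet concluded)
- Source:
- Keywords: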
Project abstract
Machine Learning (ML) and Deep Learning (DL) (more specifically, Deep Neural Network (DNN)) workloads are beginning to dominate the High-Performance Computing (HPC) arena. Today, massive computational resources are required to train even a single state-of-the-art deep learning model (e.g., large language models or LLMs). As the need for training massive DNN models continues and expands from the private sector to NSF-supported scientists and engineers (who are more likely to use shared computing resources), efficient checkpointing is emerging as a critical need. Checkpointing not only helps deal with failures but also provides more scheduling flexibility on shared HPC resources, as a very long-running job can be broken into several shorter ones. The premise of the CropDL project is that efficient and automated application-level checkpoint and restart will be critical to facilitating the use of shared HPC clusters for long-running ML training tasks, drastically increasing the number of researchers that can successfully train large ML models for various applications.
This project also contributes to education and diversity in multiple aspects, for example, 1) introducing courses (or course material) to bring attention to ML-related workloads in computer systems undergraduate and graduate education; 2) integrating research tasks from this project with synergistic research programs at universities to increase the participation of women and underrepresented minority groups; and 3) supporting and training PhD students in their research, creating momentum on systems and cyberinfrastructure research related to emerging ML workloads and popularizing integrative research that combines the properties of these workloads with the complexities of modern HPC hardware.
The overarching goal of CropDL is to support application-level checkpoints/restarts of deep learning applications for better resiliency, faster average completion time, and higher resource utilization.
Particularly, several properties of DL workloads (as compared to scientific computations) create distinct sets of opportunities and challenges for checkpointing: 1) limited communication patterns during parallel execution, which can enable efficient coordinated checkpoints, 2) many unique opportunities for compression of checkpoints, and possibly taking uncoordinated checkpoints, and 3) malleable execution, where restarting from a different number of nodes is possible. Based on this observation, the first direction of this project is to exploit the properties of the DNN model(s) to be trained during checkpointing. This includes asynchronous versioned checkpointing for DL applications under a wide variety of parallelism models as well as content-based data reduction (compression and sparsification) techniques to reduce checkpoint volumes. The second direction of research focuses on using current and upcoming HPC systems' resources efficiently while checkpointing. It formulates tasks, data, and I/O requirements from DL applications into DAG representations and develops methods to schedule them. It also supports efficient I/O for deep learning applications with emerging I/O platforms. The last direction is to automate checkpointing through a compilation system based on the computational graph of DL workloads. All these efforts consider a variety of parallelization schemes for DNNs, i.e., data, model, and/or pipelined parallelism.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Wei Niu
Multilayer Si shadow mask processing of wafer-scale MoS2 devices
- DOI: 10.1088/2053-1583/ab6b6b
- Published: 2020
- Journal:
- Impact factor: 5.5
- Authors: Haima Zhang; Xiaojiao Guo; Wei Niu; Hu Xu; Qijuan Wu; Fuyou Liao; Jing Chen; Hongwei Tang; Hanqi Liu; Zihan Xu; Zhengzong Sun; Zhijun Qiu; Yong Pu; Wenzhong Bao
- Corresponding author: Wenzhong Bao
MHC class I-associated presentation of exogenous peptides is not only enhanced but also prolonged by linking with a C-terminal Lys-Asp-Glu-Leu endoplasmic reticulum retrieval signal
- DOI:
- Published: 2004
- Journal:
- Impact factor: 5.4
- Authors: Li Wang; Yuzhang Wu; An Chen; Jingbo Zhang; Zhao Yang; Wei Niu; Miao Geng; B. Ni; Wei Zhou; L. Zou; M. Jiang
- Corresponding author: M. Jiang
Research on target detection method based on CNN
- DOI: 10.1088/1742-6596/2252/1/012051
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: Wei Niu; Bo Gao; Wentao Zhan; Juan Cheng
- Corresponding author: Juan Cheng
Approximate Analytical Solution to the Temperature Field in Annular Thermoelectric Generator Made of Temperature-Dependent Material
- DOI: https://doi.org/10.1109/TED.2021.3122951
- Published: 2021
- Journal:
- Impact factor:
- Authors: Wei Niu; Xiaoshan Cao; Yifeng Hu; Fangfang Wang; Junping Shi
- Corresponding author: Junping Shi
Probing the atomic-scale ferromagnetism in van der Waals magnet CrSiTe3
- DOI: 10.1063/5.0069885
- Published: 2021
- Journal:
- Impact factor:
- Authors: Wei Niu; Xiaoqian Zhang; Wei Wang; Jiabao Sun; Yongbing Xu; Liang He; Wenqing Liu; Yong Pu
- Corresponding author: Yong Pu
Other grants by Wei Niu
Engineering Carboxylic Acid Reductase for the Biosyntheses of Industrial Chemicals
- Award number: 1805528
- Fiscal year: 2018
- Amount: $150,000
- Award type: Standard Grant
SusChEM: Novel 1,2-Propanediol Biosynthesis from Renewable Feedstocks through Enzyme Discovery
- Award number: 1438332
- Fiscal year: 2014
- Amount: $150,000
- Award type: Standard Grant
Similar National Natural Science Foundation of China (NSFC) grants
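Construction of a firmly bonded interface for hydrogel-modified ceramic artificial joints and the mechanism of friction reduction and lubrication
- Award number:
- Approval year: 2025
- Amount: 0.0 (in units of 10,000 CNY)
- Project type: Provincial/municipal project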
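Domain dynamics of lead zirconate-based antiferroelectrics and their regulation mechanisms
- Award number:
- Approval year: 2025
- Amount: 0.0 (in units of 10,000 CNY)
- Project type: Provincial/municipal project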
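Adsorption and immobilization of soil cadmium contamination by iron-loaded biochar and the mechanism of microbial synergy
- Award number:
- Approval year: 2025
- Amount: 0.0 (in units of 10,000 CNY)
- Project type: Provincial/municipal project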
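Mechanism by which the SREBP transcription factor BbSre1 negatively regulates the production of antifungal substances in Beauveria bassiana
- Award number:
- Approval year: 2025
- Amount: 0.0 (in units of 10,000 CNY)
- Project type: Provincial/municipal project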
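Time-varying encoding of joint-motion feedback in myoelectric prosthetic hands for restoring movement perception in amputees
- Award number:
- Approval year: 2025
- Amount: 0.0 (in units of 10,000 CNY)
- Project type: Provincial/municipal project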
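Carbon dot/microfluidic confinement-enhanced luminescence sensing for rapid emergency water-quality testing
- Award number:
- Approval year: 2025
- Amount: 0.0 (in units of 10,000 CNY)
- Project type: Provincial/municipal project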
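Physics-informed hybrid modeling and non-collocated control methods for flexible piezoelectric solar arrays
- Award number:
- Approval year: 2025
- Amount: 0.0 (in units of 10,000 CNY)
- Project type: Provincial/municipal project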
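Regularity of the stochastic three-dimensional Burgers equation
- Award number:
- Approval year: 2025
- Amount: 0.0 (in units of 10,000 CNY)
- Project type: Provincial/municipal project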
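Mechanism by which kynurenine activates granulocytic MDSCs through the AhR/STAT3 axis to promote cardiac fibrosis in chronic kidney disease
- Award number:
- Approval year: 2025
- Amount: 0.0 (in units of 10,000 CNY)
- Project type: Provincial/municipal project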
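Machine learning studies of magnetism: a focus on graph neural networks
- Award number:
- Approval year: 2025
- Amount: 0.0 (in units of 10,000 CNY)
- Project type: Provincial/municipal project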
Similar overseas grants
Collaborative Research: OAC CORE: Federated-Learning-Driven Traffic Event Management for Intelligent Transportation Systems
- Award number: 2414474
- Fiscal year: 2024
- Amount: $150,000
- Award type: Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
- Award number: 2403312
- Fiscal year: 2024
- Amount: $150,000
- Award type: Standard Grant
Collaborative Research: OAC Core: Large-Scale Spatial Machine Learning for 3D Surface Topology in Hydrological Applications
- Award number: 2414185
- Fiscal year: 2024
- Amount: $150,000
- Award type: Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
- Award number: 2402947
- Fiscal year: 2024
- Amount: $150,000
- Award type: Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
- Award number: 2403313
- Fiscal year: 2024
- Amount: $150,000
- Award type: Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
- Award number: 2402946
- Fiscal year: 2024
- Amount: $150,000
- Award type: Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
- Award number: 2403088
- Fiscal year: 2024
- Amount: $150,000
- Award type: Standard Grant
Collaborative Research: OAC: Core: Harvesting Idle Resources Safely and Timely for Large-scale AI Applications in High-Performance Computing Systems
- Award number: 2403399
- Fiscal year: 2024
- Amount: $150,000
- Award type: Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
- Award number: 2403089
- Fiscal year: 2024
- Amount: $150,000
- Award type: Standard Grant
Collaborative Research: OAC: Core: Harvesting Idle Resources Safely and Timely for Large-scale AI Applications in High-Performance Computing Systems
- Award number: 2403398
- Fiscal year: 2024
- Amount: $150,000
- Award type: Standard Grant