
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters

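Basic Information

Award number:
2403088
Principal investigator:
Bin Ren
Amount:
$225,000
Awardee institution:
College of William and Mary
Awardee institution country:
United States
Award type:
Standard Grant
Fiscal year:
2024
Funding country:
United States
Period of performance:
2024-10-01 to 2027-09-30
Project status:
Ongoing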

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2403088&HistoricalAwards=false
Keywords:
Collaborative Research OAC Core CropDL

Project Abstract

Machine Learning (ML) and Deep Learning (DL) (more specifically, Deep Neural Network (DNN)) workloads are beginning to dominate the High-Performance Computing (HPC) arena. Today, massive computational resources are required to train even a single state-of-the-art deep learning model (e.g., large language models or LLMs). As the need for training massive DNN models continues and expands from the private sector to NSF-supported scientists and engineers (who are more likely to use shared computing resources), efficient checkpointing is emerging as a critical need. Checkpointing not only helps deal with failures but also provides more scheduling flexibility on shared HPC resources, as a very long-running job can be broken into several shorter ones. The premise of the CropDL project is that efficient and automated application-level checkpoint and restart will be critical to facilitating the use of shared HPC clusters for long-running ML training tasks, drastically increasing the number of researchers that can successfully train large ML models for various applications. This project also contributes to education and diversity in multiple aspects, for example, 1) introducing courses (or course material) to bring attention to ML-related workloads in computer systems undergraduate and graduate education; 2) integrating research tasks from this project with synergistic research programs at universities to increase the participation of women and underrepresented minority groups; and 3) supporting and training PhD students in their research, creating momentum on systems and cyberinfrastructure research related to emerging ML workloads and popularizing integrative research that combines the properties of these workloads with the complexities of modern HPC hardware.

The overarching goal of CropDL is to support application-level checkpoints/restarts of deep learning applications for better resiliency, faster average completion time, and higher resource utilization. Particularly, several properties of DL workloads (as compared to scientific computations) create distinct sets of opportunities and challenges for checkpointing: 1) limited communication patterns during parallel execution, which can enable efficient coordinated checkpoints, 2) many unique opportunities for compression of checkpoints, and possibly taking uncoordinated checkpoints, and 3) malleable execution, where restarting from a different number of nodes is possible. Based on this observation, the first direction of this project is to exploit the properties of the DNN model(s) to be trained during checkpointing. This includes asynchronous versioned checkpointing for DL applications under a wide variety of parallelism models as well as content-based data reduction (compression and sparsification) techniques to reduce checkpoint volumes. The second direction of research focuses on using current and upcoming HPC systems' resources efficiently while checkpointing. It formulates tasks, data, and I/O requirements from DL applications into DAG representations and develops methods to schedule them. It also supports efficient I/O for deep learning applications with emerging I/O platforms. The last direction is to automate checkpointing through a compilation system based on the computational graph of DL workloads. All these efforts consider a variety of parallelization schemes for DNNs, i.e., data, model, and/or pipelined parallelism.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
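
The abstract notes that checkpointing lets a very long-running job be broken into several shorter ones. The following is a minimal sketch of that general idea only, not CropDL's design or API: the function names, the `step_budget` parameter (standing in for a scheduler time limit), and the toy update rule are all hypothetical, and a real DL framework would save model and optimizer state rather than a small dictionary.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # readers never see a half-written checkpoint


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def train(total_steps, ckpt_path, ckpt_every=10, step_budget=None):
    """Toy training loop; step_budget mimics a job's time limit (hypothetical)."""
    state = load_checkpoint(ckpt_path) or {"step": 0, "weight": 0.0}
    steps_run = 0
    while state["step"] < total_steps:
        if step_budget is not None and steps_run >= step_budget:
            break  # job slot expired; a later job resumes from the checkpoint
        # Stand-in for one deterministic optimizer update (illustrative only).
        state["weight"] += 0.5 * (state["step"] + 1)
        state["step"] += 1
        steps_run += 1
        if state["step"] % ckpt_every == 0 or state["step"] == total_steps:
            save_checkpoint(ckpt_path, state)
    return state
```

In this sketch, calling `train(100, "ckpt.pkl", step_budget=25)` and then `train(100, "ckpt.pkl")` plays the role of two consecutive shorter jobs: the second call resumes from the last saved step, repeats any steps taken after that checkpoint, and reaches the same final state as a single uninterrupted run because the update is deterministic.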

Project Outcomes

Journal articles (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Other publications by Bin Ren

Development of arteriolar niche and self-renewal of breast cancer stem cells by lysophosphatidic Acid/protein kinase D signaling

DOI:
Year published:
2021
Journal:
Impact factor:
0
Authors:
Yinan Jiang;Yichen Guo;Jinjin Hao;R. Guenter;J. Lathia;A. Beck;R. Hattaway;D. Hurst;Q. Wang;Yehe Liu;Qi Cao;H. Krontiras;He;R. Silverstein;Bin Ren
Corresponding author:
Bin Ren