CRII: OAC: Scalability of Deep-Learning Methods on HPC Systems: An I/O-centric Approach
Basic Information
- Award Number: 2105044
- Principal Investigator: Loic Pottier
- Amount: $175,000
- Institution:
- Institution Country: United States
- Project Type: Standard Grant
- Fiscal Year: 2021
- Funding Country: United States
- Start/End Dates: 2021-06-01 to 2023-05-31
- Project Status: Completed
- Source:
- Keywords:
Project Abstract
Machine learning (ML) algorithms have become key techniques in many scientific domains over the last few years. Thanks to the recent democratization of graphics processing units (GPUs), machine learning is mostly fueled by deep learning (DL) techniques that require extensive computational capabilities and vast volumes of data. Training large deep neural networks (DNNs) is compute-intensive. However, thanks to GPUs, the cost of the compute-intensive components of training has been reduced, while the relative cost of memory accesses has increased, following the huge growth in the size of input datasets and in the complexity of ML models. Due to their increasing computational and memory requirements, DNNs are now trained on distributed systems and have recently gained attention from the high-performance computing (HPC) community. A key challenge on HPC systems at extreme scale is the communication bottleneck: communication is much slower than the required computations and also accounts for high energy consumption on large-scale machines. A lack of comprehensive understanding of the trade-offs, costs, and impacts induced by ML algorithms may severely impair scientific discoveries and AI breakthroughs in the near future. This project aims to address this problem by developing accurate performance models that capture the complexity of training a DNN at scale in terms of I/O (communication) and, based on these models, by producing efficient scheduling heuristics that reduce communication when training DNNs on HPC machines. Reducing data exchanges during the training phase decreases the execution time of this costly process and is likely to also reduce its energy consumption. The training of DNNs is becoming essential for many scientific domains, so optimizing the execution of this key component will help NSF fulfill its mission to advance and promote the progress of science. The proposed research will provide researchers with performance models that are key to supporting the development of novel middleware systems for large-scale ML on HPC platforms. Educational and outreach activities will include the development of pedagogic modules that teach students key concepts of distributed computing and the training of large neural networks, and will enable students to participate in workshops and conferences that serve the community.

Training large neural networks on distributed HPC systems is challenging. DNN training involves complex communication patterns with some randomness due to the optimization method used to train the network, which is most often stochastic gradient descent (SGD). Most distributed ML systems have been designed to run on cloud infrastructures; HPC machines, however, exhibit different characteristics, both in hardware, with fast interconnect networks and advanced communication capabilities such as remote direct memory access (RDMA), and in software, with the use of the message passing interface (MPI) and OpenMP parallel programming models. This project will design performance models that take these HPC characteristics into account and give useful insights into the behavior of DNN training at scale, for example, how the data communication volume evolves with the DNN batch size, or how to leverage HPC multi-layered storage, such as burst buffers, to improve DNN training performance.
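As a concrete illustration of the kind of relationship such performance models capture, the sketch below estimates the per-worker communication volume of one training epoch under synchronous data-parallel SGD with ring all-reduce. This is a minimal back-of-the-envelope model, not the project's actual performance model: the function names and the example figures (a 100 MiB gradient buffer, roughly ResNet-50 sized, and an ImageNet-sized dataset) are illustrative assumptions; only the ring all-reduce cost of 2(p-1)/p times the message size per worker is standard.

```python
def ring_allreduce_bytes(model_bytes: int, workers: int) -> float:
    """Bytes sent per worker by one ring all-reduce of the gradients.

    A ring all-reduce moves 2 * (p - 1) / p * M bytes per worker,
    where M is the gradient size and p the number of workers.
    """
    return 2 * (workers - 1) / workers * model_bytes


def epoch_comm_volume(dataset_size: int, local_batch: int,
                      model_bytes: int, workers: int) -> float:
    """Per-worker communication volume (bytes) of one training epoch.

    Assumes synchronous data-parallel SGD: one gradient all-reduce
    per step, with a global batch of local_batch * workers samples.
    """
    steps = -(-dataset_size // (local_batch * workers))  # ceiling division
    return steps * ring_allreduce_bytes(model_bytes, workers)


# Hypothetical scenario: ~100 MiB of gradients, an ImageNet-sized
# dataset (1,281,167 samples), 64 workers.
for batch in (32, 64, 128):
    vol = epoch_comm_volume(1_281_167, batch, 100 * 2**20, 64)
    print(f"local batch {batch:4d}: {vol / 2**30:6.1f} GiB per worker per epoch")
```

Under this simple model, doubling the local batch size halves the number of optimization steps per epoch and therefore halves the per-epoch communication volume; the catch is that larger batches can degrade convergence, which is one reason the heuristics described below must come with training accuracy guarantees.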
This project is organized around three research thrusts: (i) estimating data movement costs when training DNNs on HPC machines; (ii) augmenting the performance models with energy metrics; and (iii) developing bi-objective heuristics that minimize communication and energy while still providing training accuracy guarantees. To address these three thrusts, the project will adopt a simulation-driven approach. The first step will be to characterize the I/O behavior of DNNs when trained on HPC machines. Based on the analysis of the collected data, several performance models and scheduling heuristics will be designed. Then, a simulator of the HPC machine will be developed using the NSF-funded WRENCH project. This simulator will be calibrated with the data collected during the characterization phase. Finally, the performance models and the scheduling heuristics will be evaluated using the calibrated simulator. The project will also leverage the simulator to continuously improve the performance models and heuristics. This project will provide scientists with models to better understand the performance trade-offs that arise when training large-scale neural networks on complex distributed systems. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
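To make thrust (iii) concrete, the following is a minimal sketch of how a bi-objective heuristic could rank candidate training configurations once a calibrated simulator has predicted their communication and energy costs. The Config fields, the pick_config helper, and the weighted-sum scalarization are hypothetical illustrations under stated assumptions, not the project's actual heuristics; the sample numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Config:
    """A candidate training configuration evaluated in simulation."""
    name: str
    comm_bytes: float     # predicted per-epoch communication volume
    energy_joules: float  # predicted per-epoch energy consumption
    est_accuracy: float   # predicted final accuracy (e.g., from a batch-size model)


def pick_config(candidates: list[Config], min_accuracy: float,
                alpha: float = 0.5) -> Config:
    """Weighted-sum bi-objective heuristic.

    Keeps only configurations meeting the accuracy constraint, then
    minimizes a normalized combination of communication and energy;
    alpha = 1 optimizes communication only, alpha = 0 energy only.
    """
    feasible = [c for c in candidates if c.est_accuracy >= min_accuracy]
    if not feasible:
        raise ValueError("no configuration meets the accuracy guarantee")
    max_comm = max(c.comm_bytes for c in feasible)
    max_energy = max(c.energy_joules for c in feasible)
    return min(feasible, key=lambda c: alpha * c.comm_bytes / max_comm
                                       + (1 - alpha) * c.energy_joules / max_energy)


# Made-up simulator predictions for three batch-size configurations.
configs = [
    Config("batch-32", comm_bytes=1.3e11, energy_joules=8.0e8, est_accuracy=0.765),
    Config("batch-128", comm_bytes=3.2e10, energy_joules=6.5e8, est_accuracy=0.752),
    Config("batch-512", comm_bytes=8.1e9, energy_joules=6.0e8, est_accuracy=0.721),
]
print(pick_config(configs, min_accuracy=0.75).name)  # -> batch-128
```

A weighted sum is the simplest scalarization of a bi-objective problem; enumerating the Pareto front of feasible configurations would be an equally plausible design, and the accuracy threshold stands in for the training accuracy guarantees named in thrust (iii).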
Project Outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other Grants by Loic Pottier
Collaborative Research: CyberTraining: Implementation: Small: Integrating core CI literacy and skills into university curricula via simulation-driven activities
- Award Number: 1923539
- Fiscal Year: 2019
- Funding Amount: $175,000
- Project Type: Standard Grant
Similar NSFC Grants
Molecular basis by which Z8-12:OH and Z8-14:OAc maintain the sex-attractant specificity of the oriental fruit moth and the plum fruit moth, respectively
- Award Number:
- Approval Year: 2021
- Funding Amount: ¥350,000
- Project Type: Regional Science Fund Project
Mechanistic study of the photoisomerization of the ruthenium nitrosyl complex [Ru(OAc)(2mqn)2NO]
- Award Number: 21603131
- Approval Year: 2016
- Funding Amount: ¥190,000
- Project Type: Young Scientists Fund Project
Mn(OAc)3-promoted radical cascade reactions under mechanochemical conditions
- Award Number: 21242013
- Approval Year: 2012
- Funding Amount: ¥100,000
- Project Type: Special Fund Project
Similar Overseas Grants
CRII: OAC: A Compressor-Assisted Collective Communication Framework for GPU-Based Large-Scale Deep Learning
- Award Number: 2348465
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
- Award Number: 2403312
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant
Collaborative Research: OAC CORE: Federated-Learning-Driven Traffic Event Management for Intelligent Transportation Systems
- Award Number: 2414474
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant
OAC Core: Cost-Adaptive Monitoring and Real-Time Tuning at Function-Level
- Award Number: 2402542
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant
OAC Core: OAC Core Projects: GPU Geometric Data Processing
- Award Number: 2403239
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant
CRII: OAC: Dynamically Adaptive Unstructured Mesh Technologies for High-Order Multiscale Fluid Dynamics Simulations
- Award Number: 2348394
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant
CRII: OAC: A Multi-fidelity Computational Framework for Discovering Governing Equations Under Uncertainty
- Award Number: 2348495
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
- Award Number: 2402947
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
- Award Number: 2403313
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant
Collaborative Research: OAC Core: Large-Scale Spatial Machine Learning for 3D Surface Topology in Hydrological Applications
- Award Number: 2414185
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant