CRII: OAC: Scalability of Deep-Learning Methods on HPC Systems: An I/O-centric Approach
Basic Information
- Award Number: 2105044
- Principal Investigator: Loic Pottier
- Amount: $175,000
- Institution:
- Institution Country: United States
- Project Type: Standard Grant
- Fiscal Year: 2021
- Funding Country: United States
- Start/End Dates: 2021-06-01 to 2023-05-31
- Project Status: Completed
- Source:
- Keywords:
Project Abstract
Machine learning (ML) algorithms have become key techniques in many scientific domains over the last few years. Thanks to the recent democratization of graphics processing units (GPUs), machine learning is mostly fueled by deep learning (DL) techniques that require extensive computational capabilities and vast volumes of data. Training large deep neural networks (DNNs) is compute-intensive. However, thanks to GPUs, the cost of the compute-intensive components of training has been reduced, while the relative cost of memory accesses has increased, following the huge growth in the size of input datasets and in the complexity of ML models. Due to their increasing computational and memory requirements, DNNs are now trained on distributed systems and have recently gained attention from the high-performance computing (HPC) community. A key challenge on HPC systems at extreme scale is the communication bottleneck: communication is much slower than the required computations and also accounts for high energy consumption on large-scale machines. A lack of comprehensive understanding of the trade-offs, costs, and impacts induced by ML algorithms may severely impair scientific discoveries and AI breakthroughs in the near future. This project aims to address this problem by developing accurate performance models that capture the complexity of training a DNN at scale in terms of I/O (communication) and, based on these models, by producing efficient scheduling heuristics that reduce communication when training DNNs on HPC machines. Reducing data exchanges during the training phase decreases the execution time of this costly process and is likely to also reduce its energy consumption. The training of DNNs is becoming essential for many scientific domains, so optimizing the execution of this key component will help NSF fulfill its mission to advance and promote the progress of science. The proposed research will provide researchers with performance models that are key to supporting the development of novel middleware systems for large-scale ML on HPC platforms. Educational and outreach activities will include the development of pedagogic modules that teach students key concepts of distributed computing and the training of large neural networks, and will enable students to participate in workshops and conferences that serve the community.

Training large neural networks on distributed HPC systems is challenging. DNN training involves complex communication patterns with some randomness due to the optimization method used to train the network, which is most often stochastic gradient descent (SGD). Most distributed ML systems have been designed to run on cloud infrastructures; HPC machines, however, exhibit different characteristics, both in hardware, with fast interconnect networks and advanced communication capabilities such as remote direct memory access (RDMA), and in software, with the use of the message passing interface (MPI) and OpenMP parallel programming models. This project will design performance models that take these HPC characteristics into account and give useful insights into the behavior of DNN training at scale, for example, how the data communication volume evolves with the DNN batch size, or how to leverage HPC multi-layered storage, such as burst buffers, to improve DNN training performance.
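As a concrete illustration of the kind of relationship such performance models capture, the sketch below estimates the per-worker communication volume of one training epoch under synchronous data-parallel SGD with ring all-reduce. This is a minimal back-of-the-envelope model, not the project's actual performance model: the function names and the example figures (a 100 MiB gradient buffer, roughly ResNet-50 sized, and an ImageNet-sized dataset) are illustrative assumptions; only the ring all-reduce cost of 2(p-1)/p times the message size per worker is standard.

```python
def ring_allreduce_bytes(model_bytes: int, workers: int) -> float:
    """Bytes sent per worker by one ring all-reduce of the gradients.

    A ring all-reduce moves 2 * (p - 1) / p * M bytes per worker,
    where M is the gradient size and p the number of workers.
    """
    return 2 * (workers - 1) / workers * model_bytes


def epoch_comm_volume(dataset_size: int, local_batch: int,
                      model_bytes: int, workers: int) -> float:
    """Per-worker communication volume (bytes) of one training epoch.

    Assumes synchronous data-parallel SGD: one gradient all-reduce
    per step, with a global batch of local_batch * workers samples.
    """
    steps = -(-dataset_size // (local_batch * workers))  # ceiling division
    return steps * ring_allreduce_bytes(model_bytes, workers)


# Hypothetical scenario: ~100 MiB of gradients, an ImageNet-sized
# dataset (1,281,167 samples), 64 workers.
for batch in (32, 64, 128):
    vol = epoch_comm_volume(1_281_167, batch, 100 * 2**20, 64)
    print(f"local batch {batch:4d}: {vol / 2**30:6.1f} GiB per worker per epoch")
```

Under this simple model, doubling the local batch size halves the number of optimization steps per epoch and therefore halves the per-epoch communication volume; the catch is that larger batches can degrade convergence, which is one reason the heuristics described below must come with training accuracy guarantees.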
This project is organized around three research thrusts: (i) estimating data movement costs when training DNNs on HPC machines; (ii) augmenting the performance models with energy metrics; and (iii) developing bi-objective heuristics that minimize communication and energy while still providing training accuracy guarantees. To address these three thrusts, the project will adopt a simulation-driven approach. The first step will be to characterize the I/O behavior of DNNs when trained on HPC machines. Based on the analysis of the collected data, several performance models and scheduling heuristics will be designed. Then, a simulator of the HPC machine will be developed using the NSF-funded WRENCH project. This simulator will be calibrated with the data collected during the characterization phase. Finally, the performance models and the scheduling heuristics will be evaluated using the calibrated simulator. The project will also leverage the simulator to continuously improve the performance models and heuristics. This project will provide scientists with models to better understand the performance trade-offs that arise when training large-scale neural networks on complex distributed systems. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
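To make thrust (iii) concrete, the following is a minimal sketch of how a bi-objective heuristic could rank candidate training configurations once a calibrated simulator has predicted their communication and energy costs. The Config fields, the pick_config helper, and the weighted-sum scalarization are hypothetical illustrations under stated assumptions, not the project's actual heuristics; the sample numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class Config:
    """A candidate training configuration evaluated in simulation."""
    name: str
    comm_bytes: float     # predicted per-epoch communication volume
    energy_joules: float  # predicted per-epoch energy consumption
    est_accuracy: float   # predicted final accuracy (e.g., from a batch-size model)


def pick_config(candidates: list[Config], min_accuracy: float,
                alpha: float = 0.5) -> Config:
    """Weighted-sum bi-objective heuristic.

    Keeps only configurations meeting the accuracy constraint, then
    minimizes a normalized combination of communication and energy;
    alpha = 1 optimizes communication only, alpha = 0 energy only.
    """
    feasible = [c for c in candidates if c.est_accuracy >= min_accuracy]
    if not feasible:
        raise ValueError("no configuration meets the accuracy guarantee")
    max_comm = max(c.comm_bytes for c in feasible)
    max_energy = max(c.energy_joules for c in feasible)
    return min(feasible, key=lambda c: alpha * c.comm_bytes / max_comm
                                       + (1 - alpha) * c.energy_joules / max_energy)


# Made-up simulator predictions for three batch-size configurations.
configs = [
    Config("batch-32", comm_bytes=1.3e11, energy_joules=8.0e8, est_accuracy=0.765),
    Config("batch-128", comm_bytes=3.2e10, energy_joules=6.5e8, est_accuracy=0.752),
    Config("batch-512", comm_bytes=8.1e9, energy_joules=6.0e8, est_accuracy=0.721),
]
print(pick_config(configs, min_accuracy=0.75).name)  # -> batch-128
```

A weighted sum is the simplest scalarization of a bi-objective problem; enumerating the Pareto front of feasible configurations would be an equally plausible design, and the accuracy threshold stands in for the training accuracy guarantees named in thrust (iii).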
Project Outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other Grants by Loic Pottier
Collaborative Research: CyberTraining: Implementation: Small: Integrating core CI literacy and skills into university curricula via simulation-driven activities
- Award Number: 1923539
- Fiscal Year: 2019
- Funding Amount: $175,000
- Project Type: Standard Grant
Similar NSFC Grants
Molecular basis by which Z8-12:OH and Z8-14:OAc maintain the sex-attractant specificity of the oriental fruit moth and the plum fruit moth, respectively
- Award Number:
- Approval Year: 2021
- Funding Amount: ¥350,000
- Project Type: Regional Science Fund Project
Mechanistic study of the photoisomerization of the ruthenium nitrosyl complex [Ru(OAc)(2mqn)2NO]
- Award Number: 21603131
- Approval Year: 2016
- Funding Amount: ¥190,000
- Project Type: Young Scientists Fund Project
Mn(OAc)3-promoted radical cascade reactions under mechanochemical conditions
- Award Number: 21242013
- Approval Year: 2012
- Funding Amount: ¥100,000
- Project Type: Special Fund Project
Similar Overseas Grants
CRII: OAC: A Compressor-Assisted Collective Communication Framework for GPU-Based Large-Scale Deep Learning
- Award Number: 2348465
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
- Award Number: 2403312
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant
Collaborative Research: OAC CORE: Federated-Learning-Driven Traffic Event Management for Intelligent Transportation Systems
- Award Number: 2414474
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant
OAC Core: Cost-Adaptive Monitoring and Real-Time Tuning at Function-Level
- Award Number: 2402542
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant
OAC Core: OAC Core Projects: GPU Geometric Data Processing
- Award Number: 2403239
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant
CRII: OAC: Dynamically Adaptive Unstructured Mesh Technologies for High-Order Multiscale Fluid Dynamics Simulations
- Award Number: 2348394
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant
CRII: OAC: A Multi-fidelity Computational Framework for Discovering Governing Equations Under Uncertainty
- Award Number: 2348495
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
- Award Number: 2402947
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
- Award Number: 2403313
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant
Collaborative Research: OAC Core: Large-Scale Spatial Machine Learning for 3D Surface Topology in Hydrological Applications
- Award Number: 2414185
- Fiscal Year: 2024
- Funding Amount: $175,000
- Project Type: Standard Grant