
CRII: OAC: A Compressor-Assisted Collective Communication Framework for GPU-Based Large-Scale Deep Learning


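Basic Information

Award number:
2348465
Principal investigator:
Xiaodong Yu
Amount:
$175,000
Host institution:
Stevens Institute of Technology
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2024
Funding country:
United States
Period:
2024-06-01 to 2026-05-31
Project status:
Active (not yet concluded)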

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2348465&HistoricalAwards=false
Keywords:
CRII OAC Compressor Assisted Collective

Project Abstract

The scale of modern deep learning is expanding rapidly due to larger training datasets, larger neural network models, and new algorithms and techniques. This growth presents significant challenges to current distributed high-performance computing (HPC) infrastructures, since larger-scale training incurs more expensive collective communication to pass larger gradient messages among nodes. A more powerful hardware platform may not overcome this performance bottleneck on its own, because optimized middleware support is needed to fully unleash the platform's computing capacity. This project aims to close the gap between the training scale and the infrastructure's capability by providing gradient-specific lossy compression techniques and an optimized GPU-aware, compressor-assisted collective communication framework that systematically reduces gradient message sizes and improves communication performance. The deliverables can help end users obtain significantly faster training while preserving training accuracy. The success of this research can promote progress both in traditional AI research, such as computer vision and natural language processing, and in emerging AI for Science research for domain sciences, including cosmology, X-ray imaging, and drug discovery. This project also contributes to educational and engagement activities by leveraging the research outcomes to develop new curricula and teaching tools for mentoring college students and training K-12 students in HPC and AI.

Using current collective communication libraries for large-scale distributed deep learning can yield significant communication overhead because the gradient messages are large. Applying lossy compression techniques to gradient messages could potentially reduce this overhead. However, several important open research questions should be investigated to ensure the performance gain: 1) Are the current lossy compressors efficient enough for gradient data? 2) How can lossy compressors be efficiently integrated into a GPU-aware collective communication framework? 3) How can GPU resources be efficiently shared among different tasks? This project addresses these questions and delivers a novel compressor-assisted, GPU-aware collective communication framework for large-scale deep learning. Specifically, the team 1) investigates the efficiency of using error-bounded lossy compressors for scientific data to compress gradient data, and develops a new gradient compressor that leverages the advantages of different existing compressors to achieve a better compression ratio and training accuracy; 2) designs the new compressor's GPU implementation and integrates it into GPU-aware MPI, then optimizes the workflow to hide the gradient compressor's cost within the communication cost; 3) profiles the GPU resource utilization of both the deep learning training and the compressor-assisted collective communications, and designs a new communication framework that enables task scheduling of training, compression, and the collectives' computations (e.g., reduction) on the same GPU to achieve optimal resource sharing for end-to-end deep learning training.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
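The idea of error-bounded lossy compression applied to gradients before a collective reduction, as described in the abstract, can be illustrated with a minimal CPU-only sketch. It is not the project's compressor or framework: a plain uniform quantizer stands in for an error-bounded compressor, the "allreduce" is simulated in a single process, and the names `compress`, `decompress`, and `compressed_allreduce_sum` are hypothetical.

```python
import random

def compress(grad, error_bound):
    """Map each float to an integer bin of width 2 * error_bound.

    Stand-in for an error-bounded lossy compressor (assumption):
    the reconstruction error per element is at most error_bound.
    """
    step = 2.0 * error_bound
    return [round(g / step) for g in grad]

def decompress(codes, error_bound):
    """Reconstruct approximate floats from integer bin indices."""
    step = 2.0 * error_bound
    return [c * step for c in codes]

def compressed_allreduce_sum(worker_grads, error_bound):
    """Simulate a sum-allreduce in which every worker sends integer codes.

    The integer codes are summed exactly, then decompressed once, so the
    error of each summed element is at most len(worker_grads) * error_bound.
    """
    total_codes = [0] * len(worker_grads[0])
    for grad in worker_grads:
        for i, code in enumerate(compress(grad, error_bound)):
            total_codes[i] += code
    return decompress(total_codes, error_bound)

# Small demonstration with synthetic "gradients" from 4 simulated workers.
random.seed(0)
error_bound = 1e-3
worker_grads = [[random.gauss(0.0, 0.1) for _ in range(8)] for _ in range(4)]

exact = [sum(column) for column in zip(*worker_grads)]
approx = compressed_allreduce_sum(worker_grads, error_bound)
max_error = max(abs(a - b) for a, b in zip(exact, approx))
print("max error of reduced gradient:", max_error)
print("worst-case bound:", len(worker_grads) * error_bound)
```

In this sketch the small integer codes are what would travel over the network instead of full-precision floats; a real system would additionally entropy-code them and run compression on the GPU, overlapping it with communication as the abstract proposes.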
