CRII: OAC: A Compressor-Assisted Collective Communication Framework for GPU-Based Large-Scale Deep Learning
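Basic information
- Award number: 2348465
- Principal investigator:
- Amount: $175,000
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2024
- Funding country: United States
- Period: 2024-06-01 to 2026-05-31
- Project status: Ongoing
- Source:
- Keywords:
Project abstract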
The scale of modern deep learning is expanding rapidly due to larger training datasets, larger neural network models, and new algorithms and techniques. This presents significant challenges to current distributed high-performance computing (HPC) infrastructures, since larger-scale training incurs more expensive collective communication costs for passing larger gradient messages among nodes. A more powerful hardware platform may not necessarily overcome this performance bottleneck, as optimized middleware support is needed to fully unleash the platform's computing capacity. This project aims to close the gap between the training scale and the infrastructure's capability by providing gradient-specific lossy compression techniques and an optimized GPU-aware compressor-assisted collective communication framework that systematically reduce gradient message sizes and improve communication performance. The deliverables can help end users achieve significantly faster training with preserved training accuracy. The success of this research can promote progress in both traditional AI research, such as computer vision and natural language processing, and emerging AI for Science research in domain sciences, including cosmology, X-ray imaging, and drug discovery. This project also contributes to educational and engagement activities by leveraging the research outcomes to develop new curricula and teaching tools for mentoring college students and training K-12 students in HPC and AI.
Using current collective communication libraries for large-scale distributed deep learning can yield significant communication overhead because the gradient messages are large. Applying lossy compression techniques to gradient messages could potentially reduce this overhead. However, several important open research questions should be investigated to ensure the performance gain: 1) Are current lossy compressors efficient enough for gradient data? 2) How can lossy compressors be efficiently integrated into a GPU-aware collective communication framework? 3) How can GPU resources be efficiently shared among different tasks? This project addresses these questions and delivers a novel compressor-assisted GPU-aware collective communication framework for large-scale deep learning. Specifically, the team 1) investigates the efficiency of using error-bounded lossy compressors for scientific data to compress gradient data, and develops a new gradient compressor that leverages the advantages of different existing compressors to achieve a better compression ratio and training accuracy; 2) designs the new compressor's GPU implementation and integrates it into GPU-aware MPI, then optimizes the workflow to ultimately hide the gradient compressor's cost within the communication cost; 3) profiles the GPU resource utilization of both deep learning training and compressor-assisted collective communication, and designs a new communication framework that schedules training, compression, and the collectives' computations (e.g., reduction) on the same GPU to achieve optimal resource sharing for end-to-end deep learning training.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
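To make the idea of error-bounded gradient compression inside a collective more concrete, the following is a minimal sketch in plain Python. It is not the project's compressor or framework: it assumes simple uniform quantization as a stand-in for an error-bounded compressor, simulates workers as in-process lists rather than MPI ranks, and all function names (`compress`, `decompress`, `compressed_allreduce`) are hypothetical. It illustrates only that a per-value error bound translates into a bounded error on the reduced (summed) gradient.

```python
import random


def compress(grad, error_bound):
    """Uniform quantization (assumed stand-in for an error-bounded compressor).

    Each value is mapped to an integer bin index; the reconstruction error
    per value is at most error_bound (up to floating-point rounding).
    """
    step = 2.0 * error_bound
    return [round(g / step) for g in grad]


def decompress(codes, error_bound):
    """Map integer bin indices back to approximate floating-point values."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


def compressed_allreduce(worker_grads, error_bound):
    """Simulate a sum-allreduce in which every worker sends quantized codes.

    The integer codes are summed directly (the reduction step), and the
    result is decompressed once. With W workers, the error of each summed
    value is at most W * error_bound.
    """
    length = len(worker_grads[0])
    total = [0] * length
    for grad in worker_grads:
        codes = compress(grad, error_bound)
        for i, code in enumerate(codes):
            total[i] += code
    return decompress(total, error_bound)


if __name__ == "__main__":
    random.seed(0)
    # Four hypothetical workers, each holding a small synthetic gradient.
    grads = [[random.gauss(0.0, 0.01) for _ in range(8)] for _ in range(4)]
    reduced = compressed_allreduce(grads, error_bound=1e-3)
    exact = [sum(column) for column in zip(*grads)]
    worst = max(abs(a - e) for a, e in zip(reduced, exact))
    print("max abs error of reduced gradient:", worst)
```

In a real GPU-aware setting the quantized codes would additionally be entropy-coded and exchanged through a collective such as an allreduce, and the trade-off between the error bound, the compression ratio, and training accuracy is exactly what the abstract identifies as an open question.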
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Xiaodong Yu
Simulation on Supporting Characteristics of Heavy Hydrostatic Thrust Bearing
- DOI: 10.4028/www.scientific.net/amm.157-158.94
- Published: 2012-02
- Journal:
- Impact factor: 0
- Authors: Yanqin Zhang; Rui Li; Chunxi Dai; Junpeng Shao; Xiaodong Yu; Bai Qin
- Corresponding author: Bai Qin
Lab-on-a-chip for analysis of triglycerides based on a replaceable enzyme carrier using magnetic beads.
- DOI:
- Published: 2010
- Journal:
- Impact factor: 0
- Authors: Shao; Xiaodong Yu; Jingjuan Xu; Hongyuan Chen
- Corresponding author: Hongyuan Chen
A Multi-Scale Spatial-Temporal Attention Model for Person Re-Identification in Videos
- DOI: 10.1109/tip.2019.2959653
- Published: 2019-12
- Journal:
- Impact factor: 0
- Authors: Wei Zhang; Xuanyu He; Xiaodong Yu; Weizhi Lu; Zhengjun Zha; Qi Tian
- Corresponding author: Qi Tian
Transient Simulation for a Pumped Storage Power Plant Considering Pressure Pulsation Based on Field Test
- DOI: 10.3390/en12132498
- Published: 2019-06
- Journal:
- Impact factor: 3.2
- Authors: Lei Zhang; Jian Zhang; Xiaodong Yu; Jiawen Lyu; Xiaoying Zhang
- Corresponding author: Xiaoying Zhang
Effect of Runoff Variability and Sea Level on Saltwater Intrusion: A Case Study of Nandu River Estuary, China
- DOI: 10.1029/2018wr023285
- Published:
- Journal:
- Impact factor: 5.4
- Authors: Wei He; Jian Zhang; Xiaodong Yu
- Corresponding author: Xiaodong Yu
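Similar NSFC grants
Molecular basis by which Z8-12:OH and Z8-14:OAc respectively maintain sex-attractant specificity in the oriental fruit moth and the plum fruit moth
- Award number:
- Approval year: 2021
- Funding amount: CNY 350,000
- Project type: Regional Science Fund Project
Mechanistic study of the photoisomerization of the nitrosyl ruthenium complex [Ru(OAc)(2mqn)2NO]
- Award number: 21603131
- Approval year: 2016
- Funding amount: CNY 190,000
- Project type: Young Scientists Fund Project
Study of Mn(OAc)3-promoted radical cascade reactions under mechanochemical conditions
- Award number: 21242013
- Approval year: 2012
- Funding amount: CNY 100,000
- Project type: Special Fund Project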
Similar overseas grants
Collaborative Research: OAC CORE: Federated-Learning-Driven Traffic Event Management for Intelligent Transportation Systems
- Award number: 2414474
- Fiscal year: 2024
- Funding amount: $175,000
- Project type: Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
- Award number: 2403312
- Fiscal year: 2024
- Funding amount: $175,000
- Project type: Standard Grant
OAC Core: Cost-Adaptive Monitoring and Real-Time Tuning at Function-Level
- Award number: 2402542
- Fiscal year: 2024
- Funding amount: $175,000
- Project type: Standard Grant
OAC Core: OAC Core Projects: GPU Geometric Data Processing
- Award number: 2403239
- Fiscal year: 2024
- Funding amount: $175,000
- Project type: Standard Grant
CRII: OAC: Dynamically Adaptive Unstructured Mesh Technologies for High-Order Multiscale Fluid Dynamics Simulations
- Award number: 2348394
- Fiscal year: 2024
- Funding amount: $175,000
- Project type: Standard Grant
CRII: OAC: A Multi-fidelity Computational Framework for Discovering Governing Equations Under Uncertainty
- Award number: 2348495
- Fiscal year: 2024
- Funding amount: $175,000
- Project type: Standard Grant
Collaborative Research: OAC Core: Large-Scale Spatial Machine Learning for 3D Surface Topology in Hydrological Applications
- Award number: 2414185
- Fiscal year: 2024
- Funding amount: $175,000
- Project type: Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
- Award number: 2402947
- Fiscal year: 2024
- Funding amount: $175,000
- Project type: Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
- Award number: 2403313
- Fiscal year: 2024
- Funding amount: $175,000
- Project type: Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
- Award number: 2402946
- Fiscal year: 2024
- Funding amount: $175,000
- Project type: Standard Grant