CRII: OAC: An Efficient Lossy Compression Framework for Reducing Memory Footprint for Extreme-Scale Deep Learning on GPU-Based HPC Systems

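Basic information

  • Award number:
    2303820
  • Principal investigator:
  • Amount:
    $174,600
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Standard Grant
  • Fiscal year:
    2022
  • Funding country:
    United States
  • Period:
    2022-10-01 to 2024-04-30
  • Project status:
    Completed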

Project abstract

Deep learning (DL) has rapidly evolved to a state-of-the-art technique in many science and technology disciplines, such as scientific exploration, national security, smart environment, and healthcare. Many of these DL applications require using high-performance computing (HPC) resources to process large amounts of data. Researchers and scientists, for instance, are employing extreme-scale DL applications in HPC infrastructures to classify extreme weather patterns and high-energy particles. In recent years, using Graphics Processing Units (GPUs) to accelerate DL applications has attracted increasing attention. However, the ever-increasing scales of DL applications bring many challenges to today’s GPU-based HPC infrastructures. The key challenge is the huge gap (e.g., one to two orders of magnitude) between the memory requirement and its availability on GPUs. This project aims to fill this gap by developing a novel framework to reduce the memory demand effectively and efficiently via data compression technologies for extreme-scale DL applications. The proposed research will enhance the GPU-based HPC infrastructures in broad communities for many scientific disciplines that rely on DL technologies. The project will connect machine learning and HPC communities and increase interactions between them. Educational and engagement activities include developing new curriculum related to data compression, mentoring a selected group of high school students in a year-long research project for a regional Science Fair competition, and increasing the community's understanding of leveraging HPC infrastructures for DL technologies. The project will also encourage student interest in research related to DL technologies on HPC environment and promote research collaborations with multiple national laboratories.
Existing state-of-the-art GPU memory saving methods for training extreme-scale deep neural networks (DNNs) suffer from high performance overhead and/or low memory footprint reduction. Error-bounded lossy compression is a promising approach to significantly reduce the memory footprint while still meeting the required analysis accuracy. This project will explore how to leverage error-bounded lossy compression on DNN intermediate data to reduce the memory footprint for extreme-scale DNN training. The project has a three-stage research plan. First, the team will comprehensively investigate the impacts of applying error-bounded lossy compression to DNN intermediate data on both validation accuracy and training performance, using different error-bounded lossy compressors, compression modes, and error bounds on the targeted DNNs and datasets. Second, the team will optimize the compression quality of suitable error-bounded lossy compressors on different intermediate data based on the impact analysis outcome, and design an efficient scheme to adaptively apply a best-fit compression solution. Finally, the team will optimize the compression performance on the proposed lossy compression framework for state-of-the-art GPUs. The team will evaluate the proposed framework on high-resolution climate analytics and high-energy particle physics applications and compare it with existing state-of-the-art techniques based on both the memory footprint reduction ratio and training performance improvements (e.g., throughput, time, epoch number). The project will enable scientists and researchers to train extreme-scale DNNs with a given set of computing resources in a fast and efficient manner, opening opportunities for new discoveries.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
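
To make the term "error-bounded lossy compression" in the abstract concrete, the following is a minimal sketch of the general idea only, not the project's framework or any specific compressor. It assumes an absolute error bound and plain uniform quantization; the names `compress`, `decompress`, and `activations` are hypothetical, and real compressors add prediction and entropy coding on top of a step like this.

```python
import random

def compress(values, error_bound):
    """Map each float to an integer code using bins of width 2 * error_bound."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]

def decompress(codes, error_bound):
    """Reconstruct each value as the center of its quantization bin."""
    step = 2.0 * error_bound
    return [c * step for c in codes]

# Hypothetical stand-in for DNN intermediate data (e.g., activations).
random.seed(0)
activations = [random.gauss(0.0, 1.0) for _ in range(1000)]

eb = 1e-2  # assumed absolute error bound
codes = compress(activations, eb)
restored = decompress(codes, eb)

# Every reconstructed value stays within the bound (up to float rounding).
max_err = max(abs(a - b) for a, b in zip(activations, restored))
```

The integer codes span a much narrower range than the original floats, which is what a later encoding stage would exploit to save memory. A looser bound gives fewer distinct codes at the cost of larger reconstruction error.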

Project outcomes

Journal articles (19)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
TSM2X: High-performance tall-and-skinny matrix-matrix multiplication on GPUs
  • DOI:
    10.1016/j.jpdc.2021.02.013
  • Published:
    2020-02
  • Journal:
  • Impact factor:
    0
  • Authors:
    Cody Rivera;Jieyang Chen;Nan Xiong;Jing Zhang;S. Song;Dingwen Tao
  • Corresponding authors:
    Cody Rivera;Jieyang Chen;Nan Xiong;Jing Zhang;S. Song;Dingwen Tao
RTMobile: Beyond Real-Time Mobile Acceleration of RNNs for Speech Recognition
  • DOI:
    10.1109/dac18072.2020.9218499
  • Published:
    2020-02
  • Journal:
  • Impact factor:
    0
  • Authors:
    Peiyan Dong;Siyue Wang;Wei Niu;Chengming Zhang;Sheng Lin;Z. Li;Yifan Gong;Bin Ren;X. Lin;Yanzhi Wang;Dingwen Tao
  • Corresponding authors:
    Peiyan Dong;Siyue Wang;Wei Niu;Chengming Zhang;Sheng Lin;Z. Li;Yifan Gong;Bin Ren;X. Lin;Yanzhi Wang;Dingwen Tao
HBMax: Optimizing Memory Efficiency for Parallel Influence Maximization on Multicore Architectures
ClickTrain: efficient and accurate end-to-end deep learning training via fine-grained architecture-preserving pruning
  • DOI:
    10.1145/3447818.3459988
  • Published:
    2020-11
  • Journal:
  • Impact factor:
    0
  • Authors:
    Chengming Zhang;Geng Yuan;Wei Niu;Jiannan Tian;Sian Jin;Donglin Zhuang;Zhe Jiang;Yanzhi Wang;Bin Ren;S. Song;Dingwen Tao
  • Corresponding authors:
    Chengming Zhang;Geng Yuan;Wei Niu;Jiannan Tian;Sian Jin;Donglin Zhuang;Zhe Jiang;Yanzhi Wang;Bin Ren;S. Song;Dingwen Tao
Revisiting Huffman Coding: Toward Extreme Performance on Modern GPU Architectures
Other publications by Dingwen Tao

FastCLIP: A Suite of Optimization Techniques to Accelerate CLIP Training with Limited Resources
  • DOI:
  • Published:
    2024
  • Journal:
  • Impact factor:
    0
  • Authors:
    Xiyuan Wei;Fanjiang Ye;Ori Yonay;Xingyu Chen;Baixi Sun;Dingwen Tao;Tianbao Yang
  • Corresponding author:
    Tianbao Yang
Z-checker: A framework for assessing lossy compression of scientific data
Extending checksum-based ABFT to tolerate soft errors online in iterative methods
Performance Optimization for Relative-Error-Bounded Lossy Compression on Scientific Data
  • DOI:
    10.1109/tpds.2020.2972548
  • Published:
    2020-07
  • Journal:
  • Impact factor:
    5.3
  • Authors:
    Xiangyu Zou;Tao Lu;Wen Xia;Xuan Wang;Weizhe Zhang;Haijun Zhang;Sheng Di;Dingwen Tao;Franck Cappello
  • Corresponding author:
    Franck Cappello
A High-Quality Workflow for Multi-Resolution Scientific Data Reduction and Visualization
  • DOI:
  • Published:
    2024
  • Journal:
  • Impact factor:
    0
  • Authors:
    Daoce Wang;Pascal Grosset;Jesus Pulido;Tushar M. Athawale;Jiannan Tian;Kai Zhao;Z. Lukic;Axel Huebl;Zhe Wang;James P. Ahrens;Dingwen Tao
  • Corresponding author:
    Dingwen Tao

Other grants by Dingwen Tao

CAREER: A Highly Effective, Usable, Performant, Scalable Data Reduction Framework for HPC Systems and Applications
  • Award number:
    2232120
  • Fiscal year:
    2023
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant
Collaborative Research: Frameworks: FZ: A fine-tunable cyberinfrastructure framework to streamline specialized lossy compression development
  • Award number:
    2311876
  • Fiscal year:
    2023
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant
Collaborative Research: SHF: Small: Reimagining Communication Bottlenecks in GNN Acceleration through Collaborative Locality Enhancement and Compression Co-Design
  • Award number:
    2326495
  • Fiscal year:
    2023
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant
CAREER: A Highly Effective, Usable, Performant, Scalable Data Reduction Framework for HPC Systems and Applications
  • Award number:
    2312673
  • Fiscal year:
    2023
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant
CDS&E: Collaborative Research: HyLoC: Objective-driven Adaptive Hybrid Lossy Compression Framework for Extreme-Scale Scientific Applications
  • Award number:
    2303064
  • Fiscal year:
    2022
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: CEAPA: A Systematic Approach to Minimize Compression Error Propagation in HPC Applications
  • Award number:
    2211539
  • Fiscal year:
    2022
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: CEAPA: A Systematic Approach to Minimize Compression Error Propagation in HPC Applications
  • Award number:
    2247060
  • Fiscal year:
    2022
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant
Collaborative Research: Elements: ROCCI: Integrated Cyberinfrastructure for In Situ Lossy Compression Optimization Based on Post Hoc Analysis Requirements
  • Award number:
    2247080
  • Fiscal year:
    2022
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant
Collaborative Research: Elements: ROCCI: Integrated Cyberinfrastructure for In Situ Lossy Compression Optimization Based on Post Hoc Analysis Requirements
  • Award number:
    2104024
  • Fiscal year:
    2021
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant
CDS&E: Collaborative Research: HyLoC: Objective-driven Adaptive Hybrid Lossy Compression Framework for Extreme-Scale Scientific Applications
  • Award number:
    2042084
  • Fiscal year:
    2020
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant

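Similar National Natural Science Foundation of China grants

Molecular basis by which Z8-12:OH and Z8-14:OAc respectively maintain sex-attractant specificity in the oriental fruit moth and the plum fruit moth
  • Award number:
  • Year approved:
    2021
  • Funding amount:
    350,000 CNY
  • Award type:
    Regional Science Fund Project
Mechanism of the photoisomerization reaction of the nitrosyl ruthenium complex [Ru(OAc)(2mqn)2NO]
  • Award number:
    21603131
  • Year approved:
    2016
  • Funding amount:
    190,000 CNY
  • Award type:
    Young Scientists Fund Project
Mn(OAc)3-promoted radical cascade reactions under mechanochemical conditions
  • Award number:
    21242013
  • Year approved:
    2012
  • Funding amount:
    100,000 CNY
  • Award type:
    Special Fund Project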

Similar overseas grants

CRII: OAC: A Compressor-Assisted Collective Communication Framework for GPU-Based Large-Scale Deep Learning
  • Award number:
    2348465
  • Fiscal year:
    2024
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
  • Award number:
    2403312
  • Fiscal year:
    2024
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant
Collaborative Research: OAC CORE: Federated-Learning-Driven Traffic Event Management for Intelligent Transportation Systems
  • Award number:
    2414474
  • Fiscal year:
    2024
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant
OAC Core: Cost-Adaptive Monitoring and Real-Time Tuning at Function-Level
  • Award number:
    2402542
  • Fiscal year:
    2024
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant
OAC Core: OAC Core Projects: GPU Geometric Data Processing
  • Award number:
    2403239
  • Fiscal year:
    2024
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant
CRII: OAC: Dynamically Adaptive Unstructured Mesh Technologies for High-Order Multiscale Fluid Dynamics Simulations
  • Award number:
    2348394
  • Fiscal year:
    2024
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant
CRII: OAC: A Multi-fidelity Computational Framework for Discovering Governing Equations Under Uncertainty
  • Award number:
    2348495
  • Fiscal year:
    2024
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: Large-Scale Spatial Machine Learning for 3D Surface Topology in Hydrological Applications
  • Award number:
    2414185
  • Fiscal year:
    2024
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
  • Award number:
    2402947
  • Fiscal year:
    2024
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
  • Award number:
    2403313
  • Fiscal year:
    2024
  • Funding amount:
    $174,600
  • Award type:
    Standard Grant