CRII: OAC: An Efficient Lossy Compression Framework for Reducing Memory Footprint for Extreme-Scale Deep Learning on GPU-Based HPC Systems

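Basic Information

  • Award number:
    2303820
  • Principal investigator:
  • Amount:
    $174,600
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2022
  • Funding country:
    United States
  • Project period:
    2022-10-01 to 2024-04-30
  • Project status:
    Completed

Project Abstract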

Deep learning (DL) has rapidly evolved into a state-of-the-art technique in many science and technology disciplines, such as scientific exploration, national security, smart environments, and healthcare. Many of these DL applications require high-performance computing (HPC) resources to process large amounts of data. Researchers and scientists, for instance, are employing extreme-scale DL applications on HPC infrastructures to classify extreme weather patterns and high-energy particles. In recent years, using Graphics Processing Units (GPUs) to accelerate DL applications has attracted increasing attention. However, the ever-increasing scale of DL applications brings many challenges to today’s GPU-based HPC infrastructures. The key challenge is the huge gap (e.g., one to two orders of magnitude) between the memory requirement and its availability on GPUs. This project aims to fill this gap by developing a novel framework to reduce the memory demand effectively and efficiently via data compression technologies for extreme-scale DL applications. The proposed research will enhance GPU-based HPC infrastructures in broad communities for many scientific disciplines that rely on DL technologies. The project will connect the machine learning and HPC communities and increase interactions between them. Educational and engagement activities include developing a new curriculum related to data compression, mentoring a selected group of high school students in a year-long research project for a regional Science Fair competition, and increasing the community's understanding of leveraging HPC infrastructures for DL technologies. The project will also encourage student interest in research related to DL technologies in HPC environments and promote research collaborations with multiple national laboratories.

Existing state-of-the-art GPU memory saving methods for training extreme-scale deep neural networks (DNNs) suffer from high performance overhead and/or low memory footprint reduction. Error-bounded lossy compression is a promising approach to significantly reduce the memory footprint while still meeting the required analysis accuracy. This project will explore how to leverage error-bounded lossy compression on DNN intermediate data to reduce the memory footprint for extreme-scale DNN training. The project has a three-stage research plan. First, the team will comprehensively investigate the impacts of applying error-bounded lossy compression to DNN intermediate data on both validation accuracy and training performance, using different error-bounded lossy compressors, compression modes, and error bounds on the targeted DNNs and datasets. Second, the team will optimize the compression quality of suitable error-bounded lossy compressors on different intermediate data based on the impact analysis outcome, and design an efficient scheme to adaptively apply a best-fit compression solution. Finally, the team will optimize the compression performance of the proposed lossy compression framework for state-of-the-art GPUs. The team will evaluate the proposed framework on high-resolution climate analytics and high-energy particle physics applications and compare it with existing state-of-the-art techniques based on both the memory footprint reduction ratio and training performance improvements (e.g., throughput, time, epoch number). The project will enable scientists and researchers to train extreme-scale DNNs with a given set of computing resources in a fast and efficient manner, opening opportunities for new discoveries.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
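
The abstract does not name a specific compressor or describe its internals. As a rough illustration of what "error-bounded" means, the sketch below assumes a plain uniform scalar quantizer applied to a list of values standing in for DNN intermediate data. It is not the project's framework or any existing compressor, and the names `compress`, `decompress`, and `error_bound` are hypothetical.

```python
import random


def compress(values, error_bound):
    """Map each float to an integer bin index (assumed scheme: uniform quantization).

    Bins are 2 * error_bound wide, so the bin center is never farther than
    error_bound from the original value (up to floating-point rounding).
    """
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def decompress(codes, error_bound):
    """Reconstruct each value as the center of its quantization bin."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


# Synthetic stand-in for activation data; real intermediate data would come
# from a training run.
random.seed(0)
activations = [random.gauss(0.0, 1.0) for _ in range(1000)]

eb = 1e-2  # absolute error bound chosen only for illustration
codes = compress(activations, eb)
restored = decompress(codes, eb)

# The largest pointwise reconstruction error stays within the requested bound.
max_err = max(abs(a - b) for a, b in zip(activations, restored))
print(max_err <= eb * (1 + 1e-9))
```

The integer codes take far fewer distinct values than the original floats, which is what lets a later lossless stage (for example, entropy coding) shrink them. A larger `error_bound` gives fewer distinct codes and more reduction at the cost of accuracy, which is the trade-off the first research stage sets out to measure.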

Project Outcomes

Journal articles (19)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
TSM2X: High-performance tall-and-skinny matrix-matrix multiplication on GPUs
  • DOI:
    10.1016/j.jpdc.2021.02.013
  • Publication date:
    2020-02
  • Journal:
  • Impact factor:
    0
  • Authors:
    Cody Rivera;Jieyang Chen;Nan Xiong;Jing Zhang;S. Song;Dingwen Tao
  • Corresponding authors:
    Cody Rivera;Jieyang Chen;Nan Xiong;Jing Zhang;S. Song;Dingwen Tao
RTMobile: Beyond Real-Time Mobile Acceleration of RNNs for Speech Recognition
  • DOI:
    10.1109/dac18072.2020.9218499
  • Publication date:
    2020-02
  • Journal:
  • Impact factor:
    0
  • Authors:
    Peiyan Dong;Siyue Wang;Wei Niu;Chengming Zhang;Sheng Lin;Z. Li;Yifan Gong;Bin Ren;X. Lin;Yanzhi Wang;Dingwen Tao
  • Corresponding authors:
    Peiyan Dong;Siyue Wang;Wei Niu;Chengming Zhang;Sheng Lin;Z. Li;Yifan Gong;Bin Ren;X. Lin;Yanzhi Wang;Dingwen Tao
HBMax: Optimizing Memory Efficiency for Parallel Influence Maximization on Multicore Architectures
ClickTrain: efficient and accurate end-to-end deep learning training via fine-grained architecture-preserving pruning
  • DOI:
    10.1145/3447818.3459988
  • Publication date:
    2020-11
  • Journal:
  • Impact factor:
    0
  • Authors:
    Chengming Zhang;Geng Yuan;Wei Niu;Jiannan Tian;Sian Jin;Donglin Zhuang;Zhe Jiang;Yanzhi Wang;Bin Ren;S. Song;Dingwen Tao
  • Corresponding authors:
    Chengming Zhang;Geng Yuan;Wei Niu;Jiannan Tian;Sian Jin;Donglin Zhuang;Zhe Jiang;Yanzhi Wang;Bin Ren;S. Song;Dingwen Tao
Revisiting Huffman Coding: Toward Extreme Performance on Modern GPU Architectures

Other publications by Dingwen Tao

FastCLIP: A Suite of Optimization Techniques to Accelerate CLIP Training with Limited Resources
  • DOI:
  • Publication date:
    2024
  • Journal:
  • Impact factor:
    0
  • Authors:
    Xiyuan Wei;Fanjiang Ye;Ori Yonay;Xingyu Chen;Baixi Sun;Dingwen Tao;Tianbao Yang
  • Corresponding author:
    Tianbao Yang
Z-checker: A framework for assessing lossy compression of scientific data
Extending checksum-based ABFT to tolerate soft errors online in iterative methods
Performance Optimization for Relative-Error-Bounded Lossy Compression on Scientific Data
  • DOI:
    10.1109/tpds.2020.2972548
  • Publication date:
    2020-07
  • Journal:
  • Impact factor:
    5.3
  • Authors:
    Xiangyu Zou;Tao Lu;Wen Xia;Xuan Wang;Weizhe Zhang;Haijun Zhang;Sheng Di;Dingwen Tao;Franck Cappello
  • Corresponding author:
    Franck Cappello
A High-Quality Workflow for Multi-Resolution Scientific Data Reduction and Visualization
  • DOI:
  • Publication date:
    2024
  • Journal:
  • Impact factor:
    0
  • Authors:
    Daoce Wang;Pascal Grosset;Jesus Pulido;Tushar M. Athawale;Jiannan Tian;Kai Zhao;Z. Lukic;Axel Huebl;Zhe Wang;James P. Ahrens;Dingwen Tao
  • Corresponding author:
    Dingwen Tao

Other grants by Dingwen Tao

CAREER: A Highly Effective, Usable, Performant, Scalable Data Reduction Framework for HPC Systems and Applications
  • Award number:
    2232120
  • Fiscal year:
    2023
  • Amount:
    $174,600
  • Project type:
    Standard Grant
Collaborative Research: Frameworks: FZ: A fine-tunable cyberinfrastructure framework to streamline specialized lossy compression development
  • Award number:
    2311876
  • Fiscal year:
    2023
  • Amount:
    $174,600
  • Project type:
    Standard Grant
Collaborative Research: SHF: Small: Reimagining Communication Bottlenecks in GNN Acceleration through Collaborative Locality Enhancement and Compression Co-Design
  • Award number:
    2326495
  • Fiscal year:
    2023
  • Amount:
    $174,600
  • Project type:
    Standard Grant
CAREER: A Highly Effective, Usable, Performant, Scalable Data Reduction Framework for HPC Systems and Applications
  • Award number:
    2312673
  • Fiscal year:
    2023
  • Amount:
    $174,600
  • Project type:
    Standard Grant
CDS&E: Collaborative Research: HyLoC: Objective-driven Adaptive Hybrid Lossy Compression Framework for Extreme-Scale Scientific Applications
  • Award number:
    2303064
  • Fiscal year:
    2022
  • Amount:
    $174,600
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: CEAPA: A Systematic Approach to Minimize Compression Error Propagation in HPC Applications
  • Award number:
    2211539
  • Fiscal year:
    2022
  • Amount:
    $174,600
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: CEAPA: A Systematic Approach to Minimize Compression Error Propagation in HPC Applications
  • Award number:
    2247060
  • Fiscal year:
    2022
  • Amount:
    $174,600
  • Project type:
    Standard Grant
Collaborative Research: Elements: ROCCI: Integrated Cyberinfrastructure for In Situ Lossy Compression Optimization Based on Post Hoc Analysis Requirements
  • Award number:
    2247080
  • Fiscal year:
    2022
  • Amount:
    $174,600
  • Project type:
    Standard Grant
Collaborative Research: Elements: ROCCI: Integrated Cyberinfrastructure for In Situ Lossy Compression Optimization Based on Post Hoc Analysis Requirements
  • Award number:
    2104024
  • Fiscal year:
    2021
  • Amount:
    $174,600
  • Project type:
    Standard Grant
CDS&E: Collaborative Research: HyLoC: Objective-driven Adaptive Hybrid Lossy Compression Framework for Extreme-Scale Scientific Applications
  • Award number:
    2042084
  • Fiscal year:
    2020
  • Amount:
    $174,600
  • Project type:
    Standard Grant

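Similar NSFC (National Natural Science Foundation of China) grants

Molecular basis by which Z8-12:OH and Z8-14:OAc respectively maintain sex-attractant specificity in the oriental fruit moth and the plum fruit moth
  • Award number:
  • Approval year:
    2021
  • Amount:
    CNY 350,000
  • Project type:
    Regional Science Fund Project
Mechanistic study of the photoisomerization of the nitrosyl ruthenium complex [Ru(OAc)(2mqn)2NO]
  • Award number:
    21603131
  • Approval year:
    2016
  • Amount:
    CNY 190,000
  • Project type:
    Young Scientists Fund Project
Study of Mn(OAc)3-promoted radical cascade reactions under mechanochemical conditions
  • Award number:
    21242013
  • Approval year:
    2012
  • Amount:
    CNY 100,000
  • Project type:
    Special Fund Project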

Similar overseas grants

CRII: OAC: A Compressor-Assisted Collective Communication Framework for GPU-Based Large-Scale Deep Learning
  • Award number:
    2348465
  • Fiscal year:
    2024
  • Amount:
    $174,600
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
  • Award number:
    2403312
  • Fiscal year:
    2024
  • Amount:
    $174,600
  • Project type:
    Standard Grant
Collaborative Research: OAC CORE: Federated-Learning-Driven Traffic Event Management for Intelligent Transportation Systems
  • Award number:
    2414474
  • Fiscal year:
    2024
  • Amount:
    $174,600
  • Project type:
    Standard Grant
OAC Core: Cost-Adaptive Monitoring and Real-Time Tuning at Function-Level
  • Award number:
    2402542
  • Fiscal year:
    2024
  • Amount:
    $174,600
  • Project type:
    Standard Grant
OAC Core: OAC Core Projects: GPU Geometric Data Processing
  • Award number:
    2403239
  • Fiscal year:
    2024
  • Amount:
    $174,600
  • Project type:
    Standard Grant
CRII: OAC: Dynamically Adaptive Unstructured Mesh Technologies for High-Order Multiscale Fluid Dynamics Simulations
  • Award number:
    2348394
  • Fiscal year:
    2024
  • Amount:
    $174,600
  • Project type:
    Standard Grant
CRII: OAC: A Multi-fidelity Computational Framework for Discovering Governing Equations Under Uncertainty
  • Award number:
    2348495
  • Fiscal year:
    2024
  • Amount:
    $174,600
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
  • Award number:
    2402947
  • Fiscal year:
    2024
  • Amount:
    $174,600
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
  • Award number:
    2403313
  • Fiscal year:
    2024
  • Amount:
    $174,600
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: Large-Scale Spatial Machine Learning for 3D Surface Topology in Hydrological Applications
  • Award number:
    2414185
  • Fiscal year:
    2024
  • Amount:
    $174,600
  • Project type:
    Standard Grant