Collaborative Research: OAC Core: CEAPA: A Systematic Approach to Minimize Compression Error Propagation in HPC Applications
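Basic information
- Award number: 2211538
- Principal investigator:
- Amount: $350,000
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2022
- Funding country: United States
- Project period: 2022-08-15 to 2025-07-31
- Project status: Ongoing
- Source:
- Keywords: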
Project abstract
Today’s high-performance computing (HPC) applications produce vast volumes of data for post-analysis, placing a major storage and I/O burden on HPC systems. To reduce this burden significantly, researchers have explored the use of lossy compression techniques. While lossy compression can effectively reduce the size of data, it also introduces errors into the compressed data that often lead to incorrect computation results. As a result, scientists hesitate to use lossy compression in their research. There is therefore a critical need for an effective method to identify compression strategies that minimize error impact across a diversity of programs. This project aims to develop a systematic approach that helps scientists automatically select the lossy compression algorithm with the lowest error impact based on their HPC programs and target compression ratios. It also integrates educational and outreach activities, including student training and the development of new curriculum on trustworthy data reduction and dependable HPC systems. Modeling compression error propagation in HPC programs is challenging because existing lossy compressors are built on distinct principles that generate very different compression errors on diverse HPC data.
This project includes four key thrusts: (1) developing an accurate and efficient fault injection infrastructure that integrates the fault models of commonly used lossy compression algorithms; (2) designing a fine-grained approach to characterize error propagation in HPC programs through program analysis and decomposition based on the data dependencies and life cycle of compressed data; (3) developing a predictive model using machine learning techniques to select a compression strategy that minimizes the error impact for a given program and compression ratio; and (4) integrating the technique with domain-specific error impact metrics in real-world HPC applications and demonstrating its effectiveness by selecting compression strategies that give low error impact for the same ratios. Not only does this project have an enormous positive impact on HPC cyberinfrastructure, but it also helps redefine the optimization of lossy compression techniques with an emphasis on both efficiency and error impact. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project outcomes
Journal papers (2)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
A Feature-Driven Fixed-Ratio Lossy Compression Framework for Real-World Scientific Datasets
- DOI: 10.1109/icde55515.2023.00116
- Published: 2023-04
- Journal:
- Impact factor: 0
- Authors: Md. Hasanur Rahman; S. Di; Kai Zhao; Robert Underwood; Guanpeng Li; F. Cappello
- Corresponding authors: Md. Hasanur Rahman; S. Di; Kai Zhao; Robert Underwood; Guanpeng Li; F. Cappello
Other publications by Guanpeng Li
Towards analytically evaluating the error resilience of GPU Programs
- DOI:
- Published: 2019
- Journal:
- Impact factor: 0
- Authors: Abdul Rehman Anwer; Guanpeng Li; K. Pattabiraman; Siva Kumar Sastry Hari; Michael B. Sullivan; Timothy Tsai
- Corresponding author: Timothy Tsai
Understanding Error Propagation in GPGPU Applications
- DOI: 10.1109/sc.2016.20
- Published: 2016
- Journal:
- Impact factor: 0
- Authors: Guanpeng Li; K. Pattabiraman; Chen; P. Bose
- Corresponding author: P. Bose
A Low-cost Fault Corrector for Deep Neural Networks through Range Restriction
- DOI: 10.1109/dsn48987.2021.00018
- Published: 2020
- Journal:
- Impact factor: 0
- Authors: Zitao Chen; Guanpeng Li; K. Pattabiraman
- Corresponding author: K. Pattabiraman
Fine-Grained Characterization of Faults Causing Long Latency Crashes in Programs
- DOI: 10.1109/dsn.2015.36
- Published: 2015
- Journal:
- Impact factor: 0
- Authors: Guanpeng Li; Qining Lu; K. Pattabiraman
- Corresponding author: K. Pattabiraman
Evaluating Compiler IR-Level Selective Instruction Duplication with Realistic Hardware Errors
- DOI: 10.1109/ftxs49593.2019.00010
- Published: 2019
- Journal:
- Impact factor: 0
- Authors: Chun; Guanpeng Li; M. Erez
- Corresponding author: M. Erez
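Similar NSFC (National Natural Science Foundation of China) grants
Research on Quantum Field Theory without a Lagrangian Description
- Award number: 24ZR1403900
- Approval year: 2024
- Funding amount: CNY 0
- Project type: Provincial/municipal project
Cell Research
- Award number: 31224802
- Approval year: 2012
- Funding amount: CNY 240,000
- Project type: Special Fund Project
Cell Research
- Award number: 31024804
- Approval year: 2010
- Funding amount: CNY 240,000
- Project type: Special Fund Project
Cell Research
- Award number: 30824808
- Approval year: 2008
- Funding amount: CNY 240,000
- Project type: Special Fund Project
Research on the Rapid Growth Mechanism of KDP Crystal
- Award number: 10774081
- Approval year: 2007
- Funding amount: CNY 450,000
- Project type: General Program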
Similar overseas grants
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
- Award number: 2403312
- Fiscal year: 2024
- Funding amount: $350,000
- Project type: Standard Grant
Collaborative Research: OAC CORE: Federated-Learning-Driven Traffic Event Management for Intelligent Transportation Systems
- Award number: 2414474
- Fiscal year: 2024
- Funding amount: $350,000
- Project type: Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
- Award number: 2402947
- Fiscal year: 2024
- Funding amount: $350,000
- Project type: Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
- Award number: 2403313
- Fiscal year: 2024
- Funding amount: $350,000
- Project type: Standard Grant
Collaborative Research: OAC Core: Large-Scale Spatial Machine Learning for 3D Surface Topology in Hydrological Applications
- Award number: 2414185
- Fiscal year: 2024
- Funding amount: $350,000
- Project type: Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
- Award number: 2402946
- Fiscal year: 2024
- Funding amount: $350,000
- Project type: Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
- Award number: 2403088
- Fiscal year: 2024
- Funding amount: $350,000
- Project type: Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
- Award number: 2403090
- Fiscal year: 2024
- Funding amount: $350,000
- Project type: Standard Grant
Collaborative Research: OAC: Core: Harvesting Idle Resources Safely and Timely for Large-scale AI Applications in High-Performance Computing Systems
- Award number: 2403399
- Fiscal year: 2024
- Funding amount: $350,000
- Project type: Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
- Award number: 2403089
- Fiscal year: 2024
- Funding amount: $350,000
- Project type: Standard Grant