CAREER: Rethinking HPC Resilience in the Exascale Era

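Basic Information

  • Award number:
    2001124
  • Principal investigator:
  • Amount:
    $406,700
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Continuing Grant
  • Fiscal year:
    2019
  • Funding country:
    United States
  • Project period:
    2019-08-12 to 2024-12-31
  • Project status:
    Completed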

Project Abstract

Resilience is one of the key exascale research challenges in high-performance computing (HPC). Because of much higher error rates, exascale supercomputers could make little progress in computations, or might generate incorrect results due to failures, rendering the exascale performance useless. The challenge is how to achieve complete HPC resilience at exascale without increasing the performance overhead, the power consumption, or the complexity of the underlying hardware. To this end, this research project designs and develops low-cost hardware/software cooperative techniques for HPC resilience in the exascale era.

This project involves four research goals: (1) low-cost soft error resilience for CPUs; intelligent compiler-architecture interaction can validate the absence of errors and perform fine-grained recovery, thus eliminating silent data corruption (SDC). (2) compiler-directed soft error resilience for commodity GPUs; it can remove the power-hungry error-correcting code (ECC) logic from the GPU register files without compromising their resilience. (3) lightweight nonvolatile memory (NVM) persistence; it can mitigate the overhead of traditional heavyweight HPC checkpointing and support whole-system persistence for applications without irrevocable operations. (4) low-cost timing error resilience for aggressive voltage scaling, to maximize energy efficiency with a program correctness guarantee.

The resulting artifacts and technologies are expected to contribute to the nation's competitiveness by addressing the challenge of building reliable HPC systems. The research outcomes affect a broad range of disciplines that need correct computation results and therefore require reliable computing systems, from embedded systems to the HPC cloud. Consequently, use of the proposed techniques will make the execution of current and emerging applications much more reliable, and therefore directly affect our way of life.

Four types of data will be generated from this research project: (1) algorithms and models, (2) software prototypes, (3) testing infrastructure, including simulators, evaluation benchmarks, and their traces, and (4) educational materials. All of our software tools will be open source and made available to the public, laboratories, and industry.

Project Outcomes

Journal papers (16)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Compiler-directed soft error resilience for lightweight GPU register file protection
Unbounded Hardware Transactional Memory for a Hybrid DRAM/NVM Memory System
Capri: Compiler and Architecture Support for Whole-System Persistence
Persistent Processor Architecture
ReplayCache: Enabling Volatile Caches for Energy Harvesting Systems
  • DOI:
    10.1145/3466752.3480102
  • Publication date:
    2021-10
  • Journal:
  • Impact factor:
    0
  • Authors:
    Jianping Zeng;Jongouk Choi;Xinwei Fu;Ajay Paddayuru Shreepathi;Dongyoon Lee;Changwoo Min;Changhee Jung
  • Corresponding authors:
    Jianping Zeng;Jongouk Choi;Xinwei Fu;Ajay Paddayuru Shreepathi;Dongyoon Lee;Changwoo Min;Changhee Jung

Other publications by Changhee Jung

Low-cost soft error resilience with unified data verification and fine-grained recovery for acoustic sensor based detection
  • DOI:
  • Publication date:
    2016
  • Journal:
  • Impact factor:
    0
  • Authors:
    Qingrui Liu;Changhee Jung;Dongyoon Lee;Devesh Tiwari
  • Corresponding author:
    Devesh Tiwari
Adaptive execution techniques of parallel programs for multiprocessors
  • DOI:
  • Publication date:
    2010
  • Journal:
  • Impact factor:
    0
  • Authors:
    Jaejin Lee;Jungho Park;Honggyu Kim;Changhee Jung;Daeseob Lim;Sang
  • Corresponding author:
    Sang
ProRace
  • DOI:
    10.1145/3093336.3037708
  • Publication date:
    2017
  • Journal:
  • Impact factor:
    0
  • Authors:
    Tong Zhang;Changhee Jung;Dongyoon Lee
  • Corresponding author:
    Dongyoon Lee
CommAnalyzer: Automated Estimation of Communication Cost on HPC Clusters Using Sequential Code
  • DOI:
  • Publication date:
    2017
  • Journal:
  • Impact factor:
    0
  • Authors:
    A. Helal;Changhee Jung;Wu;Y. Hanafy
  • Corresponding author:
    Y. Hanafy
Soft Error Resilience at Near-Zero Cost

Other grants by Changhee Jung

Collaborative Research: CSR: Small: Caphammer: A New Security Exploit in Energy Harvesting Systems and its Countermeasures
  • Award number:
    2314681
  • Fiscal year:
    2023
  • Award amount:
    $406,700
  • Project type:
    Continuing Grant
Collaborative Research: SHF: Small: Enabling Caches and GPUs for Energy Harvesting Systems
  • Award number:
    2153749
  • Fiscal year:
    2022
  • Award amount:
    $406,700
  • Project type:
    Standard Grant
CAREER: Rethinking HPC Resilience in the Exascale Era
  • Award number:
    1750503
  • Fiscal year:
    2018
  • Award amount:
    $406,700
  • Project type:
    Continuing Grant
SHF: Small: Compiler and Architectural Techniques for Soft Error Resilience
  • Award number:
    1527463
  • Fiscal year:
    2015
  • Award amount:
    $406,700
  • Project type:
    Standard Grant

Similar overseas grants

Care and Repair: Rethinking Contemporary Curation for Conditions of Crisis
  • Award number:
    DP240102206
  • Fiscal year:
    2024
  • Award amount:
    $406,700
  • Project type:
    Discovery Projects
A Brave New World for Japanese Shakespeare Adaptations: Rethinking Shakespeare Studies through Adaptations
  • Award number:
    23K21920
  • Fiscal year:
    2024
  • Award amount:
    $406,700
  • Project type:
    Grant-in-Aid for Scientific Research (B)
PROTSENS Rethinking Alternative PROTein Extraction: Decoding SENsory-Protein Extraction Relationships
  • Award number:
    EP/Z000785/1
  • Fiscal year:
    2024
  • Award amount:
    $406,700
  • Project type:
    Fellowship
Caring Communities 1800-present: Rethinking Children's Social Care
  • Award number:
    MR/X034968/1
  • Fiscal year:
    2024
  • Award amount:
    $406,700
  • Project type:
    Fellowship
High-rise landscapes: The afterlives of tower block 'failure' and rethinking urban futures
  • Award number:
    MR/Y003586/1
  • Fiscal year:
    2024
  • Award amount:
    $406,700
  • Project type:
    Fellowship
CAREER: Rethinking Spiking Neural Networks from a Dynamical System Perspective
  • Award number:
    2337646
  • Fiscal year:
    2024
  • Award amount:
    $406,700
  • Project type:
    Continuing Grant
CAREER: A multimethod approach to rethinking the dynamics of inhibitory control under stress
  • Award number:
    2338789
  • Fiscal year:
    2024
  • Award amount:
    $406,700
  • Project type:
    Continuing Grant
Rethinking Mao’s China from a Global Economic Perspective: A History
  • Award number:
    DE240100091
  • Fiscal year:
    2024
  • Award amount:
    $406,700
  • Project type:
    Discovery Early Career Researcher Award
CAREER: Rethinking System Stack for the Load-Store I/O Era
  • Award number:
    2339901
  • Fiscal year:
    2024
  • Award amount:
    $406,700
  • Project type:
    Continuing Grant
Rethinking Antarctic Sea Level Projections (RASP)
  • Award number:
    NE/Y001451/1
  • Fiscal year:
    2024
  • Award amount:
    $406,700
  • Project type:
    Research Grant