EAGER: Recomputation-Based Checkpointing for Sparse Matrices



Project abstract

High-performance computing (HPC) is essential for maintaining the US international competitive edge and leadership in science, technology, engineering, and mathematics (STEM). Advances in HPC are vital to national interests because they provide infrastructure for scientific discovery that improves the national health, prosperity, welfare, and defense. To solve large-scale scientific problems, HPC relies on an increasing number of nodes and components, which makes it more likely that a long-running computation will be interrupted by a failure before it completes. A critical technique for ensuring that a computation completes is checkpointing. Checkpointing saves snapshots of the computation so that, when a failure occurs, the computation state can be restored from the last snapshot and execution can continue, rather than restarting from the beginning. The research in this project seeks to advance state-of-the-art checkpointing by making it significantly faster and lowering its cost. This project also plans to contribute to the training of the future workforce by exposing students to the mechanisms and inefficiencies of current checkpointing on non-volatile main memory (NVMM) and to the new in-place checkpointing. The project seeks to increase participation of minority and under-represented groups and involves undergraduates in research.

Prior approaches to checkpointing rely on taking a snapshot of the system state (system-level checkpointing) or the application state (application-level checkpointing) and saving it to secondary non-volatile storage. With the advent of NVMM, a new approach to checkpointing becomes possible. In contrast to traditional approaches, which store separate snapshots in separate secondary storage, the project uses a new approach in which checkpoints can be constructed in place in the NVMM, using the working data structures that the applications already use. This requires only minimal additional state beyond what the program already saves to memory, making checkpointing significantly faster and cheaper, which in turn enables further HPC scaling.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)


Other publications by Yan Solihin

Avoiding TLB Shootdowns Through Self-Invalidating TLB Entries
Analytically modeling the memory hierarchy performance of modern processor systems
  • Year published: 2011
  • Impact factor: 0
  • Authors: Yan Solihin; Fang Liu
  • Corresponding author: Fang Liu
A study of personal authentication using pinna transfer functions and pinna images (耳介伝達関数および耳介画像を用いた個人認証についての検討)
  • Year published: 2020
  • Impact factor: 0
  • Authors: Reem Elkhouly; Mohammad Alshboul; Akihiro Hayashi; Yan Solihin; Keiji Kimura; 井谷俊仁,喜多俊輔 梶川嘉延
  • Corresponding author: 井谷俊仁,喜多俊輔 梶川嘉延
Persistent Memory: Abstractions, Abstractions, and Abstractions
  • DOI: 10.1109/mm.2018.2885589
  • Year published: 2019
  • Impact factor: 3.6
  • Authors: Yan Solihin
  • Corresponding author: Yan Solihin
Helper thread prefetching for loosely-coupled multiprocessor systems


Other grants awarded to Yan Solihin

Collaborative Research: CSR: Medium: Scaling Secure Serverless Computing on Heterogeneous Datacenters
  • Award number: 2312206
  • Fiscal year: 2023
  • Funding amount: $236,200
  • Project category: Continuing Grant
Collaborative Research: CNS Core: Medium: Understanding and Strengthening Memory Security for Non-Volatile Memory
  • Award number: 2106629
  • Fiscal year: 2021
  • Funding amount: $236,200
  • Project category: Continuing Grant
Collaborative Research: PPoSS: Planning: Scaling Secure Serverless Computing on Hetergeneous Datacenters
  • Award number: 2028836
  • Fiscal year: 2020
  • Funding amount: $236,200
  • Project category: Standard Grant
SHF: Small: Collaborative Research: Efficient Memory Persistency for GPUs
  • Award number: 1908079
  • Fiscal year: 2019
  • Funding amount: $236,200
  • Project category: Standard Grant
CNS Core: Medium: Collaborative Research: Persistent memory objects for consistent sharing in Non-Volatile Main Memories
  • Award number: 1900724
  • Fiscal year: 2019
  • Funding amount: $236,200
  • Project category: Continuing Grant
EAGER: Recomputation-Based Checkpointing for Sparse Matrices
  • Award number: 1829142
  • Fiscal year: 2018
  • Funding amount: $236,200
  • Project category: Standard Grant
SI2-SSE: TLDS: Transactional Lock-Free Data Structures
  • Award number: 1740095
  • Fiscal year: 2017
  • Funding amount: $236,200
  • Project category: Standard Grant
SHF: Small: Towards a Versatile Analytical Modeling Toolset for Evaluating Memory Hierarchy Design
  • Award number: 1116540
  • Fiscal year: 2011
  • Funding amount: $236,200
  • Project category: Standard Grant
SHF: Small: Collaborative Research: Beyond Secure Processors - Securing Systems Against Hardware
  • Award number: 0915501
  • Fiscal year: 2009
  • Funding amount: $236,200
  • Project category: Standard Grant
CSR: Small: Efficient and Predictable Memory Hierarchies for High-Performance Embedded Systems
  • Award number: 0915503
  • Fiscal year: 2009
  • Funding amount: $236,200
  • Project category: Standard Grant

Similar overseas grants

EAGER: Recomputation-Based Checkpointing for Sparse Matrices
  • Award number: 1829142
  • Fiscal year: 2018
  • Funding amount: $236,200
  • Project category: Standard Grant