SHF: Small: Collaborative Research: ALETHEIA: A Framework for Automatic Detection/Correction of Corruptions in Extreme Scale Scientific Executions
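Basic information
- Award number: 1617488
- Principal investigator:
- Amount: $250,000
- Host institution:
- Host institution country: United States
- Award type: Standard Grant
- Fiscal year: 2016
- Funding country: United States
- Period: 2016-06-15 to 2021-05-31
- Status: Completed
- Source:
- Keywords:
Project abstract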
Trusting scientific applications requires guaranteeing the validity of computed results. Unfortunately, many scientific computations have produced incorrect results, sometimes with catastrophic consequences. Currently known validation techniques cover only a fraction of the possible corruptions that numerical simulation and data analytics applications may suffer during execution. As science processes grow in size and complexity, the reliability and validity of their constituent steps are increasingly difficult to ascertain. Assessing validity in the presence of potential data corruptions is a serious and insufficiently recognized problem. Corruption may occur at all levels of computing, from the hardware to the application. An important aspect of these corruptions is that until they are discovered, all executions are at risk of being corrupted silently. In some documented cases, months have elapsed between the discovery of a corruption and notification to users. In the meantime, a potentially large number of executions may be corrupted, and incorrect conclusions may result. It may be difficult, after the fact, to check whether executions have actually been corrupted, so even corruptions that do not lead to mistakes may cause significant productivity losses. Virtually all simulations producing very large results need to reduce their data volume in some way before saving it; one such technique is lossy compression. This project strives to validate the end result of the simulation coupled with lossy compression. This approach is useful for scientific simulations in such diverse areas as climate, cosmology, fluid dynamics, weather, and astrophysics, which are the drivers of this project. This collaborative project applies the principle of an external algorithmic observer (EAO), where the product of a scientific application is compared with that of a surrogate function of much lower complexity.
Corruptions are corrected using a variation of triple modular redundancy: if a corruption is detected, a second surrogate function is executed, and the correct value is chosen from the two results that are most in agreement. This new online detection/correction approach involves approximate comparison of the lossy compressed results of the scientific application and the surrogate function. The project explores the detection performance of surrogate functions, lossy compressors, and approximate comparison techniques. The project also explores how to select the surrogate, lossy compression, and approximate functions to optimize objectives and constraints set by the users. The evaluation considers a set of five applications spanning different computational methods, producing large datasets with I/O bottlenecks, and covering a variety of science problem domains relevant to the NSF. In addition to serving the needs of scientists working in the fields listed above, this project will enhance the research experience of undergraduate students. A summer school focused on resilience is planned for summer 2016, and corruption detection/correction will be a major topic. The project is also organizing tutorials in major science conferences that include online detection/correction of numerical simulations.
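The detection/correction scheme described in the abstract can be illustrated with a small sketch. This is not the project's implementation: the uniform-quantization "compressor", the maximum-absolute-difference comparison, the `error_bound` and `tolerance` parameters, and the rule of preferring the application result when it belongs to the best-agreeing pair are all assumptions made for illustration.

```python
from itertools import combinations


def lossy_compress(values, error_bound):
    """Toy stand-in for a lossy compressor (assumption): snap each value
    to a uniform grid whose spacing is 2 * error_bound."""
    step = 2.0 * error_bound
    return [round(v / step) * step for v in values]


def max_abs_diff(a, b):
    """Approximate-comparison metric (assumption): largest pointwise gap."""
    return max(abs(x - y) for x, y in zip(a, b))


def detect_and_correct(app_output, run_surrogate_1, run_surrogate_2,
                       error_bound=0.01, tolerance=0.1):
    """Return (source_name, chosen_values, mismatch_detected).

    The compressed application result is compared with the first surrogate.
    Only on a mismatch is the second surrogate executed; the value is then
    taken from the pair of results that agree most closely.
    """
    results = {
        "application": lossy_compress(app_output, error_bound),
        "surrogate_1": lossy_compress(run_surrogate_1(), error_bound),
    }
    gap = max_abs_diff(results["application"], results["surrogate_1"])
    if gap <= tolerance:
        return "application", results["application"], False

    # Mismatch: run the second surrogate and vote among the three results.
    results["surrogate_2"] = lossy_compress(run_surrogate_2(), error_bound)
    best_pair = min(
        combinations(results, 2),
        key=lambda pair: max_abs_diff(results[pair[0]], results[pair[1]]),
    )
    # Pairs are generated in insertion order, so the application result is
    # preferred whenever it is part of the best-agreeing pair (assumption).
    chosen = best_pair[0]
    return chosen, results[chosen], True


if __name__ == "__main__":
    corrupted = [1.0, 250.0, 3.0]  # second entry simulates a silent corruption
    source, values, flagged = detect_and_correct(
        corrupted,
        lambda: [1.02, 1.98, 3.01],  # hypothetical low-complexity surrogate
        lambda: [0.99, 2.03, 2.97],  # hypothetical second surrogate
    )
    print(source, flagged)
```

The second surrogate runs only after a mismatch, so the common, uncorrupted case pays for one surrogate and one comparison rather than full triple redundancy.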
Project outputs
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Marc Snir
Toward Training a Large 3D Cosmological CNN with Hybrid Parallelization
- DOI:
- Published: 2019
- Journal:
- Impact factor: 0
- Authors: Yosuke Oyama; Naoya Maruyama; Nikoli Dryden; Peter Harrington; Jan Balewski; Satoshi Matsuoka; Marc Snir; Peter Nugent; Brian Van Essen
- Corresponding author: Brian Van Essen
Guest Editorial: Special Issue on Network and Parallel Computing for Emerging Architectures and Applications
- DOI: 10.1007/s10766-019-00634-1
- Published: 2019-03-23
- Journal:
- Impact factor: 0.900
- Authors: Feng Zhang; Jidong Zhai; Marc Snir; Hai Jin; Hironori Kasahara; Mateo Valero
- Corresponding author: Mateo Valero
Exploring the Efficiency of Renewable Energy-based Modular Data Centers at Scale
- DOI:
- Published: 2024
- Journal:
- Impact factor: 0
- Authors: Jinghan Sun; Zibo Gong; Anup Agarwal; Shadi Noghabi; Ranveer Chandra; Marc Snir; Jian Huang
- Corresponding author: Jian Huang
Design and Analysis of the Network Software Stack of an Asynchronous Many-task System -- The LCI parcelport of HPX
- DOI:
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Jiakun Yan; Hartmut Kaiser; Marc Snir
- Corresponding author: Marc Snir
Other grants by Marc Snir
OAC Core: Small: Collaborative Research: Scalable Run-Time for Highly Parallel, Heterogeneous Systems
- Award number: 1908144
- Fiscal year: 2019
- Amount: $250,000
- Award type: Standard Grant
SHF: Medium: Collaborative Research: ECC: Ephemeral Coherence Cohort for I/O Containerization and Disaggregation
- Award number: 1763540
- Fiscal year: 2018
- Amount: $250,000
- Award type: Continuing Grant
XPS: FP: Collaborative Research: Parallel Irregular Programs: From High-Level Specifications to Run-time Optimizations
- Award number: 1337217
- Fiscal year: 2013
- Amount: $250,000
- Award type: Standard Grant
G8 Initiative: Collaborative Research: ECS: Enabling Climate Simulation at Extreme Scale
- Award number: 1062790
- Fiscal year: 2011
- Amount: $250,000
- Award type: Continuing Grant
Deterministic Parallel Programming for High Performance Computing
- Award number: 0833128
- Fiscal year: 2008
- Amount: $250,000
- Award type: Standard Grant
Communication Complexity of Parallel Algorithms
- Award number: 8203307
- Fiscal year: 1982
- Amount: $250,000
- Award type: Standard Grant
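Similar NSFC (National Natural Science Foundation of China) grants
Forensic application of circadian-rhythmic small RNAs in estimating the time of bloodstain formation
- Award number:
- Year approved: 2024
- Amount: CNY 0
- Award type: Provincial/municipal project
Mechanism by which tRNA-derived small RNA upregulates the YBX1/CCL5 pathway in bortezomib-induced chronic pain
- Award number: n/a
- Year approved: 2022
- Amount: CNY 100,000
- Award type: Provincial/municipal project
Small RNA regulation of the type I-F CRISPR-Cas adaptive immune response and its molecular mechanism
- Award number: 32000033
- Year approved: 2020
- Amount: CNY 240,000
- Award type: Young Scientists Fund
Mechanism by which small RNAs regulate the biocontrol function of Bacillus amyloliquefaciens FZB42
- Award number: 31972324
- Year approved: 2019
- Amount: CNY 580,000
- Award type: General Program
Mechanism by which Streptococcus mutans small RNAs link LuxS quorum sensing to biofilm formation
- Award number: 81900988
- Year approved: 2019
- Amount: CNY 210,000
- Award type: Young Scientists Fund
Function and mechanism of key gut-bacterial small RNAs in the onset and progression of Crohn's disease
- Award number: 31870821
- Year approved: 2018
- Amount: CNY 560,000
- Award type: General Program
Molecular mechanism of pigeon milk secretion, analyzed by small RNA sequencing
- Award number: 31802058
- Year approved: 2018
- Amount: CNY 260,000
- Award type: Young Scientists Fund
Pathogenic mechanism of rice grassy stunt virus regulated by small RNA-mediated DNA methylation
- Award number: 31772128
- Year approved: 2017
- Amount: CNY 600,000
- Award type: General Program
Immune regulation mechanism of acupuncture treatment for Hashimoto's thyroiditis, based on small RNA-seq
- Award number: 81704176
- Year approved: 2017
- Amount: CNY 200,000
- Award type: Young Scientists Fund
Regulation of small RNA biosynthesis by rice OsSGS3 and OsHEN1 and its effect on disease resistance
- Award number: 91640114
- Year approved: 2016
- Amount: CNY 850,000
- Award type: Major Research Plan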
Similar overseas grants
Collaborative Research: SHF: Small: LEGAS: Learning Evolving Graphs At Scale
- Award number: 2331302
- Fiscal year: 2024
- Amount: $250,000
- Award type: Standard Grant
Collaborative Research: SHF: Small: LEGAS: Learning Evolving Graphs At Scale
- Award number: 2331301
- Fiscal year: 2024
- Amount: $250,000
- Award type: Standard Grant
Collaborative Research: SHF: Small: Efficient and Scalable Privacy-Preserving Neural Network Inference based on Ciphertext-Ciphertext Fully Homomorphic Encryption
- Award number: 2412357
- Fiscal year: 2024
- Amount: $250,000
- Award type: Standard Grant
Collaborative Research: SHF: Small: Technical Debt Management in Dynamic and Distributed Systems
- Award number: 2232720
- Fiscal year: 2023
- Amount: $250,000
- Award type: Standard Grant
Collaborative Research: SHF: Small: Quasi Weightless Neural Networks for Energy-Efficient Machine Learning on the Edge
- Award number: 2326895
- Fiscal year: 2023
- Amount: $250,000
- Award type: Standard Grant
Collaborative Research: SHF: Small: Enabling Efficient 3D Perception: An Architecture-Algorithm Co-Design Approach
- Award number: 2334624
- Fiscal year: 2023
- Amount: $250,000
- Award type: Standard Grant
Collaborative Research: SHF: Small: Sub-millisecond Topological Feature Extractor for High-Rate Machine Learning
- Award number: 2234921
- Fiscal year: 2023
- Amount: $250,000
- Award type: Standard Grant
Collaborative Research: SHF: Small: Reimagining Communication Bottlenecks in GNN Acceleration through Collaborative Locality Enhancement and Compression Co-Design
- Award number: 2326494
- Fiscal year: 2023
- Amount: $250,000
- Award type: Standard Grant
Collaborative Research: SHF: Small: Quasi Weightless Neural Networks for Energy-Efficient Machine Learning on the Edge
- Award number: 2326894
- Fiscal year: 2023
- Amount: $250,000
- Award type: Standard Grant
Collaborative Research: SHF: Small: Sub-millisecond Topological Feature Extractor for High-Rate Machine Learning
- Award number: 2234920
- Fiscal year: 2023
- Amount: $250,000
- Award type: Standard Grant