Application Based Fault Tolerance in High Performance Computing Applications
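Basic information
- Grant number: 1834202
- Principal investigator:
- Amount: --
- Host institution:
- Host institution country: United Kingdom
- Project category: Studentship
- Fiscal year: 2017
- Funding country: United Kingdom
- Duration: 2017 to (no data)
- Project status: Completed
- Source:
- Keywords:
Project abstract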
As we move towards Exascale systems, the probability of faults occurring increases with the number of components in the system. Some of these faults, such as Soft Errors (SE), can introduce noise into the data which, depending on the system, may be impossible to detect or correct, so the computation is corrupted and can return invalid results. The most common Fault Tolerance (FT) techniques implemented in hardware use Error-Correcting Codes (ECC), which correct any detected error that falls within the code's correction capability. A hardware implementation minimises the runtime performance overhead at the cost of additional hardware complexity and memory bandwidth. These implementations also need more energy for the additional computations and memory transfers, and as current supercomputers already consume over 10 MW (enough to power a small town), removing this additional hardware can improve the energy efficiency of Exascale systems.
We investigate high-performance software alternatives to hardware FT techniques. Their distinct advantage is that they do not require the additional hardware, which will prove highly beneficial at Exascale. Application Based Fault Tolerance (ABFT) techniques also bring more flexibility to how faults are dealt with when they occur, and this can lead to much greater performance. By investigating common High Performance Computing (HPC) computation and communication patterns we derive new methods for FT. ABFT techniques allow the application to decide whether a particular error needs to be corrected or can be ignored. For example, after a bit flip in the less significant bits of the mantissa of a double-precision floating-point number, an iterative algorithm may still converge to a correct value after a few iterations, so error correction is not required. ABFT can also be used to provide FT on hardware that does not provide ECC capabilities, such as embedded processors or consumer GPUs. Even if the hardware does provide FT, it can usually be turned off, and using ABFT instead would free up the resources required by the hardware, such as memory and memory bandwidth, which would improve the application's performance.
In particular, we investigate new ABFT techniques for the HPC dwarfs and apply information and coding theory to derive innovative methods that can detect and correct (multiple) errors. We then look into techniques that apply to a subset of the dwarfs, which are highly optimised and can be included in a software library so that they are ready to use out of the box. A big benefit of hardware FT techniques is that they do not require users to change their code in order to protect their application from faults, whereas ABFT techniques are application dependent and often require changes to the application's source code. To mitigate this problem we investigate automatic detection of fault-vulnerable sections of the application, using techniques such as machine learning, and then apply the ABFT techniques during compilation of the application. This approach minimises the effort required from the programmer to adopt these FT techniques and make their application fault tolerant.
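The abstract's claim about low-order mantissa bit flips can be illustrated with a small experiment. The following is a minimal sketch, not the project's method: it assumes a Jacobi solver as the iterative algorithm and a tiny diagonally dominant system chosen for illustration, and the names `flip_bit`, `jacobi`, `A` and `B` are hypothetical. It injects a single bit flip into one solution component part-way through the iteration and compares the result with a fault-free run.

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of the IEEE 754 binary64 encoding of x.

    Bits 0-51 are the mantissa (bit 0 is least significant),
    bits 52-62 the exponent, bit 63 the sign.
    """
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y

def jacobi(a, b, x, iterations, fault_at=None, fault_bit=0):
    """Run Jacobi iterations for a @ x = b.

    If fault_at is given, one bit of x[0] is flipped after that
    iteration to imitate a soft error in memory.
    """
    n = len(b)
    for k in range(iterations):
        x = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
        if k == fault_at:
            x[0] = flip_bit(x[0], fault_bit)
    return x

# Illustrative diagonally dominant system whose exact solution is [1, 1, 1].
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
B = [5.0, 6.0, 5.0]

clean = jacobi(A, B, [0.0, 0.0, 0.0], 60)
faulty = jacobi(A, B, [0.0, 0.0, 0.0], 60, fault_at=5, fault_bit=3)

print("fault-free:", clean)
print("with fault:", faulty)
print("max difference:", max(abs(c - f) for c, f in zip(clean, faulty)))
```

In this setting the perturbation is contracted by every later iteration, so both runs end up at the same solution to within rounding. A flip in the exponent or sign bits, or in data that the algorithm never recomputes, is a different case and is where detection and correction would still be needed.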
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)