Fault-Tolerant Computing for Machine Learning Applications
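Basic information
- Grant number: RGPIN-2020-06884
- Principal investigator:
- Amount: $24,000
- Host institution:
- Host institution country: Canada
- Project category: Discovery Grants Program - Individual
- Fiscal year: 2020
- Funding country: Canada
- Period: 2020-01-01 to 2021-12-31
- Project status: Completed
- Source:
- Keywords:
Project abstract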
Machine learning has been successfully used in consumer applications such as image and speech recognition. In recent years there has been growing interest in adopting machine learning in autonomous systems. To achieve the goal of full autonomy, not only must the lifelong machine learning paradigm come of age, but its guaranteed operation on degradable hardware is also a necessity. Motivated by this, the aim of this project is to investigate how machine learning models can be adapted in a gracefully degradable manner in the presence of hardware faults that arise in the field.
In the early stages of this research program it will be necessary to understand what is unique about how faulty hardware interacts with machine learning applications. In particular, in addition to the set of sequentially-redundant faults, i.e., faults that cannot be excited in any reachable state of digital hardware, machine learning applications are expected to give rise to a large set of application-redundant faults, i.e., faults that cannot affect an observable output under a given set of application constraints, e.g., model parameters during the inference phase. Furthermore, since most machine learning applications have a user-acceptable loss in prediction accuracy, it is equally important to understand which types of hardware faults produce a tolerable versus an intolerable loss in prediction accuracy. Subsequently we will focus on developing novel methods for in-field test, diagnosis and fault tolerance that are specific to the characteristics of machine learning workloads. For example, during the inference phase of a machine learning application, if one of the operands of a hardware multiplier is a constant, then many of the multiplier's internal nets will not be observable; hence hardware faults on those nets will be tolerated. This simple observation can lead to in-depth investigations of how to re-map or re-schedule nodes and operations on the large number of multiplier blocks present in machine learning hardware in order to tolerate a set of known faults. Alternatively, it is also worth investigating how to update the parameters of a machine learning model in order to bypass the faults while guaranteeing a tolerable loss in prediction accuracy. On another line of thought, in reinforcement learning environments used for autonomous systems, one needs to ensure that learning can continue despite the presence of hardware faults. This raises the question of whether existing on-line learning algorithms can be redefined to ensure that model parameters can be adjusted not only to the unique operating environment but also to the faulty hardware.
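The constant-operand observation can be illustrated with a small simulation. The following is a minimal sketch, not the project's method: it assumes a toy 4-bit unsigned array multiplier in which only the activation-side input pin of each AND gate can be faulty, and the names `multiply`, `all_faults`, `worst_case_error` and `application_redundant_faults` are hypothetical. It enumerates single stuck-at faults and reports which ones can never change the product once the weight operand is fixed, together with the worst-case numerical error of the remaining faults.

```python
WIDTH = 4  # operand width in bits; an illustrative choice, not from the abstract


def multiply(a, w, fault=None):
    """Unsigned array multiplier: one AND gate per (i, j) pair, products summed.

    fault = (i, j, stuck) is a hypothetical stuck-at fault that forces the
    input pin carrying bit a_i into AND gate (i, j) to the value `stuck`.
    """
    total = 0
    for i in range(WIDTH):
        for j in range(WIDTH):
            a_pin = (a >> i) & 1
            if fault is not None and (fault[0], fault[1]) == (i, j):
                a_pin = fault[2]  # the fault overrides the real signal
            w_pin = (w >> j) & 1
            total += (a_pin & w_pin) << (i + j)
    return total


def all_faults():
    """Every modeled single stuck-at-0 / stuck-at-1 fault."""
    return [(i, j, s) for i in range(WIDTH) for j in range(WIDTH) for s in (0, 1)]


def worst_case_error(w, fault):
    """Largest absolute output error over all activations a, for a fixed weight w."""
    return max(abs(multiply(a, w, fault) - multiply(a, w)) for a in range(1 << WIDTH))


def application_redundant_faults(w):
    """Faults that cannot change the product for any activation when w is constant."""
    return [f for f in all_faults() if worst_case_error(w, f) == 0]


if __name__ == "__main__":
    w = 0b0101  # example constant weight
    masked = application_redundant_faults(w)
    print(f"weight {w:04b}: {len(masked)} of {len(all_faults())} modeled faults are masked")
```

In this model, every AND gate fed by a zero weight bit outputs 0 regardless of its other input, so both stuck-at faults on that input are masked. The unmasked faults differ widely in worst-case error depending on the bit position they affect, which hints at the tolerable versus intolerable distinction, although numerical error at one multiplier is only a proxy for loss in prediction accuracy.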
As summarized above, machine learning workloads bring new dimensions to the field of fault-tolerant computing. The main focus of this research program is to investigate these new dimensions and to develop fault-tolerant computing methods adaptable to a broad spectrum of hardware architectures.
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Nicolici, Nicola
A Parallel Computing Platform for Real-Time Haptic Interaction with Deformable Bodies
- DOI: 10.1109/toh.2009.50
- Published: 2010-07-01
- Journal:
- Impact factor: 2.9
- Authors: Mafi, Ramin; Sirouspour, Shahin; Nicolici, Nicola
- Corresponding author: Nicolici, Nicola
Other grants held by Nicolici, Nicola
Fault-Tolerant Computing for Machine Learning Applications
- Grant number: RGPIN-2020-06884
- Fiscal year: 2022
- Amount: $24,000
- Project category: Discovery Grants Program - Individual
Fault-Tolerant Computing for Machine Learning Applications
- Grant number: RGPIN-2020-06884
- Fiscal year: 2021
- Amount: $24,000
- Project category: Discovery Grants Program - Individual
Systematic and Structural Methods for Post-Silicon Validation
- Grant number: RGPIN-2015-05312
- Fiscal year: 2019
- Amount: $24,000
- Project category: Discovery Grants Program - Individual
Systematic and Structural Methods for Post-Silicon Validation
- Grant number: RGPIN-2015-05312
- Fiscal year: 2018
- Amount: $24,000
- Project category: Discovery Grants Program - Individual
Systematic and Structural Methods for Post-Silicon Validation
- Grant number: 478097-2015
- Fiscal year: 2017
- Amount: $24,000
- Project category: Discovery Grants Program - Accelerator Supplements
Systematic and Structural Methods for Post-Silicon Validation
- Grant number: RGPIN-2015-05312
- Fiscal year: 2017
- Amount: $24,000
- Project category: Discovery Grants Program - Individual
Systematic and Structural Methods for Post-Silicon Validation
- Grant number: RGPIN-2015-05312
- Fiscal year: 2016
- Amount: $24,000
- Project category: Discovery Grants Program - Individual
Systematic and Structural Methods for Post-Silicon Validation
- Grant number: RGPIN-2015-05312
- Fiscal year: 2015
- Amount: $24,000
- Project category: Discovery Grants Program - Individual
Systematic and Structural Methods for Post-Silicon Validation
- Grant number: 478097-2015
- Fiscal year: 2015
- Amount: $24,000
- Project category: Discovery Grants Program - Accelerator Supplements
Hardware accelerators for biomedical applications
- Grant number: 239003-2010
- Fiscal year: 2014
- Amount: $24,000
- Project category: Discovery Grants Program - Individual
Similar overseas grants
CAREER: Towards Fault-tolerant Edge Computing for Cyber-Physical Systems: Distributed Primitives for Coordination under Cyber Attacks
- Grant number: 2334021
- Fiscal year: 2023
- Amount: $24,000
- Project category: Continuing Grant
CAREER: Towards Fault-tolerant Edge Computing for Cyber-Physical Systems: Distributed Primitives for Coordination under Cyber Attacks
- Grant number: 2238020
- Fiscal year: 2023
- Amount: $24,000
- Project category: Continuing Grant
Fault-Tolerant Computing for Machine Learning Applications
- Grant number: RGPIN-2020-06884
- Fiscal year: 2022
- Amount: $24,000
- Project category: Discovery Grants Program - Individual
CAREER: Noise-Tailored Architectures for Fault-Tolerant Continuous-Variable Quantum Computing
- Grant number: 2145223
- Fiscal year: 2022
- Amount: $24,000
- Project category: Continuing Grant
Realistic fault modelling to enable optimization of low power IoT and Cognitive fault-tolerant computing systems
- Grant number: EP/T026022/1
- Fiscal year: 2021
- Amount: $24,000
- Project category: Research Grant
Realistic fault modelling to enable optimization of low power IoT and Cognitive fault-tolerant computing systems
- Grant number: EP/T023244/1
- Fiscal year: 2021
- Amount: $24,000
- Project category: Research Grant
Silicon-based Fault-Tolerant Quantum Computing
- Grant number: MR/V023284/1
- Fiscal year: 2021
- Amount: $24,000
- Project category: Fellowship
Fault-Tolerant Computing for Machine Learning Applications
- Grant number: RGPIN-2020-06884
- Fiscal year: 2021
- Amount: $24,000
- Project category: Discovery Grants Program - Individual
Fault-tolerant Mobile Agent Computing
- Grant number: 518231-2018
- Fiscal year: 2020
- Amount: $24,000
- Project category: Alexander Graham Bell Canada Graduate Scholarships - Doctoral
Fault-tolerant Mobile Agent Computing
- Grant number: 518231-2018
- Fiscal year: 2019
- Amount: $24,000
- Project category: Alexander Graham Bell Canada Graduate Scholarships - Doctoral