Improving Systems Reliability Through Record/Replay
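Basic information
- Grant number: RGPIN-2018-04512
- Principal investigator:
- Amount: $20,400
- Host institution:
- Host institution country: Canada
- Program: Discovery Grants Program - Individual
- Fiscal year: 2022
- Funding country: Canada
- Period: 2022-01-01 to 2023-12-31
- Project status: Completed
- Source:
- Keywords:
Project abstract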
Software bugs significantly hamper software reliability, leading to service outages, data loss, and security vulnerabilities. Unfortunately, most production software failures go undiagnosed. This occurs for two reasons. First, the sheer number of crashes in widely used software is overwhelming: for example, Microsoft reported seeing over a million crashes a day in one recent publication, and staff at Mozilla reported to us that they see tens of thousands of crashes a day. Second, current crash reporting technologies, such as core dumps and stack traces, provide very little context. Consequently, automatically grouping bugs (i.e., identifying the set of crashes caused by the same bug), diagnosing them (e.g., determining that this is a concurrency bug, a buffer overflow, etc.), and prioritizing them (e.g., determining whether a bug is potentially exploitable) has been impractical.
To address this, we have been building a toolchain, called Castor, that allows developers to record and replay production applications. Castor has several goals. First, we are attempting to provide a platform for application record/replay with low enough performance and space overhead to make it practical to leave on by default in production, capturing bugs when they occur. In prior work, we have shown that it is possible to record even the most demanding multi-core scientific and server workloads with CPU overheads of a few percent. Moving forward, we hope to demonstrate new techniques to achieve this with modest space overheads and to explore new approaches to bug triage and diagnosis that this can enable.
The primary objective of this research proposal is to improve developer tools such that no crash is left undiagnosed. We want to continue building on our record/replay work, improving its core capabilities in terms of usability, performance, and space efficiency, and to use it in conjunction with dynamic analysis, program slicing, and other techniques to develop better tools for grouping, diagnosing, and prioritizing bugs automatically.
A secondary objective is to explore alternative uses of record/replay. First, we will study techniques to generate a cycle-accurate replay good enough for any performance analysis tool. Second, we want to investigate helping developers automatically create fault-tolerant processes that tolerate non-fail-stop errors. Relatedly, we can extend these ideas to make fault-tolerant software robust to exploits with little to no overhead.
This research program will contribute to my broader research goal of improving systems reliability in a world ever more dependent on computers. We will have the opportunity to make an impact on the way software is developed and maintained. It will also provide a rich training ground for HQP in a field that is in high demand in Canada.
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants held by Mashtizadeh, Ali
Improving Systems Reliability Through Record/Replay
- Grant number: RGPIN-2018-04512
- Fiscal year: 2021
- Funding amount: $20,400
- Program: Discovery Grants Program - Individual
Improving Systems Reliability Through Record/Replay
- Grant number: RGPIN-2018-04512
- Fiscal year: 2020
- Funding amount: $20,400
- Program: Discovery Grants Program - Individual
Continuous performance optimization on Modern Architectures
- Grant number: 536639-2018
- Fiscal year: 2020
- Funding amount: $20,400
- Program: Collaborative Research and Development Grants
Improving Systems Reliability Through Record/Replay
- Grant number: RGPIN-2018-04512
- Fiscal year: 2019
- Funding amount: $20,400
- Program: Discovery Grants Program - Individual
Continuous performance optimization on Modern Architectures
- Grant number: 536639-2018
- Fiscal year: 2019
- Funding amount: $20,400
- Program: Collaborative Research and Development Grants
Improving Systems Reliability Through Record/Replay
- Grant number: DGECR-2018-00321
- Fiscal year: 2018
- Funding amount: $20,400
- Program: Discovery Launch Supplement
Improving Systems Reliability Through Record/Replay
- Grant number: RGPIN-2018-04512
- Fiscal year: 2018
- Funding amount: $20,400
- Program: Discovery Grants Program - Individual
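Similar grants from the National Natural Science Foundation of China
Graphon mean field games with partial observation and application to failure detection in distributed systems
- Grant number:
- Approval year: 2025
- Funding amount: CNY 0
- Program: Provincial/municipal project
Mechanism by which Guilu Erxian Jiao regulates the HIF-1α/System Xc- pathway to inhibit ferroptosis in the treatment of oligoasthenozoospermia, based on the theory that "yang transforms qi, yin gives form"
- Grant number:
- Approval year: 2024
- Funding amount: CNY 150,000
- Program: Provincial/municipal project
Estimating Large Demand Systems with Machine Learning Techniques
- Grant number:
- Approval year: 2024
- Funding amount:
- Program: Research fund for foreign scholars
Understanding complicated gravitational physics by simple two-shell systems
- Grant number: 12005059
- Approval year: 2020
- Funding amount: CNY 240,000
- Program: Young Scientists Fund
Simulation and certification of the ground state of many-body systems on quantum simulators
- Grant number:
- Approval year: 2020
- Funding amount: CNY 400,000
- Program:
Genome-wide systems mapping of the genetic mechanisms of interspecies interactions among three bacterial species
- Grant number: 31971398
- Approval year: 2019
- Funding amount: CNY 580,000
- Program: General Program
The formation and evolution of planetary systems in dense star clusters
- Grant number: 11043007
- Approval year: 2010
- Funding amount: CNY 100,000
- Program: Special Fund project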
Similar overseas grants
A Synchrophasor-Assisted Control Framework for Improving Power Quality, Reliability, and Resiliency of Modern Power Systems
- Grant number: RGPIN-2021-02940
- Fiscal year: 2022
- Funding amount: $20,400
- Program: Discovery Grants Program - Individual
Improving Systems Reliability Through Record/Replay
- Grant number: RGPIN-2018-04512
- Fiscal year: 2021
- Funding amount: $20,400
- Program: Discovery Grants Program - Individual
A Synchrophasor-Assisted Control Framework for Improving Power Quality, Reliability, and Resiliency of Modern Power Systems
- Grant number: RGPIN-2021-02940
- Fiscal year: 2021
- Funding amount: $20,400
- Program: Discovery Grants Program - Individual
Improving Systems Reliability Through Record/Replay
- Grant number: RGPIN-2018-04512
- Fiscal year: 2020
- Funding amount: $20,400
- Program: Discovery Grants Program - Individual
Collaborative Research: Improving Energy Reliability by Co-Optimization Planning for Interdependent Electricity and Natural Gas Infrastructure Systems
- Grant number: 1906780
- Fiscal year: 2019
- Funding amount: $20,400
- Program: Standard Grant
Improving Systems Reliability Through Record/Replay
- Grant number: RGPIN-2018-04512
- Fiscal year: 2019
- Funding amount: $20,400
- Program: Discovery Grants Program - Individual
Improving Systems Reliability Through Record/Replay
- Grant number: DGECR-2018-00321
- Fiscal year: 2018
- Funding amount: $20,400
- Program: Discovery Launch Supplement
Improving Systems Reliability Through Record/Replay
- Grant number: RGPIN-2018-04512
- Fiscal year: 2018
- Funding amount: $20,400
- Program: Discovery Grants Program - Individual
Collaborative Research: Improving Energy Reliability by Co-Optimization Planning for Interdependent Electricity and Natural Gas Infrastructure Systems
- Grant number: 1635472
- Fiscal year: 2017
- Funding amount: $20,400
- Program: Standard Grant
Collaborative Research: Improving Energy Reliability by Co-Optimization Planning for Interdependent Electricity and Natural Gas Infrastructure Systems
- Grant number: 1635339
- Fiscal year: 2017
- Funding amount: $20,400
- Program: Standard Grant