Improving Systems Reliability Through Record/Replay

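Basic Information

  • Grant number:
    RGPIN-2018-04512
  • Principal investigator:
  • Amount:
    $20,400
  • Host institution:
  • Host institution country:
    Canada
  • Program type:
    Discovery Grants Program - Individual
  • Fiscal year:
    2018
  • Funding country:
    Canada
  • Project period:
    2018-01-01 to 2019-12-31
  • Project status:
    Completed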

Project Abstract

Software bugs significantly hamper software reliability, leading to service outages, data loss, and security vulnerabilities. Unfortunately, most production software failures go undiagnosed. This occurs for two reasons. The first is the sheer number of crashes in widely used software: for example, Microsoft reported seeing over a million crashes a day in one recent publication, and staff at Mozilla reported to us that they see tens of thousands of crashes a day. Second, current crash reporting technologies, such as core dumps and stack traces, provide very little context. Consequently, automatically grouping crashes (i.e., identifying the set of crashes caused by the same bug), diagnosing them (e.g., as a concurrency bug, buffer overflow, etc.), and prioritizing bugs (e.g., by whether a bug is potentially exploitable) has been impractical.

To address this, we have been building Castor, a toolchain that allows developers to record and replay production applications. Castor has several goals. First, we are attempting to provide a platform for application record/replay with low enough performance and space overhead to make it practical to leave on by default in production, capturing bugs when they occur. In prior work, we have shown that it is possible to record even the most demanding multi-core scientific and server workloads with CPU overheads of a few percent. Moving forward, we hope to demonstrate new techniques to achieve this with modest space overheads and to explore the new approaches to bug triage and diagnosis that this can enable.

The primary objective of this research proposal is to improve developer tools such that no crash is left undiagnosed. We want to continue building on our record/replay work, improving its core capabilities in terms of usability, performance, and space efficiency, and to use it in conjunction with dynamic analysis, program slicing, and other techniques to develop better tools for grouping, diagnosing, and prioritizing bugs automatically.

A secondary objective is to explore alternative uses of record/replay. First, we will study techniques to generate a cycle-accurate replay good enough for any performance analysis tool. Second, we want to investigate helping developers automatically create fault-tolerant processes that tolerate non-fail-stop errors. Relatedly, we can extend these ideas to make fault-tolerant software robust to exploits with little to no overhead.

This research program will contribute to my broader research goal of improving systems reliability in a world ever more dependent on computers. We will have the opportunity to make an impact on the way software is developed and maintained. It will also provide a rich training ground for highly qualified personnel (HQP) in a field that is in high demand in Canada.

Project Outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other Grants by Mashtizadeh, Ali

Improving Systems Reliability Through Record/Replay
  • Grant number:
    RGPIN-2018-04512
  • Fiscal year:
    2022
  • Funding amount:
    $20,400
  • Program type:
    Discovery Grants Program - Individual
Improving Systems Reliability Through Record/Replay
  • Grant number:
    RGPIN-2018-04512
  • Fiscal year:
    2021
  • Funding amount:
    $20,400
  • Program type:
    Discovery Grants Program - Individual
Continuous performance optimization on Modern Architectures
  • Grant number:
    536639-2018
  • Fiscal year:
    2020
  • Funding amount:
    $20,400
  • Program type:
    Collaborative Research and Development Grants
Improving Systems Reliability Through Record/Replay
  • Grant number:
    RGPIN-2018-04512
  • Fiscal year:
    2020
  • Funding amount:
    $20,400
  • Program type:
    Discovery Grants Program - Individual
Continuous performance optimization on Modern Architectures
  • Grant number:
    536639-2018
  • Fiscal year:
    2019
  • Funding amount:
    $20,400
  • Program type:
    Collaborative Research and Development Grants
Improving Systems Reliability Through Record/Replay
  • Grant number:
    RGPIN-2018-04512
  • Fiscal year:
    2019
  • Funding amount:
    $20,400
  • Program type:
    Discovery Grants Program - Individual
Improving Systems Reliability Through Record/Replay
  • Grant number:
    DGECR-2018-00321
  • Fiscal year:
    2018
  • Funding amount:
    $20,400
  • Program type:
    Discovery Launch Supplement

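Similar NSFC Grants

Graphon mean field games with partial observation and application to failure detection in distributed systems
  • Grant number:
  • Approval year:
    2025
  • Funding amount:
    CNY 0
  • Program type:
    Provincial/municipal project
Estimating Large Demand Systems with Machine Learning Techniques
  • Grant number:
  • Approval year:
    2024
  • Funding amount:
  • Program type:
    Research Fund for International Scientists
Understanding complicated gravitational physics by simple two-shell systems
  • Grant number:
    12005059
  • Approval year:
    2020
  • Funding amount:
    CNY 240,000
  • Program type:
    Young Scientists Fund
Simulation and certification of the ground state of many-body systems on quantum simulators
  • Grant number:
  • Approval year:
    2020
  • Funding amount:
    CNY 400,000
  • Program type:
Genome-wide systems mapping of the genetic mechanisms of interspecific interactions among three bacterial species
  • Grant number:
    31971398
  • Approval year:
    2019
  • Funding amount:
    CNY 580,000
  • Program type:
    General Program
The formation and evolution of planetary systems in dense star clusters
  • Grant number:
    11043007
  • Approval year:
    2010
  • Funding amount:
    CNY 100,000
  • Program type:
    Special Fund Project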

Similar Overseas Grants

Improving Systems Reliability Through Record/Replay
  • Grant number:
    RGPIN-2018-04512
  • Fiscal year:
    2022
  • Funding amount:
    $20,400
  • Program type:
    Discovery Grants Program - Individual
A Synchrophasor-Assisted Control Framework for Improving Power Quality, Reliability, and Resiliency of Modern Power Systems
  • Grant number:
    RGPIN-2021-02940
  • Fiscal year:
    2022
  • Funding amount:
    $20,400
  • Program type:
    Discovery Grants Program - Individual
Improving Systems Reliability Through Record/Replay
  • Grant number:
    RGPIN-2018-04512
  • Fiscal year:
    2021
  • Funding amount:
    $20,400
  • Program type:
    Discovery Grants Program - Individual
A Synchrophasor-Assisted Control Framework for Improving Power Quality, Reliability, and Resiliency of Modern Power Systems
  • Grant number:
    RGPIN-2021-02940
  • Fiscal year:
    2021
  • Funding amount:
    $20,400
  • Program type:
    Discovery Grants Program - Individual
Improving Systems Reliability Through Record/Replay
  • Grant number:
    RGPIN-2018-04512
  • Fiscal year:
    2020
  • Funding amount:
    $20,400
  • Program type:
    Discovery Grants Program - Individual
Collaborative Research: Improving Energy Reliability by Co-Optimization Planning for Interdependent Electricity and Natural Gas Infrastructure Systems
  • Grant number:
    1906780
  • Fiscal year:
    2019
  • Funding amount:
    $20,400
  • Program type:
    Standard Grant
Improving Systems Reliability Through Record/Replay
  • Grant number:
    RGPIN-2018-04512
  • Fiscal year:
    2019
  • Funding amount:
    $20,400
  • Program type:
    Discovery Grants Program - Individual
Improving Systems Reliability Through Record/Replay
  • Grant number:
    DGECR-2018-00321
  • Fiscal year:
    2018
  • Funding amount:
    $20,400
  • Program type:
    Discovery Launch Supplement
Collaborative Research: Improving Energy Reliability by Co-Optimization Planning for Interdependent Electricity and Natural Gas Infrastructure Systems
  • Grant number:
    1635472
  • Fiscal year:
    2017
  • Funding amount:
    $20,400
  • Program type:
    Standard Grant
Collaborative Research: Improving Energy Reliability by Co-Optimization Planning for Interdependent Electricity and Natural Gas Infrastructure Systems
  • Grant number:
    1635339
  • Fiscal year:
    2017
  • Funding amount:
    $20,400
  • Program type:
    Standard Grant