Unreliable Failure Detectors for Reliable Distributed Systems

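Basic information

  • Award number:
    9402896
  • Principal investigator:
  • Amount:
    $230,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Continuing grant
  • Fiscal year:
    1995
  • Funding country:
    United States
  • Start and end dates:
    1995-05-01 to 1998-10-31
  • Project status:
    Completed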

Project abstract

The starting point for this research is a fundamental problem in fault-tolerant distributed computing: reaching agreement among processes in a system that is subject to failures. It is well known that this problem, called Consensus, has no deterministic solution in asynchronous systems, even if it is assumed that communication is reliable and no more than one process may fail. The impossibility of solving Consensus (and other related problems such as Atomic Broadcast) is one of the most severe obstacles to implementing fault-tolerant applications in asynchronous systems. In recent work the PI has introduced a novel approach to circumvent such impossibility results: he showed that unreliable failure detectors can be used to solve Consensus (and Atomic Broadcast), even if the information that they provide about failures is highly inaccurate, e.g., even if they make an infinite number of mistakes. Since such failure detectors can be implemented in realistic distributed systems, and since various considerations make the asynchronous models especially attractive, this work suggests an approach to fault-tolerance that is viable in practice. The objectives of this research are to broaden the applicability of this approach by removing the limitations of the earlier work, and to explore in more concrete terms its practicability. Specific goals include: (1) tolerating communication failures, including network partitions (the earlier work assumed reliable links); (2) tolerating process failures of various types (the earlier work dealt with crash failures only); (3) considering shared-memory systems (the earlier work dealt with message-passing systems); and (4) solving other problems that are central to fault-tolerant distributed computing, including Group Membership and Group Multicasts (the earlier work solved Consensus and Atomic Broadcast). In order to assess the cost and benefit of using unreliable failure detectors, complexity questions are also explored. Finally, the practicality of this approach is validated by implementation on an experimental platform.
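
To make the idea of an unreliable failure detector concrete, here is a minimal sketch of one common way such a detector might be built. It assumes periodic heartbeat messages and a per-peer timeout that grows after each false suspicion; neither mechanism is specified in the abstract, and the class, method, and parameter names (`HeartbeatDetector`, `on_heartbeat`, `check`, `backoff`) are hypothetical. It is an illustration of a detector that can make mistakes and later retract them, not the PI's construction. Time is passed in explicitly so the behavior is deterministic.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Timeout-based failure detector that may wrongly suspect live peers.

    Hypothetical sketch: a peer is suspected when no heartbeat has arrived
    within its current timeout. A heartbeat from a suspected peer shows the
    suspicion was a mistake, so it is retracted and the timeout is lengthened.
    """

    peers: list[str]
    initial_timeout: float = 1.0  # assumed starting timeout (arbitrary units)
    backoff: float = 2.0          # assumed growth factor after a mistake
    last_seen: dict[str, float] = field(default_factory=dict)
    timeout: dict[str, float] = field(default_factory=dict)
    suspected: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        for peer in self.peers:
            self.last_seen[peer] = 0.0
            self.timeout[peer] = self.initial_timeout

    def on_heartbeat(self, peer: str, now: float) -> None:
        """Record a heartbeat; retract a wrong suspicion if there was one."""
        self.last_seen[peer] = now
        if peer in self.suspected:
            self.suspected.discard(peer)
            self.timeout[peer] *= self.backoff

    def check(self, now: float) -> set[str]:
        """Suspect every peer that has been silent longer than its timeout."""
        for peer in self.peers:
            if now - self.last_seen[peer] > self.timeout[peer]:
                self.suspected.add(peer)
        return set(self.suspected)


def demo() -> tuple[set[str], set[str]]:
    """p1 is merely slow, p2 has stopped sending heartbeats."""
    fd = HeartbeatDetector(peers=["p1", "p2"])
    fd.on_heartbeat("p1", now=0.5)
    fd.on_heartbeat("p2", now=0.5)
    first = fd.check(now=2.0)        # both silent for 1.5 > 1.0
    fd.on_heartbeat("p1", now=2.1)   # late heartbeat: suspicion retracted
    second = fd.check(now=4.0)       # p1 now has a timeout of 2.0
    return first, second
```

In `demo`, both peers are suspected at the first check even though `p1` is alive, which is the kind of inaccuracy the abstract refers to. After the late heartbeat, `p1` is trusted again and given a longer timeout, while `p2` remains suspected.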

Project outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Sam Toueg

The minimum information about failures for solving non-local tasks in message-passing systems
  • DOI:
    10.1007/s00446-011-0146-4
  • Published:
    2011-11-17
  • Journal:
  • Impact factor:
    2.100
  • Authors:
    Carole Delporte-Gallet;Hugues Fauconnier;Sam Toueg
  • Corresponding author:
    Sam Toueg
Adaptive progress: a gracefully-degrading liveness property
  • DOI:
    10.1007/s00446-010-0106-4
  • Published:
    2010-06-25
  • Journal:
  • Impact factor:
    2.100
  • Authors:
    Marcos K. Aguilera;Sam Toueg
  • Corresponding author:
    Sam Toueg
The weakest failure detector to solve nonuniform consensus
  • DOI:
    10.1007/s00446-006-0019-4
  • Published:
    2007-02-02
  • Journal:
  • Impact factor:
    2.100
  • Authors:
    Jonathan Eisler;Vassos Hadzilacos;Sam Toueg
  • Corresponding author:
    Sam Toueg
On implementing SWMR registers from SWSR registers in systems with Byzantine failures
  • DOI:
    10.1007/s00446-024-00465-5
  • Published:
    2024-06-06
  • Journal:
  • Impact factor:
    2.100
  • Authors:
    Xing Hu;Sam Toueg
  • Corresponding author:
    Sam Toueg

Other grants held by Sam Toueg

Broadcast and Multicast: Two Paradigms for Fault-Tolerant Distributed Computing
  • Award number:
    9102231
  • Fiscal year:
    1991
  • Amount:
    $230,000
  • Project type:
    Standard Grant
Abstractions that Simplify the Design and Verification of Fault-Tolerant Distributed Protocols
  • Award number:
    8901780
  • Fiscal year:
    1989
  • Amount:
    $230,000
  • Project type:
    Continuing grant
Fault-Tolerant Distributed Computing Systems
  • Award number:
    8601864
  • Fiscal year:
    1986
  • Amount:
    $230,000
  • Project type:
    Continuing grant
Routing, Broadcasting and Deadlock-Prevention in Packet-Switching Networks (Computer Research)
  • Award number:
    8303135
  • Fiscal year:
    1983
  • Amount:
    $230,000
  • Project type:
    Continuing grant

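Similar NSFC (National Natural Science Foundation of China) grants

Graphon mean field games with partial observation and application to failure detection in distributed systems
  • Award number:
  • Approval year:
    2025
  • Amount:
    0.0 (in units of CNY 10,000)
  • Project type:
    Provincial/municipal project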

Similar overseas grants

LIB Sparks - Gases, sparks and flames - a numerical study of lithium-ion battery failure in closed spaces and its mitigation
  • Award number:
    EP/Y027639/1
  • Fiscal year:
    2024
  • Amount:
    $230,000
  • Project type:
    Fellowship
Alterations in macrophage metabolism in heart failure with preserved ejection
  • Award number:
    502586
  • Fiscal year:
    2024
  • Amount:
    $230,000
  • Project type:
SBIR Phase II: In-vivo validation of a volume-manufacturable and factory-calibrated wearable NT-proBNP monitoring system for heart failure treatment
  • Award number:
    2335105
  • Fiscal year:
    2024
  • Amount:
    $230,000
  • Project type:
    Cooperative Agreement
High-rise landscapes: The afterlives of tower block 'failure' and rethinking urban futures
  • Award number:
    MR/Y003586/1
  • Fiscal year:
    2024
  • Amount:
    $230,000
  • Project type:
    Fellowship
Predictive Assessment of Material Failure
  • Award number:
    2904642
  • Fiscal year:
    2024
  • Amount:
    $230,000
  • Project type:
    Studentship
A hybrid Deep Learning-assisted Finite Element technique to predict dynamic failure evolution in advanced ceramics (DeLFE)
  • Award number:
    EP/Y004671/1
  • Fiscal year:
    2024
  • Amount:
    $230,000
  • Project type:
    Research Grant
CAREER: Recycled Polymers of Enhanced Strength and Toughness: Predicting Failure and Unraveling Deformation to Enable Circular Transitions
  • Award number:
    2338508
  • Fiscal year:
    2024
  • Amount:
    $230,000
  • Project type:
    Standard Grant
CAREER: Understanding Fiber Bundle Failure Mechanics for Ultra-high Reliability Applications
  • Award number:
    2339223
  • Fiscal year:
    2024
  • Amount:
    $230,000
  • Project type:
    Standard Grant
Computational MultiPhysics Analysis of 3D Structural Damage and Failure
  • Award number:
    DP240101471
  • Fiscal year:
    2024
  • Amount:
    $230,000
  • Project type:
    Discovery Projects
Understanding the link between bone marrow failure and chronic inflammation through the lens of VEXAS syndrome
  • Award number:
    MR/Y011945/1
  • Fiscal year:
    2024
  • Amount:
    $230,000
  • Project type:
    Research Grant