Collaborative Research: CNS Core: Small: A new framework for building fail-slow fault-tolerant distributed systems

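Basic Information

  • Award number:
    2130560
  • Principal investigator:
  • Amount:
    $250,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2021
  • Funding country:
    United States
  • Start and end dates:
    2021-10-01 to 2024-09-30
  • Project status:
    Completed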

Project Abstract

This project targets a long-standing and increasingly pervasive challenge in distributed system design and implementation: fail-slow fault tolerance. Most existing fault-tolerant distributed systems are developed and tested to tolerate faults where a node has completely stopped, but they often perform poorly under “fail-slow” faults, where a faulty node has not crashed but is operating at a degraded speed far below its normal performance. Fail-slow faults can happen for various reasons, including hardware (e.g., an overheated chip), software (e.g., a process uses up all the memory), network (e.g., a loose cable), and human errors (e.g., an administrator launches too many processes on the same node). In many current fault-tolerant distributed systems, fail-slow nodes can degrade the performance of the entire system by holding up healthy nodes in their execution. For example, a healthy node may keep buffering outbound messages to the slow nodes until it uses up its memory and crashes. Improving fail-slow fault tolerance is important because fail-slow faults have been reported to be common in large-scale distributed systems deployed in modern data centers, and the performance issues they cause are subtle and hard to debug. To help improve this situation, this work will develop a set of novel, transformative technologies, including distributed-system programming support, design patterns, and runtime verification techniques, that will be encapsulated in a unified programming framework and will dramatically improve the performance and fault tolerance of modern distributed systems.
This research may have a major impact on industry and society, since distributed systems are the cornerstones of modern computing infrastructures such as cloud computing, cluster and datacenter technologies, and high-performance computing. In particular, this work will be done in collaboration with widely used distributed databases, specifically MongoDB and TiDB.
The PIs envision this effort as a catalyst for multidisciplinary research and education on distributed systems technologies at Stony Brook University and the University of Illinois. The PIs will use this work as a core that they hope will eventually grow to bring together other faculty of diverse expertise with interests in cloud computing, distributed systems, and software engineering technologies. Both universities are experiencing an unprecedented surge of students in Computer Science. The PIs are working with their departments to broaden the course offerings with multidisciplinary courses in the general area of cloud computing, distributed systems, reliable systems, and software engineering. The PIs will incorporate the topics in this proposal in the courses they are teaching. The PIs have a long-standing commitment to undergraduate education and research, and to broadening the participation of under-represented minorities. They will use this work to involve undergraduates and under-represented students in their research groups.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
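The buffering failure mode mentioned in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the project's framework: the `PeerBuffer` class, the `simulate` function, and the send and drain rates are all hypothetical names and numbers chosen for illustration, and dropping messages at a cap is just one of several possible policies.

```python
from collections import deque


class PeerBuffer:
    """Outbound message queue kept by a healthy node for one peer (hypothetical)."""

    def __init__(self, max_pending=None):
        # max_pending=None models unbounded buffering toward a slow peer.
        self.max_pending = max_pending
        self.pending = deque()
        self.dropped = 0

    def enqueue(self, msg):
        """Queue msg; return False if the cap is reached and msg is dropped."""
        if self.max_pending is not None and len(self.pending) >= self.max_pending:
            self.dropped += 1
            return False
        self.pending.append(msg)
        return True

    def drain(self, n):
        """Simulate the peer consuming up to n queued messages."""
        for _ in range(min(n, len(self.pending))):
            self.pending.popleft()


def simulate(buffer, ticks, send_rate, drain_rate):
    """Per tick: the sender queues send_rate messages, the peer consumes drain_rate."""
    for t in range(ticks):
        for i in range(send_rate):
            buffer.enqueue((t, i))
        buffer.drain(drain_rate)
    return len(buffer.pending)


# A fail-slow peer consumes far fewer messages than the sender produces.
unbounded = simulate(PeerBuffer(), ticks=1000, send_rate=10, drain_rate=1)

bounded_buf = PeerBuffer(max_pending=100)
bounded = simulate(bounded_buf, ticks=1000, send_rate=10, drain_rate=1)

print(unbounded, bounded, bounded_buf.dropped)
```

With these assumed rates, the unbounded queue grows by nine messages per tick for as long as the peer stays slow, which is the memory growth the abstract describes. The capped queue never holds more than 100 messages, at the cost of dropping traffic that a real system would have to retry, reroute, or account for in some other way.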

Project Outcomes

Journal articles (5)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Relational Debugging - Pinpointing Root Causes of Performance Problems
Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker
  • DOI:
  • Publication date:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Yinfang Chen;Xudong Sun;Suman Nath;Ze Yang;Tianyin Xu
  • Corresponding author:
    Yinfang Chen;Xudong Sun;Suman Nath;Ze Yang;Tianyin Xu
Automatic Reliability Testing For Cluster Management Controllers
  • DOI:
  • Publication date:
    2022
  • Journal:
  • Impact factor:
    0
  • Authors:
    Xudong Sun;Wenqing Luo;Jiawei Tyler Gu;Aishwarya Ganesan;Ramnatthan Alagappan;Michael Gasch;Lalith Suresh;Tianyin Xu
  • Corresponding author:
    Xudong Sun;Wenqing Luo;Jiawei Tyler Gu;Aishwarya Ganesan;Ramnatthan Alagappan;Michael Gasch;Lalith Suresh;Tianyin Xu
Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems
  • DOI:
    10.1145/3552326.3587448
  • Publication date:
    2023-05
  • Journal:
  • Impact factor:
    0
  • Authors:
    Lilia Tang;Chaitanya Bhandari;Yongle Zhang;Anna Karanika;Shuyang Ji;Indranil Gupta;Tianyin Xu
  • Corresponding author:
    Lilia Tang;Chaitanya Bhandari;Yongle Zhang;Anna Karanika;Shuyang Ji;Indranil Gupta;Tianyin Xu
DepFast: Orchestrating Code of Quorum Systems

Other publications by Tianyin Xu

Decentralizing Microblogging Services by Differentiating User Traffic Demands
  • DOI:
    10.1515/pik-2012-0065
  • Publication date:
    2013
  • Journal:
  • Impact factor:
    0
  • Authors:
    Lei Jiao;Tianyin Xu;Yang Chen;Xiaoming Fu
  • Corresponding author:
    Xiaoming Fu
Configuration Testing: Testing Configuration Values Together with Code Logic
  • DOI:
  • Publication date:
    2019-05
  • Journal:
  • Impact factor:
    0
  • Authors:
    Tianyin Xu
  • Corresponding author:
    Tianyin Xu
Trend and Attribution Analysis of Runoff Changes in the Weihe River Basin in the Last 50 Years
  • DOI:
    10.3390/w14010047
  • Publication date:
    2021-12
  • Journal:
  • Impact factor:
    3.4
  • Authors:
    Junjie Xu;Xichao Gao;Zhiyong Yang;Tianyin Xu
  • Corresponding author:
    Tianyin Xu
A Survey of Clustering Methods in Mining Data Streaming
  • DOI:
  • Publication date:
    2007
  • Journal:
  • Impact factor:
    0
  • Authors:
    Tianyin Xu
  • Corresponding author:
    Tianyin Xu

Other grants held by Tianyin Xu

CAREER: Rethinking Configuration Management for Cloud and Datacenter Systems
  • Award number:
    2145295
  • Fiscal year:
    2022
  • Award amount:
    $250,000
  • Project type:
    Continuing Grant
SHF: Small: Science and Tools for Intelligent Developer Testing
  • Award number:
    1816615
  • Fiscal year:
    2018
  • Award amount:
    $250,000
  • Project type:
    Standard Grant

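Similar grants from the National Natural Science Foundation of China

Research on Quantum Field Theory without a Lagrangian Description
  • Award number:
    24ZR1403900
  • Approval year:
    2024
  • Award amount:
    CNY 0
  • Project type:
    Provincial or municipal project
Cell Research
  • Award number:
    31224802
  • Approval year:
    2012
  • Award amount:
    CNY 240,000
  • Project type:
    Special fund project
Cell Research
  • Award number:
    31024804
  • Approval year:
    2010
  • Award amount:
    CNY 240,000
  • Project type:
    Special fund project
Cell Research
  • Award number:
    30824808
  • Approval year:
    2008
  • Award amount:
    CNY 240,000
  • Project type:
    Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
  • Award number:
    10774081
  • Approval year:
    2007
  • Award amount:
    CNY 450,000
  • Project type:
    General program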

Similar overseas grants

Collaborative Research: CNS Core: Small: A Compilation System for Mapping Deep Learning Models to Tensorized Instructions (DELITE)
  • Award number:
    2230945
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Project type:
    Standard Grant
Collaborative Research: CNS Core: Medium: Movement of Computation and Data in Splitkernel-disaggregated, Data-intensive Systems
  • Award number:
    2406598
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Project type:
    Continuing Grant
Collaborative Research: CNS Core: Small: SmartSight: an AI-Based Computing Platform to Assist Blind and Visually Impaired People
  • Award number:
    2418188
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Project type:
    Standard Grant
Collaborative Research: CNS Core: Medium: Reconfigurable Kernel Datapaths with Adaptive Optimizations
  • Award number:
    2345339
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Project type:
    Standard Grant
Collaborative Research: NSF-AoF: CNS Core: Small: Towards Scalable and AI-based Solutions for Beyond-5G Radio Access Networks
  • Award number:
    2225578
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Project type:
    Standard Grant
Collaborative Research: CNS Core: Small: Creating An Extensible Internet Through Interposition
  • Award number:
    2242503
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Project type:
    Standard Grant
Collaborative Research: CNS Core: Small: Adaptive Smart Surfaces for Wireless Channel Morphing to Enable Full Multiplexing and Multi-user Gains
  • Award number:
    2343959
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Project type:
    Standard Grant
Collaborative Research: CNS Core: Small: Efficient Ways to Enlarge Practical DNA Storage Capacity by Integrating Bio-Computer Technologies
  • Award number:
    2343863
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Project type:
    Standard Grant
Collaborative Research: CNS Core: Small: A Compilation System for Mapping Deep Learning Models to Tensorized Instructions (DELITE)
  • Award number:
    2341378
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Project type:
    Standard Grant
Collaborative Research: CNS Core: Medium: Innovating Volumetric Video Streaming with Motion Forecasting, Intelligent Upsampling, and QoE Modeling
  • Award number:
    2409008
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Project type:
    Continuing Grant