Collaborative Research: CNS Core: Small: A new framework for building fail-slow fault-tolerant distributed systems

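Basic information

  • Award number:
    2130560
  • Principal investigator:
  • Amount:
    $250,000
  • Institution:
  • Institution country:
    United States
  • Award type:
    Standard Grant
  • Fiscal year:
    2021
  • Funding country:
    United States
  • Period:
    2021-10-01 to 2024-09-30
  • Status:
    Completed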

Project abstract

This project targets a long-standing and increasingly pervasive challenge in distributed system design and implementation: fail-slow fault tolerance. Most existing fault-tolerant distributed systems are developed and tested to tolerate faults in which a node has completely stopped, but they often perform poorly under “fail-slow” faults, where a faulty node has not crashed but is operating at a degraded speed far below its normal performance. Fail-slow faults can happen for various reasons, including hardware (e.g., an overheated chip), software (e.g., a process that uses up all the memory), network (e.g., a loose cable), and human error (e.g., an administrator launching too many processes on the same node). In many current fault-tolerant distributed systems, fail-slow nodes can degrade the performance of the entire system by holding up healthy nodes in their execution. For example, a healthy node may keep buffering outbound messages to the slow nodes until it uses up its memory and crashes. Improving fail-slow fault tolerance is important because fail-slow faults have been reported to be common in large-scale distributed systems deployed in modern data centers, and the performance issues they cause are more hidden and hard to debug. To help improve this situation, this work will develop a set of novel, transformative technologies, including distributed-system programming support, design patterns, and runtime verification techniques, that will be encapsulated in a unified programming framework and will dramatically improve the performance and fault tolerance of modern distributed systems.
This research may have a major impact on industry and society, since distributed systems are the cornerstones of modern computing infrastructures such as cloud computing, cluster and datacenter technologies, and high-performance computing. In particular, this work will be done in collaboration with widely used distributed databases, specifically MongoDB and TiDB.
The PIs envision this effort as a catalyst for multidisciplinary research and education on distributed systems technologies at Stony Brook University and the University of Illinois. The PIs will use this work as a core that they hope will eventually grow to bring together other faculty of diverse expertise with interests in cloud computing, distributed systems, and software engineering technologies. Both universities are experiencing an unprecedented surge of students in Computer Science. The PIs are working with their departments to broaden the course offerings with multidisciplinary courses in the general areas of cloud computing, distributed systems, reliable systems, and software engineering. The PIs will incorporate the topics in this proposal into the courses they teach. The PIs have a long-standing commitment to undergraduate education and research, and to broadening the participation of under-represented minorities. They will use this work to involve undergraduates and under-represented students in their research groups.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
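
As an illustration of the buffering failure mode mentioned in the abstract, the following Python sketch contrasts a healthy peer with a fail-slow peer behind a capped outbound buffer. It is not taken from the project's framework: `BoundedOutbox`, `simulate`, and all parameters are hypothetical names chosen for this example, and dropping messages at the cap is just one possible policy, assumed here for simplicity.

```python
from collections import deque


class BoundedOutbox:
    """Per-peer outbound buffer with a hard cap (hypothetical helper)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.dropped = 0

    def enqueue(self, message):
        """Buffer a message; return False if the cap is reached."""
        if len(self.queue) >= self.capacity:
            # The peer is too far behind: shed load instead of
            # letting the sender's memory grow without bound.
            self.dropped += 1
            return False
        self.queue.append(message)
        return True

    def drain(self, n):
        """Simulate the peer consuming up to n buffered messages."""
        sent = []
        for _ in range(min(n, len(self.queue))):
            sent.append(self.queue.popleft())
        return sent


def simulate(ticks, produce_per_tick, drain_per_tick, capacity):
    """Return (final backlog, dropped count) for one peer."""
    outbox = BoundedOutbox(capacity)
    for t in range(ticks):
        for i in range(produce_per_tick):
            outbox.enqueue((t, i))
        outbox.drain(drain_per_tick)
    return len(outbox.queue), outbox.dropped


if __name__ == "__main__":
    # Healthy peer: consumes as fast as the sender produces.
    print("healthy:", simulate(100, 10, 10, 50))
    # Fail-slow peer: consumes at one tenth of the production rate.
    print("fail-slow:", simulate(100, 10, 1, 50))
```

Without the cap, the backlog for the slow peer in this simulation would grow by nine messages per tick. The cap keeps the sender's memory use fixed at the cost of shedding messages, which a real system would have to retransmit or otherwise account for.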

Project outcomes

Journal articles (5)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Relational Debugging - Pinpointing Root Causes of Performance Problems
Push-Button Reliability Testing for Cloud-Backed Applications with Rainmaker
  • DOI:
  • Published:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Yinfang Chen; Xudong Sun; Suman Nath; Ze Yang; Tianyin Xu
  • Corresponding author:
    Yinfang Chen; Xudong Sun; Suman Nath; Ze Yang; Tianyin Xu
Automatic Reliability Testing For Cluster Management Controllers
  • DOI:
  • Published:
    2022
  • Journal:
  • Impact factor:
    0
  • Authors:
    Xudong Sun; Wenqing Luo; Jiawei Tyler Gu; Aishwarya Ganesan; Ramnatthan Alagappan; Michael Gasch; Lalith Suresh; Tianyin Xu
  • Corresponding author:
    Xudong Sun; Wenqing Luo; Jiawei Tyler Gu; Aishwarya Ganesan; Ramnatthan Alagappan; Michael Gasch; Lalith Suresh; Tianyin Xu
Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems
  • DOI:
    10.1145/3552326.3587448
  • Published:
    2023-05
  • Journal:
  • Impact factor:
    0
  • Authors:
    Lilia Tang; Chaitanya Bhandari; Yongle Zhang; Anna Karanika; Shuyang Ji; Indranil Gupta; Tianyin Xu
  • Corresponding author:
    Lilia Tang; Chaitanya Bhandari; Yongle Zhang; Anna Karanika; Shuyang Ji; Indranil Gupta; Tianyin Xu
DepFast: Orchestrating Code of Quorum Systems

Other publications by Tianyin Xu

Decentralizing Microblogging Services by Differentiating User Traffic Demands
  • DOI:
    10.1515/pik-2012-0065
  • Published:
    2013
  • Journal:
  • Impact factor:
    0
  • Authors:
    Lei Jiao; Tianyin Xu; Yang Chen; Xiaoming Fu
  • Corresponding author:
    Xiaoming Fu
Configuration Testing: Testing Configuration Values Together with Code Logic
  • DOI:
  • Published:
    2019-05
  • Journal:
  • Impact factor:
    0
  • Authors:
    Tianyin Xu
  • Corresponding author:
    Tianyin Xu
Trend and Attribution Analysis of Runoff Changes in the Weihe River Basin in the Last 50 Years
  • DOI:
    10.3390/w14010047
  • Published:
    2021-12
  • Journal:
  • Impact factor:
    3.4
  • Authors:
    Junjie Xu; Xichao Gao; Zhiyong Yang; Tianyin Xu
  • Corresponding author:
    Tianyin Xu
A Survey of Clustering Methods in Mining Data Streaming
  • DOI:
  • Published:
    2007
  • Journal:
  • Impact factor:
    0
  • Authors:
    Tianyin Xu
  • Corresponding author:
    Tianyin Xu

Other grants held by Tianyin Xu

CAREER: Rethinking Configuration Management for Cloud and Datacenter Systems
  • Award number:
    2145295
  • Fiscal year:
    2022
  • Award amount:
    $250,000
  • Award type:
    Continuing Grant
SHF: Small: Science and Tools for Intelligent Developer Testing
  • Award number:
    1816615
  • Fiscal year:
    2018
  • Award amount:
    $250,000
  • Award type:
    Standard Grant

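Similar NSFC grants

Research on Quantum Field Theory without a Lagrangian Description
  • Award number:
    24ZR1403900
  • Approval year:
    2024
  • Award amount:
    CNY 0
  • Award type:
    Provincial/municipal project
Cell Research
  • Award number:
    31224802
  • Approval year:
    2012
  • Award amount:
    CNY 240,000
  • Award type:
    Special fund project
Cell Research
  • Award number:
    31024804
  • Approval year:
    2010
  • Award amount:
    CNY 240,000
  • Award type:
    Special fund project
Cell Research
  • Award number:
    30824808
  • Approval year:
    2008
  • Award amount:
    CNY 240,000
  • Award type:
    Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
  • Award number:
    10774081
  • Approval year:
    2007
  • Award amount:
    CNY 450,000
  • Award type:
    General program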

Similar overseas grants

Collaborative Research: CNS Core: Medium: Reconfigurable Kernel Datapaths with Adaptive Optimizations
  • Award number:
    2345339
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Award type:
    Standard Grant
Collaborative Research: CNS Core: Small: A Compilation System for Mapping Deep Learning Models to Tensorized Instructions (DELITE)
  • Award number:
    2230945
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Award type:
    Standard Grant
Collaborative Research: NSF-AoF: CNS Core: Small: Towards Scalable and AI-based Solutions for Beyond-5G Radio Access Networks
  • Award number:
    2225578
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Award type:
    Standard Grant
Collaborative Research: CNS Core: Medium: Movement of Computation and Data in Splitkernel-disaggregated, Data-intensive Systems
  • Award number:
    2406598
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Award type:
    Continuing Grant
Collaborative Research: CNS Core: Small: SmartSight: an AI-Based Computing Platform to Assist Blind and Visually Impaired People
  • Award number:
    2418188
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Award type:
    Standard Grant
Collaborative Research: CNS Core: Small: Creating An Extensible Internet Through Interposition
  • Award number:
    2242503
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Award type:
    Standard Grant
Collaborative Research: CNS Core: Small: Adaptive Smart Surfaces for Wireless Channel Morphing to Enable Full Multiplexing and Multi-user Gains
  • Award number:
    2343959
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Award type:
    Standard Grant
Collaborative Research: CNS Core: Small: Efficient Ways to Enlarge Practical DNA Storage Capacity by Integrating Bio-Computer Technologies
  • Award number:
    2343863
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Award type:
    Standard Grant
Collaborative Research: CNS Core: Small: A Compilation System for Mapping Deep Learning Models to Tensorized Instructions (DELITE)
  • Award number:
    2341378
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Award type:
    Standard Grant
Collaborative Research: CISE-MSI: RCBP-RF: CNS: ESD4CDaT - Efficient System Design for Cancer Detection and Treatment
  • Award number:
    2318573
  • Fiscal year:
    2023
  • Award amount:
    $250,000
  • Award type:
    Standard Grant