Collaborative Research: CNS Core: Small: A new framework for building fail-slow fault-tolerant distributed systems
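Basic information
- Award number: 2130590
- Principal investigator:
- Amount: $249,500
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2021
- Funding country: United States
- Period: 2021-10-01 to 2024-09-30
- Project status: Completed
- Source:
- Keywords:
Project abstract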
This project targets a long-standing and increasingly pervasive challenge in distributed system design and implementation: fail-slow fault tolerance. Most existing fault-tolerant distributed systems are developed and tested to tolerate faults in which a node has stopped completely, but they often perform poorly under “fail-slow” faults, in which a faulty node has not crashed but is operating at a degraded speed far below its normal performance. Fail-slow faults can arise for many reasons, including hardware (e.g., an overheated chip), software (e.g., a process that uses up all the memory), network (e.g., a loose cable), and human error (e.g., an administrator launching too many processes on the same node). In many current fault-tolerant distributed systems, fail-slow nodes can degrade the performance of the entire system by holding up healthy nodes in their execution. For example, a healthy node may keep buffering outbound messages to the slow nodes until it uses up its memory and crashes. Improving fail-slow fault tolerance matters because fail-slow faults have been reported to be common in large-scale distributed systems deployed in modern data centers, and the performance issues they cause are often hidden and hard to debug. To help improve this situation, this work will develop a set of novel, transformative technologies, including distributed-system programming support, design patterns, and runtime verification techniques, that will be encapsulated in a unified programming framework and will dramatically improve the performance and fault tolerance of modern distributed systems.
This research may have a major impact on industry and society, since distributed systems are the cornerstones of modern computing infrastructures such as cloud computing, cluster and datacenter technologies, and high-performance computing. In particular, this work will be done in collaboration with widely used distributed databases, specifically MongoDB and TiDB.
The PIs envision this effort as a catalyst for multidisciplinary research and education on distributed systems technologies at Stony Brook University and the University of Illinois. The PIs will use this work as a core that they hope will eventually grow to bring together other faculty with diverse expertise and interests in cloud computing, distributed systems, and software engineering technologies. Both universities are experiencing an unprecedented surge of students in Computer Science. The PIs are working with their departments to broaden the course offerings with multidisciplinary courses in the general areas of cloud computing, distributed systems, reliable systems, and software engineering. The PIs will incorporate the topics in this proposal into the courses they teach. The PIs have a long-standing commitment to undergraduate education and research, and to broadening the participation of under-represented minorities. They will use this work to involve undergraduates and under-represented students in their research groups.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project outcomes
Journal papers (4)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Rolis: a software approach to efficiently replicating multi-core transactions
- DOI: 10.1145/3492321.3519561
- Published: 2022-03
- Journal:
- Impact factor: 0
- Authors: Weihai Shen; Ansh Khanna; Sebastian Angel; S. Sen; Shuai Mu
- Corresponding authors: Weihai Shen; Ansh Khanna; Sebastian Angel; S. Sen; Shuai Mu
DepFast: Orchestrating Code of Quorum Systems
- DOI:
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: Luo, Xuhao; Shen, Weihai; Mu, Shuai; Xu, Tianyin
- Corresponding author: Xu, Tianyin
NCC: Natural Concurrency Control for Strictly Serializable Datastores by Avoiding the Timestamp-Inversion Pitfall
- DOI: 10.48550/arxiv.2305.14270
- Published: 2023-05
- Journal:
- Impact factor: 0
- Authors: Haonan Lu; Shuai Mu; S. Sen; Wyatt Lloyd
- Corresponding authors: Haonan Lu; Shuai Mu; S. Sen; Wyatt Lloyd
Waverunner: An Elegant Approach to Hardware Acceleration of State Machine Replication
- DOI:
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Alimadadi, Mohammadreza; Mai, Hieu; Cho, Shenghsun; Ferdman, Michael; Milder, Peter; Mu, Shuai
- Corresponding author: Mu, Shuai
Other publications by Shuai Mu
CDW-MH Phase Transition in Quasi-One-Dimensional Halogen-Bridged Metal Complexes & Recent Progresses in Halogen-Bridged Metal Complexes (Toward Electronic Devices)
- DOI:
- Published: 2015
- Journal:
- Impact factor: 0
- Authors: Shuai Mu; Shinya Takaishi; Masahiro Yamashita; 高石慎也; 高石慎也
- Corresponding author: 高石慎也
住民との協働における地方自治体(職員)が持つべき戦略的視点-ブラジル・クリチバ市における開発的実践の分析から-
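Strategic perspectives that local governments (officials) should hold when collaborating with residents: an analysis of developmental practice in Curitiba, Brazil
- DOI:
- Published: 2016
- Journal:
- Impact factor: 0
- Authors: Shuai Mu; Shinya Takaishi; Masahiro Yamashita; 南 友二郎; 南 友二郎; 南 友二郎; 南 友二郎; 南 友二郎
- Corresponding author: 南 友二郎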
Synergistic surface ligand modification of Ni-Pt bimetallic nanozymes: Enhanced catalytic activity and versatile detection of penicillin
- DOI: 10.1016/j.snb.2024.136724
- Published: 2025-01-15
- Journal:
- Impact factor:
- Authors: Shuai Mu; Yi Yang; Taihe Han; Jia Liu; Zixiang Zhu; Haixue Zheng; Haixia Zhang
- Corresponding author: Haixia Zhang
An Improved, Scalable and Impurity-Free Process for Lixivaptan
- DOI: 10.1002/jhet.2176
- Published: 2015
- Journal:
- Impact factor: 2.4
- Authors: Shuai Mu; Duan Niu; Y. Liu; Zhang Dashuai; Dengke Liu; Chang
- Corresponding author: Chang
DPh-BTBT/P2V2TT共結晶の合成・構造および光物性
Synthesis, structure, and optical properties of DPh-BTBT/P2V2TT cocrystals
- DOI:
- Published: 2014
- Journal:
- Impact factor: 0
- Authors: Shuai Mu; 高石慎也; 山下正廣
- Corresponding author: 山下正廣
Other grants held by Shuai Mu
Collaborative Research: CISE: Large: Systems Support for Run-Anywhere Serverless
- Award number: 2321725
- Fiscal year: 2023
- Award amount: $249,500
- Project type: Continuing Grant
CAREER: Rethinking Replication in Highly Available and Reliable Data Stores
- Award number: 2238768
- Fiscal year: 2023
- Award amount: $249,500
- Project type: Continuing Grant
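Similar NSFC (National Natural Science Foundation of China) grants
Research on Quantum Field Theory without a Lagrangian Description
- Award number: 24ZR1403900
- Year approved: 2024
- Award amount: CNY 0
- Project type: Provincial/municipal project
Cell Research
- Award number: 31224802
- Year approved: 2012
- Award amount: CNY 240,000
- Project type: Special fund project
Cell Research
- Award number: 31024804
- Year approved: 2010
- Award amount: CNY 240,000
- Project type: Special fund project
Cell Research
- Award number: 30824808
- Year approved: 2008
- Award amount: CNY 240,000
- Project type: Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
- Award number: 10774081
- Year approved: 2007
- Award amount: CNY 450,000
- Project type: General Program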
Similar overseas grants
Collaborative Research: CNS Core: Medium: Reconfigurable Kernel Datapaths with Adaptive Optimizations
- Award number: 2345339
- Fiscal year: 2023
- Award amount: $249,500
- Project type: Standard Grant
Collaborative Research: CNS Core: Small: A Compilation System for Mapping Deep Learning Models to Tensorized Instructions (DELITE)
- Award number: 2230945
- Fiscal year: 2023
- Award amount: $249,500
- Project type: Standard Grant
Collaborative Research: NSF-AoF: CNS Core: Small: Towards Scalable and AI-based Solutions for Beyond-5G Radio Access Networks
- Award number: 2225578
- Fiscal year: 2023
- Award amount: $249,500
- Project type: Standard Grant
Collaborative Research: CNS Core: Medium: Movement of Computation and Data in Splitkernel-disaggregated, Data-intensive Systems
- Award number: 2406598
- Fiscal year: 2023
- Award amount: $249,500
- Project type: Continuing Grant
Collaborative Research: CNS Core: Small: SmartSight: an AI-Based Computing Platform to Assist Blind and Visually Impaired People
- Award number: 2418188
- Fiscal year: 2023
- Award amount: $249,500
- Project type: Standard Grant
Collaborative Research: CNS Core: Small: Creating An Extensible Internet Through Interposition
- Award number: 2242503
- Fiscal year: 2023
- Award amount: $249,500
- Project type: Standard Grant
Collaborative Research: CNS Core: Small: Adaptive Smart Surfaces for Wireless Channel Morphing to Enable Full Multiplexing and Multi-user Gains
- Award number: 2343959
- Fiscal year: 2023
- Award amount: $249,500
- Project type: Standard Grant
Collaborative Research: CNS Core: Small: Efficient Ways to Enlarge Practical DNA Storage Capacity by Integrating Bio-Computer Technologies
- Award number: 2343863
- Fiscal year: 2023
- Award amount: $249,500
- Project type: Standard Grant
Collaborative Research: CNS Core: Small: A Compilation System for Mapping Deep Learning Models to Tensorized Instructions (DELITE)
- Award number: 2341378
- Fiscal year: 2023
- Award amount: $249,500
- Project type: Standard Grant
Collaborative Research: CNS Core: Medium: Innovating Volumetric Video Streaming with Motion Forecasting, Intelligent Upsampling, and QoE Modeling
- Award number: 2409008
- Fiscal year: 2023
- Award amount: $249,500
- Project type: Continuing Grant