CNS Core: Small: Testing and detecting software upgrade failures in data-intensive distributed systems
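Basic information
- Award number: 2300562
- Principal investigator:
- Amount: $600,000
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2023
- Funding country: United States
- Period: 2023-10-01 to 2026-09-30
- Project status: Ongoing
- Source:
- Keywords:
Project abstract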
In the current big data era, Internet services are often built on top of data-intensive distributed systems. Such systems must go through frequent software upgrades as vendors add new features, improve performance, and deploy patches. With the rise of continuous deployment in industry, a major Internet company can perform thousands of distributed system software deployments in a single day. Unfortunately, distributed systems can experience upgrade failures, that is, failures that occur during a software upgrade. These failures often have large-scale impact because the upgrade is performed on the entire system. In production they are typically mitigated with canary deployment, which slowly rolls out updates from a small scale to the entire cluster and downgrades if a failure is encountered. However, canary deployment easily takes hours and creates a dilemma between safe and fast upgrades. In addition, many upgrade failures have persistent impact and cannot be easily resolved by downgrading. Despite the severe consequences of upgrade failures and the challenges faced by production mitigation techniques, no existing testing or program analysis technique focuses on systematically testing and analyzing the distributed system upgrade procedure. This work proposes to develop such techniques, optimized to detect upgrade failures at an early stage by exploiting the unique properties of the distributed system software upgrade procedure. Data-intensive distributed systems deployed in public or private clouds are now a cornerstone of many critical computing systems. The proposed techniques should dramatically improve the reliability of data-intensive distributed systems during upgrades and, consequently, reduce service disruptions and improve the availability of cloud systems. In addition, a more reliable upgrade procedure will lead to more timely feedback about new features in production, which is critical for developers' productivity and the quality of the resulting software.
In this project we plan to (1) implement differential testing between two standard distributed system upgrade procedures: full-stop upgrade and rolling upgrade, (2) explore using the source code difference between versions to design differential test oracles, feedback metrics, and input mutation strategies that are specially tuned to trigger and detect upgrade failures, (3) design static program analysis guided by the source code difference to detect data format incompatibilities between versions, and (4) validate the proposed testing and detection techniques through direct experimentation on real-world data-intensive distributed systems. The proposed fault localization and static analysis techniques will reduce the valuable time and effort that developers spend on root cause diagnosis, which is extremely challenging for bugs in distributed systems. All products of the project will be open sourced to ensure a widespread impact.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
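As an illustration of item (1) in the abstract, the following is a minimal sketch of differential testing between a full-stop upgrade and a rolling upgrade. It is not the project's tool: the toy fully replicated cluster, the two record formats, the version numbers, and every name (`Node`, `run`, `full_stop`, `rolling`, `differential_test`) are hypothetical, chosen only to show how replaying one workload under both procedures can expose a divergence.

```python
from dataclasses import dataclass, field

# Hypothetical formats: v1 writes "key=value"; v2 switches to "key|value".
def encode(version, key, value):
    return f"{key}={value}" if version == 1 else f"{key}|{value}"

def decode(version, raw):
    if "=" in raw:                       # old format, readable by v1 and v2
        return tuple(raw.split("=", 1))
    if version >= 2 and "|" in raw:      # new format, readable by v2 only
        return tuple(raw.split("|", 1))
    raise ValueError(f"v{version} cannot parse {raw!r}")

@dataclass
class Node:
    version: int = 1
    log: list = field(default_factory=list)  # fully replicated raw records

    def read_all(self):
        return dict(decode(self.version, raw) for raw in self.log)

def full_stop(nodes, step, start=2):
    """Upgrade every node at once, so versions never mix."""
    if step == start:
        for node in nodes:
            node.version = 2

def rolling(nodes, step, start=2, gap=2):
    """Upgrade one node every `gap` steps while requests keep flowing."""
    offset = step - start
    if offset >= 0 and offset % gap == 0 and offset // gap < len(nodes):
        nodes[offset // gap].version = 2

def run(workload, procedure, size=3):
    """Replay the workload under one upgrade procedure; return read results."""
    nodes = [Node() for _ in range(size)]
    results = []
    for step, (op, key, value) in enumerate(workload):
        procedure(nodes, step)           # may upgrade zero or more nodes
        target = nodes[step % size]      # round-robin request routing
        if op == "put":
            raw = encode(target.version, key, value)
            for node in nodes:
                node.log.append(raw)
        else:
            try:
                results.append(target.read_all().get(key))
            except ValueError:
                results.append("READ_ERROR")
    return results

def differential_test(workload):
    """Oracle: both procedures should yield the same client-visible reads."""
    a, b = run(workload, full_stop), run(workload, rolling)
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

WORKLOAD = [("put", "a", "1"), ("put", "b", "2"), ("put", "c", "3"),
            ("put", "d", "4"), ("get", "a", None), ("get", "d", None),
            ("get", "c", None)]

if __name__ == "__main__":
    print(differential_test(WORKLOAD))   # [(1, '4', 'READ_ERROR')]
```

In this toy setup the two runs diverge on the second read: under the rolling upgrade a node still on v1 is asked to read a record written by an already upgraded node, which cannot happen under the full-stop upgrade. A real harness would drive actual system binaries and a far richer workload; the sketch only shows the shape of the oracle.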
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Yongle Zhang
Crystal structure of 4-{[(1H-1,2,4-triazol-1-yl)methyl-]sulfanyl}phenol
- DOI:
- Publication date: 2014
- Journal:
- Impact factor: 0
- Authors: Yongle Zhang; Jing Wang
- Corresponding author: Jing Wang
Frank-Wolfe type methods for nonconvex inequality-constrained problems
- DOI:
- Publication date: 2021
- Journal:
- Impact factor: 0
- Authors: Liaoyuan Zeng; Yongle Zhang; Guoyin Li; Ting Kei Pong
- Corresponding author: Ting Kei Pong
Elucidating structure-activity relationships in CdS/ZnO heterojunctions for synergistic adsorption-photocatalysis of uranium (VI) removal
- DOI: 10.1016/j.seppur.2025.133806
- Publication date: 2025-12-05
- Journal:
- Impact factor: 9.000
- Authors: Yihao Quan; Sen Lu; Qingliang Wang; Hongqiang Wang; Eming Hu; Xi Xin; Yizhe Su; Yongle Zhang; Jiacheng Bao
- Corresponding author: Jiacheng Bao
Relaxation Inertial Projection Algorithms for Solving Monotone Variational Inequality Problems
- DOI:
- Publication date:
- Journal:
- Impact factor: 0.2
- Authors: Yan Zhang; Denglian Yang; Yongle Zhang
- Corresponding author: Yongle Zhang
Retraction-based first-order feasible methods for difference-of-convex programs with smooth inequality and simple geometric constraints
- DOI: https://doi.org/10.1007/s10444-022-10002-2
- Publication date: 2023
- Journal:
- Impact factor:
- Authors: Yongle Zhang; Guoyin Li; Ting Kei Pong; Shiqi Xu
- Corresponding author: Shiqi Xu
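Similar NSFC (National Natural Science Foundation of China) grants
Molecular mechanism by which cholesterol hydroxylase CH25H promotes degradation of the hepatitis B virus Core and Pre-core proteins independently of its enzymatic activity
- Award number: 82371765
- Approval year: 2023
- Funding amount: CNY 500,000
- Project type: General Program
Development of 5f-in-core GTH pseudopotentials and basis sets for actinide elements
- Award number: 22303037
- Approval year: 2023
- Funding amount: CNY 300,000
- Project type: Young Scientists Fund
Mechanistic study of Core-matched prodrug co-assemblies built on a synthetic lethality strategy to overcome tumor drug resistance
- Award number:
- Approval year: 2022
- Funding amount: CNY 520,000
- Project type:
Mechanism by which the Salmonella Typhimurium LPS core promotes dendritic cell migration via CD209/SphK1 and aggravates inflammatory bowel disease
- Award number:
- Approval year: 2022
- Funding amount: CNY 300,000
- Project type: Young Scientists Fund
Application and mechanism of a core-shell synchronous-vascularization bone tissue engineering strategy based on precise exosome regulation
- Award number:
- Approval year: 2020
- Funding amount: CNY 550,000
- Project type:
Precise preparation and functional exploration of Core M3 mannosyl peptides of dystroglycan
- Award number: 92053110
- Approval year: 2020
- Funding amount: CNY 700,000
- Project type: Major Research Plan
Molecular mechanism by which deficiency of Core 1 O-glycan mucins induces gastritis and mediates the progression of chronic gastritis to gastric cancer
- Award number: 81902805
- Approval year: 2019
- Funding amount: CNY 205,000
- Project type: Young Scientists Fund
Core-merging giant impacts in the late accretion stage of the proto-Earth: new insights into core accretion, core-mantle equilibration, and core-mantle boundary structure
- Award number: 41973063
- Approval year: 2019
- Funding amount: CNY 650,000
- Project type: General Program
Workshop on CORDEX-CORE regional climate simulation and projection
- Award number: 41981240365
- Approval year: 2019
- Funding amount: CNY 15,000
- Project type: International (Regional) Cooperation and Exchange Program
RBM38 regulates HBV replication by assisting Pol-ε binding and recruiting core
- Award number: 31900138
- Approval year: 2019
- Funding amount: CNY 240,000
- Project type: Young Scientists Fund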
Similar overseas grants
CNS Core: Small: Core Scheduling Techniques and Programming Abstractions for Scalable Serverless Edge Computing Engine
- Award number: 2322919
- Fiscal year: 2024
- Funding amount: $600,000
- Project type: Standard Grant
CNS Core: Small: Network Wide Sensing by Leveraging Cellular Communication Networks
- Award number: 2343469
- Fiscal year: 2024
- Funding amount: $600,000
- Project type: Standard Grant
CNS Core: Small: Intelligent Fault Injection to Expose and Reproduce Production-Grade Bugs in Cloud Systems
- Award number: 2317698
- Fiscal year: 2023
- Funding amount: $600,000
- Project type: Standard Grant
CNS Core: Small: Repurposing Smartphones to Minimize Carbon
- Award number: 2233894
- Fiscal year: 2023
- Funding amount: $600,000
- Project type: Standard Grant
Collaborative Research: CNS Core: Small: A Compilation System for Mapping Deep Learning Models to Tensorized Instructions (DELITE)
- Award number: 2230945
- Fiscal year: 2023
- Funding amount: $600,000
- Project type: Standard Grant
Collaborative Research: NSF-AoF: CNS Core: Small: Towards Scalable and AI-based Solutions for Beyond-5G Radio Access Networks
- Award number: 2225578
- Fiscal year: 2023
- Funding amount: $600,000
- Project type: Standard Grant
CNS Core: Small: Toward Opportunistic, Fast, and Robust In-Cache AI Acceleration at the Edge
- Award number: 2228028
- Fiscal year: 2023
- Funding amount: $600,000
- Project type: Standard Grant
Collaborative Research: CNS Core: Small: SmartSight: an AI-Based Computing Platform to Assist Blind and Visually Impaired People
- Award number: 2418188
- Fiscal year: 2023
- Funding amount: $600,000
- Project type: Standard Grant
CNS Core: Small: Redesigning I/O Across Heterogeneous Systems
- Award number: 2231724
- Fiscal year: 2023
- Funding amount: $600,000
- Project type: Standard Grant
Collaborative Research: CNS Core: Small: Creating An Extensible Internet Through Interposition
- Award number: 2242503
- Fiscal year: 2023
- Funding amount: $600,000
- Project type: Standard Grant