CAREER: Towards Gray-Fault Tolerant Cloud through Harnessing and Enhancing System Observability


Project Abstract

Cloud systems are crucial infrastructure for many of today's services. Ensuring that cloud software runs continuously without disruption is both vital and challenging. Decades of research have produced mature techniques to detect and mask faults in distributed systems, but these techniques often rely on a simple model that assumes a system component either works or stops completely. Numerous real-world cloud incidents, however, suggest that production cloud systems frequently experience gray failures: a degraded operational mode in which a system component appears to be working but is in fact severely impaired. Current solutions cannot deal with gray failures effectively. The overall objective of this proposal is to develop a holistic approach to detect, pinpoint, and diagnose gray failures in production cloud systems. To realize this objective, four synergistic research activities are proposed. Specifically, the project conducts a study of real-world gray failure cases in popular distributed systems, and measures and characterizes the observability of existing systems. The project then designs a novel hybrid analysis that automatically inserts report-generation hooks across the whole system stack to harness observability for detecting gray failures. To pinpoint the culprit component, the project further proposes algorithms that infer causality from the collected observations. Lastly, the project designs a runtime checking framework for increasing observability and diagnosing gray failures online. Gray failures are a common cause of cloud service outages, resulting in significant financial loss. This project can improve our understanding of gray failures and help detect and debug them, reducing their impact on ubiquitous cloud infrastructure.
Software is becoming more distributed, with increasingly subtle failure modes. Observability, fault detection, and localization are critical skills for this paradigm shift but are rarely covered in existing curricula. This project addresses this educational gap through curriculum development and student training. It also promotes computer science education among underrepresented Baltimore high school students by partnering with a non-profit organization, Code in the Schools, to organize workshops that showcase cloud and system failure concepts to local high school students.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Outcomes

Journal articles (2)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Pushing Performance Isolation Boundaries into Application with pBox
Simplifying Cloud Management with Cloudless Computing
  • DOI:
    10.1145/3626111.3628206
  • Published:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Qiu, Yiming;Kon, Patrick Tser;Xing, Jiarong;Huang, Yibo;Liu, Hongyi;Wang, Xinyu;Huang, Peng;Chowdhury, Mosharaf;Chen, Ang
  • Corresponding author:
    Chen, Ang

Other publications by Peng Huang

New Method to Measure the Fill Level of the Ball Mill I-Theoretical Analysis and DEM Simulation
OsMYB516 encoding a MYB transcriptional activator is involved in abiotic stress and circadian rhythm in rice
  • DOI:
  • Published:
  • Journal:
  • Impact factor:
    0
  • Authors:
    Min Duan;Peng Huang;Xi Yuan;Hui Chen;Ji HUANG;Hongsheng Zhang
  • Corresponding author:
    Hongsheng Zhang
Modelling and compound control of intelligently dielectric elastomer actuator
  • DOI:
    10.1016/j.conengprac.2022.105261
  • Published:
    2022-09
  • Journal:
  • Impact factor:
    4.9
  • Authors:
    Yawu Wang;Peng Huang;Jundong Wu;Chun-Yi Su
  • Corresponding author:
    Chun-Yi Su
Tailoring the cationic and anionic sites of LaFeO3-based perovskite generates multiple vacancies for efficient water oxidation
  • DOI:
    10.1039/d1ta03604a
  • Published:
    2021-08
  • Journal:
  • Impact factor:
    11.9
  • Authors:
    Paul Blessington Selva;Tuzhi Xiong;Peng Huang;Qirong Tan;Yongchao Huang;Hao Yang;M.-Sadeeq Balogun
  • Corresponding author:
    M.-Sadeeq Balogun
Chinese open information extraction based on DBMCSS in the field of national information resources
  • DOI:
  • Published:
    2018
  • Journal:
  • Impact factor:
    1.9
  • Authors:
    Jianhou Gan;Peng Huang;Juxiang Zhou;Bin Wen
  • Corresponding author:
    Bin Wen

Other grants by Peng Huang

CNS Core: Small: Intelligent Fault Injection to Expose and Reproduce Production-Grade Bugs in Cloud Systems
  • Award number:
    2317698
  • Fiscal year:
    2023
  • Award amount:
    $609,500
  • Project type:
    Standard Grant
FMitF: Track I: Synthesizing Semantic Checkers for Runtime Verification of Production Distributed Systems
  • Award number:
    2318937
  • Fiscal year:
    2023
  • Award amount:
    $609,500
  • Project type:
    Standard Grant
CNS Core: Small: Intelligent Fault Injection to Expose and Reproduce Production-Grade Bugs in Cloud Systems
  • Award number:
    2149664
  • Fiscal year:
    2021
  • Award amount:
    $609,500
  • Project type:
    Standard Grant
CAREER: Towards Gray-Fault Tolerant Cloud through Harnessing and Enhancing System Observability
  • Award number:
    1942794
  • Fiscal year:
    2020
  • Award amount:
    $609,500
  • Project type:
    Continuing Grant
CRII: CSR: Toward Understanding and Automatically Detecting Specious Configuration in Large Systems
  • Award number:
    1755737
  • Fiscal year:
    2018
  • Award amount:
    $609,500
  • Project type:
    Standard Grant

Similar overseas grants

Sexual offence interviewing: Towards victim-survivor well-being and justice
  • Award number:
    DE240100109
  • Fiscal year:
    2024
  • Award amount:
    $609,500
  • Project type:
    Discovery Early Career Researcher Award
Unlocking the sensory secrets of predatory wasps: towards predictive tools for managing wasps' ecosystem services in the Anthropocene
  • Award number:
    NE/Y001397/1
  • Fiscal year:
    2024
  • Award amount:
    $609,500
  • Project type:
    Research Grant
Development of programmable nanomachines towards the enzymatic synthesis of peptide oligonucleotide conjugates
  • Award number:
    EP/X019624/1
  • Fiscal year:
    2024
  • Award amount:
    $609,500
  • Project type:
    Fellowship
Postdoctoral Fellowship: STEMEdIPRF: Towards a Diverse Professoriate: Experiences that Inform Underrepresented Scholars' Perceptions of Value Alignment and Career Decisions
  • Award number:
    2327411
  • Fiscal year:
    2024
  • Award amount:
    $609,500
  • Project type:
    Standard Grant
CAREER: Adaptive Deep Learning Systems Towards Edge Intelligence
  • Award number:
    2338512
  • Fiscal year:
    2024
  • Award amount:
    $609,500
  • Project type:
    Continuing Grant
CAREER: Towards highly efficient UV emitters with lattice engineered substrates
  • Award number:
    2338683
  • Fiscal year:
    2024
  • Award amount:
    $609,500
  • Project type:
    Continuing Grant
ASCENT: Heterogeneously Integrated and AI-Empowered Millimeter-Wave Wide-Bandgap Transmitter Array towards Energy- and Spectrum-Efficient Next-G Communications
  • Award number:
    2328281
  • Fiscal year:
    2024
  • Award amount:
    $609,500
  • Project type:
    Standard Grant
Collaborative Research: Maritime to Inland Transitions Towards ENvironments for Convection Initiation (MITTEN CI)
  • Award number:
    2349935
  • Fiscal year:
    2024
  • Award amount:
    $609,500
  • Project type:
    Continuing Grant
Collaborative Research: Maritime to Inland Transitions Towards ENvironments for Convection Initiation (MITTEN CI)
  • Award number:
    2349934
  • Fiscal year:
    2024
  • Award amount:
    $609,500
  • Project type:
    Continuing Grant
NSF-BSF: Towards a Molecular Understanding of Dynamic Active Sites in Advanced Alkaline Water Oxidation Catalysts
  • Award number:
    2400195
  • Fiscal year:
    2024
  • Award amount:
    $609,500
  • Project type:
    Standard Grant