Collaborative Research: CISE: Large: Cross-Layer Resilience to Silent Data Corruption

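Basic Information

  • Award number:
    2321490
  • Principal investigator:
  • Amount:
    $937,500
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Continuing Grant
  • Fiscal year:
    2023
  • Funding country:
    United States
  • Period:
    2023-10-01 to 2028-09-30
  • Project status:
    Ongoing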

Project Abstract

Hyperscalers (i.e., large cloud service providers) are reporting frequent silent data corruptions (or SDCs) within their datacenter infrastructures. SDCs are software errors for which the only symptom is an incorrect result. Remarkably, SDCs at-scale exhibit error occurrence rates on the order of one thousand faults per one million devices. Meanwhile, hardware manufacturers strive to achieve one hundred and close to zero defective parts per million for the commercial and automotive domains, respectively. This discrepancy between manufacturers’ goals and hyperscalers’ observations suggests that SDCs are a real threat to the reliability of all modern computing systems, and by extension their security and sustainability. This project explores whether it is possible to cooperatively design testing, detection, and mitigation approaches for SDCs that minimize performance impact on software applications, as well as additional carbon footprint expenditures associated with manufacturing and running computing systems. The project’s key novelties include: (1) leveraging reoccurring computational primitives in software (e.g., matrix multiplication in popular machine learning applications) and modern special-purpose hardware (e.g., Artificial Intelligence processors) to design domain-specific SDC solutions; (2) exploiting the fact that SDC testing can be performed throughout a device’s lifetime in the datacenter rather than for a few seconds to minutes — a strict limitation on the manufacturing test floor; (3) considering sustainability and carbon footprint as a core design metric. This project’s core impact will be a critical improvement in reliability and security for the countless applications to which we entrust computing systems today. A secondary core impact is an improvement in the longevity of computing devices, which has significant positive implications for sustainable computing. The research team will also train students and work with industry partners. 
To address the SDC challenge, the research team pursues four synergistic research thrusts that cut across diverse domains: Silicon Devices, Computer Architecture, Software, and Algorithms. Within each thrust, the team will study the SDC challenge through the lenses of: Testing, Detection, Mitigation, and Security implications. Thrust 1 explores device-level testing through novel test pattern metrics and continuous scan test deployment. Thrust 2 studies system-level testing (improving error detection latency and test coverage and adapting tests to be more representative of datacenter workloads), core-specific testing, defect characterization, hardware support for testing and mitigation, and system security implications. Thrust 3 investigates software detection and mitigation through (partial) redundancy, appropriate scan and system-level test scheduling, test-application fusion (where applications test themselves), and software security hardening against defect-induced vulnerabilities. Thrust 4 pursues algorithmic detection and mitigations with a particular emphasis on enabling robust non-linear computation for important datacenter workloads, like neural networks. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
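
As a rough illustration of what partial redundancy for a recurring primitive such as matrix multiplication can look like, the sketch below uses a well-known checksum idea (often called algorithm-based fault tolerance): the column sums of the product are predicted from the inputs and compared with the column sums of the result actually returned. This is not taken from the award abstract and is not presented as the project's method; the names `checked_matmul` and `faulty_matmul` and the tolerance value are hypothetical and chosen only for illustration.

```python
import math

def matmul(a, b):
    """Plain triple-loop matrix product of two lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def checked_matmul(a, b, multiply=matmul, tol=1e-9):
    """Return (C, ok); ok is False when the column-sum check fails."""
    c = multiply(a, b)
    inner, cols = len(b), len(b[0])
    # Column sums of A, i.e. the row vector (1^T A).
    col_sums_a = [sum(row[k] for row in a) for k in range(inner)]
    # Expected column sums of C, computed from the inputs only: (1^T A) B.
    expected = [sum(col_sums_a[k] * b[k][j] for k in range(inner))
                for j in range(cols)]
    # Observed column sums of the product that was actually returned.
    observed = [sum(row[j] for row in c) for j in range(cols)]
    ok = all(math.isclose(e, o, rel_tol=tol, abs_tol=tol)
             for e, o in zip(expected, observed))
    return c, ok

def faulty_matmul(a, b):
    """Hypothetical stand-in for a defective unit: corrupts one output element."""
    c = matmul(a, b)
    c[0][0] += 1
    return c

if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(checked_matmul(a, b)[1])                          # check on the correct product
    print(checked_matmul(a, b, multiply=faulty_matmul)[1])  # check on the corrupted product
```

The check adds work proportional to the size of the matrices rather than repeating the full multiplication, which is why it counts as partial rather than full redundancy. It can miss corruptions that happen to cancel within a column, so it is a detection aid under these assumptions, not a guarantee.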

Project Outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Baris Kasikci

Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
  • DOI:
    10.48550/arxiv.2310.19102
  • Published:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Yilong Zhao;Chien;Kan Zhu;Zihao Ye;Lequn Chen;Size Zheng;Luis Ceze;Arvind Krishnamurthy;Tianqi Chen;Baris Kasikci
  • Corresponding author:
    Baris Kasikci
Optimal and Error-Free Multi-Valued Byzantine Consensus Through Parallel Execution
  • DOI:
  • Published:
    2020
  • Journal:
  • Impact factor:
    0
  • Authors:
    Andrew D. Loveless;R. Dreslinski;Baris Kasikci
  • Corresponding author:
    Baris Kasikci
A Hypervisor for Shared-Memory FPGA Platforms
Holistic defenses against microarchitectural attacks
  • DOI:
  • Published:
    2021
  • Journal:
  • Impact factor:
    0
  • Authors:
    Baris Kasikci;Kevin Loughlin
  • Corresponding author:
    Kevin Loughlin
Towards Bug-free Persistent Memory Applications
  • DOI:
  • Published:
    2020
  • Journal:
  • Impact factor:
    0
  • Authors:
    Ian Neal;Andrew Quinn;Baris Kasikci
  • Corresponding author:
    Baris Kasikci

Other grants of Baris Kasikci

Collaborative Research: FoMR: Taming the Instruction Bottleneck in Modern Datacenter Applications
  • Award number:
    2346057
  • Fiscal year:
    2023
  • Amount:
    $937,500
  • Project type:
    Standard Grant
CAREER: Leveraging Everyday Usage of Programs to Eliminate Bugs
  • Award number:
    2333885
  • Fiscal year:
    2023
  • Amount:
    $937,500
  • Project type:
    Continuing Grant
CAREER: Leveraging Everyday Usage of Programs to Eliminate Bugs
  • Award number:
    1942218
  • Fiscal year:
    2020
  • Amount:
    $937,500
  • Project type:
    Continuing Grant
Collaborative Research: FoMR: Taming the Instruction Bottleneck in Modern Datacenter Applications
  • Award number:
    2010810
  • Fiscal year:
    2020
  • Amount:
    $937,500
  • Project type:
    Standard Grant

Similar NSFC grants

Research on Quantum Field Theory without a Lagrangian Description
  • Award number:
    24ZR1403900
  • Approval year:
    2024
  • Amount:
    CNY 0
  • Project type:
    Provincial/municipal project
Cell Research
  • Award number:
    31224802
  • Approval year:
    2012
  • Amount:
    CNY 240,000
  • Project type:
    Special fund project
Cell Research
  • Award number:
    31024804
  • Approval year:
    2010
  • Amount:
    CNY 240,000
  • Project type:
    Special fund project
Cell Research
  • Award number:
    30824808
  • Approval year:
    2008
  • Amount:
    CNY 240,000
  • Project type:
    Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
  • Award number:
    10774081
  • Approval year:
    2007
  • Amount:
    CNY 450,000
  • Project type:
    General Program

Similar overseas grants

Collaborative Research: CISE: Large: Cross-Layer Resilience to Silent Data Corruption
  • Award number:
    2321492
  • Fiscal year:
    2023
  • Amount:
    $937,500
  • Project type:
    Continuing Grant
Collaborative Research: CISE: Large: Integrated Networking, Edge System and AI Support for Resilient and Safety-Critical Tele-Operations of Autonomous Vehicles
  • Award number:
    2321531
  • Fiscal year:
    2023
  • Amount:
    $937,500
  • Project type:
    Continuing Grant
Collaborative Research: Conference: 2023 CISE Education and Workforce PI and Community Meeting
  • Award number:
    2318593
  • Fiscal year:
    2023
  • Amount:
    $937,500
  • Project type:
    Standard Grant
Collaborative Research: Conference: 2023 CISE Education and Workforce PI and Community Meeting
  • Award number:
    2318592
  • Fiscal year:
    2023
  • Amount:
    $937,500
  • Project type:
    Standard Grant
Collaborative Research: CISE-MSI: RCBP-ED: CCRI: TechHouse Partnership to Increase the Computer Engineering Research Expansion at Morehouse College
  • Award number:
    2318703
  • Fiscal year:
    2023
  • Amount:
    $937,500
  • Project type:
    Standard Grant
Collaborative Research: CISE: Large: Integrated Networking, Edge System and AI Support for Resilient and Safety-Critical Tele-Operations of Autonomous Vehicles
  • Award number:
    2321532
  • Fiscal year:
    2023
  • Amount:
    $937,500
  • Project type:
    Continuing Grant
Collaborative Research: CISE: Large: Systems Support for Run-Anywhere Serverless
  • Award number:
    2321725
  • Fiscal year:
    2023
  • Amount:
    $937,500
  • Project type:
    Continuing Grant
Collaborative Research: CISE-MSI: RCBP-RF: CPS: Socially Informed Traffic Signal Control for Improving Near Roadway Air Quality
  • Award number:
    2318696
  • Fiscal year:
    2023
  • Amount:
    $937,500
  • Project type:
    Standard Grant
Collaborative Research: CISE-MSI: DP: OAC: Integrated and Extensible Platform for Rethinking the Security of AI-assisted UAV Paradigm
  • Award number:
    2318711
  • Fiscal year:
    2023
  • Amount:
    $937,500
  • Project type:
    Standard Grant
Collaborative Research: CISE-MSI: DP: IIS: Event Detection and Knowledge Extraction via Learning and Causality Analysis for Resilience Emergency Response
  • Award number:
    2219615
  • Fiscal year:
    2023
  • Amount:
    $937,500
  • Project type:
    Standard Grant