CSR-DMSS, PSCE: Collaborative Research: Scalable Resilience in Large-Scale Systems

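Basic information

  • Award number:
    0834483
  • Principal investigator:
  • Amount:
    $300,000
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Continuing Grant
  • Fiscal year:
    2008
  • Funding country:
    United States
  • Project period:
    2008-09-01 to 2012-08-31
  • Project status:
    Completed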

Project abstract

The objective of this project is to systematically design means of obtaining, standardizing, and manipulating quantified Reliability, Availability and Serviceability (RAS) information from extreme-scale High Performance Computing (HPC) distributions, and to develop a novel, scalable framework for the real-time RAS monitoring and modeling of these systems via the research and creation of an optimal feedback control loop encompassing the entire computational environment. This work is necessitated by the continual and substantial increase in the size and scope of HPC systems, which is causing rapid inflation in the number of faults, errors, and other performance interruptions encountered by these machines. As HPC systems move towards the petaflop era, a greater focus must be placed on these performance interruptions and on the development of means by which the machines may continue uninterrupted computation. In this extreme-scale environment, efforts aimed towards maintaining high reliability and uptime are futile: with their enormous processor and computational unit counts, these systems will inevitably encounter performance issues, and failure must be expected. This project aims to 1) research and develop advanced, standardized methodologies for gathering application- and system-level data and generating quantifiable RAS metrics, 2) provide a novel, scalable solution for improving accuracy in reliably predicting imminent node-wise and system failures in large-scale systems, and 3) devise defensive and proactive techniques for reducing the computational costs required to handle resilience issues and model system health in a timely and accurate manner. In summary, this work attempts to alleviate the time and cost limitations of contemporary, reactive fault tolerance schemes, and will advance the development of scalable, proactive, and intelligent resilience provision in large-scale computing deployments.
In addition, the Resilience Consortium will be established to synergistically research and develop, share data and findings, and disseminate knowledge to the public.

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Chokchai Leangsuksun

Similar overseas grants

DMSS: a Dark Matter Summer School
  • Award number:
    1806341
  • Fiscal year:
    2018
  • Award amount:
    $300,000
  • Award type:
    Standard Grant
CSR-DMSS,SM: Cooperative Activity Analysis in Wireless Smart-Camera Networks (Wi-SCaNs)
  • Award number:
    1205458
  • Fiscal year:
    2011
  • Award amount:
    $300,000
  • Award type:
    Standard Grant
CSR-DMSS, SM: ConfVeal: Automated Testing of Security Configuration Enforcement in Distributed Networks
  • Award number:
    1019223
  • Fiscal year:
    2010
  • Award amount:
    $300,000
  • Award type:
    Standard Grant
CSR-DMSS, SM, Harmony: Efficient Integrated Resource/Trust Management in Large-Scale Distributed Systems
  • Award number:
    1025649
  • Fiscal year:
    2009
  • Award amount:
    $300,000
  • Award type:
    Standard Grant
CSR-DMSS-SM: Skeptical Systems
  • Award number:
    0834392
  • Fiscal year:
    2008
  • Award amount:
    $300,000
  • Award type:
    Continuing Grant
CSR-DMSS, SM: Energy-Efficient and Reliability-Aware Data Management in Mobile Storage Systems
  • Award number:
    0834466
  • Fiscal year:
    2008
  • Award amount:
    $300,000
  • Award type:
    Continuing Grant
CSR-DMSS, SM: View Control Management in Geographically Distributed Tele-Immersive Environments
  • Award number:
    0834480
  • Fiscal year:
    2008
  • Award amount:
    $300,000
  • Award type:
    Continuing Grant
CSR-DMSS,TM: Distributed Computing With an Ad-Hoc Network
  • Award number:
    0834582
  • Fiscal year:
    2008
  • Award amount:
    $300,000
  • Award type:
    Continuing Grant
Collaborative Research: CSR-DMSS: On-road Real-time Information Systems for driving safety atop VANET-WSM symbiosis
  • Award number:
    0834585
  • Fiscal year:
    2008
  • Award amount:
    $300,000
  • Award type:
    Standard Grant
DMSS: Network Coding Techniques for Scalable Distributed Media Storage and Streaming
  • Award number:
    0834775
  • Fiscal year:
    2008
  • Award amount:
    $300,000
  • Award type:
    Continuing Grant