CSR-DMSS, PSCE: Collaborative Research: Scalable Resilience in Large-Scale Systems

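Basic Information

Award number:
0834483
Principal investigator:
Chokchai Leangsuksun
Amount:
$300,000
Host institution:
Louisiana Tech University
Host institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2008
Funding country:
United States
Project period:
2008-09-01 to 2012-08-31
Project status:
Completed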

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=0834483&HistoricalAwards=false
Keywords:
CSR DMSS PSCE Collaborative Research

Project Abstract

The objective of this project is to systematically design means of obtaining, standardizing, and manipulating quantified Reliability, Availability and Serviceability (RAS) information from extreme-scale High Performance Computing (HPC) distributions, and to develop a novel, scalable framework for the real-time RAS monitoring and modeling of these systems via the research and creation of an optimal feedback control loop encompassing the entire computational environment. This work is necessitated by the continual and substantial increase in the size and scope of HPC systems, which is causing rapid inflation in the number of faults, errors, and other performance interruptions encountered by these machines. As HPC systems move towards the petaflop era, a greater focus must be placed on these performance interruptions and on the development of means by which the systems may continue uninterrupted computation. In this extreme-scale environment, efforts aimed towards maintaining high reliability and uptime are futile: with their enormous processor and computational unit counts, these systems will inevitably encounter performance issues, and failure must be expected. This project aims to 1) research and develop advanced, standardized methodologies for gathering application- and system-level data and generating quantifiable RAS metrics, 2) provide a novel, scalable solution for improving accuracy in reliably predicting imminent node-wise and system failures in large-scale systems, and 3) devise defensive and proactive techniques for reducing the computational costs required to handle resilience issues and model system health in a timely and accurate manner. In summary, this work attempts to alleviate the time and cost limitations of contemporary, reactive fault tolerance schemes, and will advance the development of scalable, proactive, and intelligent resilience provision in large-scale computing deployments.
In addition, the Resilience Consortium will be established to synergistically research and develop, share data and findings, and disseminate knowledge to the public.
