Collaborative Research: CSR-SMA+AES: PROGNOSIS to Enhance the Runtime Health of Large Scale Parallel Systems

Project abstract

Large-scale parallel systems are critical to our computational infrastructure: they take on applications whose scale and demands exceed the capabilities of machines available on the market today. Pushing the limits of hardware and software technologies to extract maximum performance, in turn, exacerbates other problems. Notable among these is susceptibility to failures, which arises from growing hardware transient errors, hardware device failures, software complexity, and the complex hardware/software interdependencies between the nodes of a parallel system. These failures can have substantial consequences for system performance and raise the costs of maintenance and operation, thereby putting at risk the very motivation behind deploying these large-scale systems.

This research is expected to make three broad contributions towards developing a runtime infrastructure, called PROGNOSIS, for failure data collection and online analysis. The first set of contributions will be on collecting and analyzing system events and failure data from an actual BlueGene/L system over an extended period of time. In addition to presenting the raw system events, the research will develop filtering techniques to remove unimportant information and identify stationary intervals, and will define the attributes to log and their logging frequency. The second set of contributions will be models for online analysis and prediction of evolving failure data by exploiting correlations between system events over time, across the nodes, and with respect to external factors such as imposed workload and operating temperature. The third set of contributions will demonstrate the uses of PROGNOSIS. Tools such as PROGNOSIS can help substantially in the development of self-healing systems, which several computer vendors have noted to be an important goal in the emerging area of Autonomic Computing.

Project outcomes

Journal papers (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Anand Sivasubramaniam

Network-Based Parallel Computing. Communication, Architecture, and Applications
  • DOI:
    10.1007/10704826
  • Year published:
    1999
  • Journal:
  • Impact factor:
    0
  • Authors:
    Anand Sivasubramaniam;Mario Lauria
  • Corresponding author:
    Mario Lauria

Other grants held by Anand Sivasubramaniam

FoMR: Shrinking the Control and Data Flow Latencies of Single Thread Executions for Emerging Workloads
  • Award number:
    1912495
  • Fiscal year:
    2019
  • Award amount:
    $100,000
  • Project category:
    Standard Grant
SHF:Small: Integrated Hardware-Software Power Regulation, Allocation and Isolation in Consolidated Servers
  • Award number:
    1714389
  • Fiscal year:
    2017
  • Award amount:
    $100,000
  • Project category:
    Standard Grant
SHF: Small: Virtualizing Coordinated Resource Management of Flows on Handhelds with VIADUCT
  • Award number:
    1526750
  • Fiscal year:
    2015
  • Award amount:
    $100,000
  • Project category:
    Standard Grant
CSR: Medium: Provisioning and Harnessing Energy Storage for Datacenter Demand Response
  • Award number:
    1302225
  • Fiscal year:
    2013
  • Award amount:
    $100,000
  • Project category:
    Continuing Grant
Collaborative Research: Application-adaptive I/O Stack for Data-intensive Scientific Computing
  • Award number:
    0621427
  • Fiscal year:
    2006
  • Award amount:
    $100,000
  • Project category:
    Standard Grant
Collaborative Research: CSR---SMA+AES: Pro-Active Runtime Health Enhancement of Large-Scale Parallel Systems Using PROGNOSIS
  • Award number:
    0615097
  • Fiscal year:
    2006
  • Award amount:
    $100,000
  • Project category:
    Continuing Grant
HECURA: Exploiting Asymmetry in Performance and Security Requirements for I/O in High-end Computing
  • Award number:
    0621429
  • Fiscal year:
    2006
  • Award amount:
    $100,000
  • Project category:
    Standard Grant
Tools and Techniques for Integrated Power Management of Server Disks
  • Award number:
    0429500
  • Fiscal year:
    2004
  • Award amount:
    $100,000
  • Project category:
    Continuing Grant
ITR: Data-Driven Autonomic Performance Modulation for Servers
  • Award number:
    0325056
  • Fiscal year:
    2003
  • Award amount:
    $100,000
  • Project category:
    Continuing Grant
CISE Research Resources: From High Performance to Low Power: Infrastructure for Ubiquitous Computing
  • Award number:
    0130143
  • Fiscal year:
    2002
  • Award amount:
    $100,000
  • Project category:
    Standard Grant

Similar NSFC (National Natural Science Foundation of China) grants

Research on Quantum Field Theory without a Lagrangian Description
  • Award number:
    24ZR1403900
  • Year approved:
    2024
  • Award amount:
    CNY 0
  • Project category:
    Provincial/municipal project
Cell Research
  • Award number:
    31224802
  • Year approved:
    2012
  • Award amount:
    CNY 240,000
  • Project category:
    Special fund project
Cell Research
  • Award number:
    31024804
  • Year approved:
    2010
  • Award amount:
    CNY 240,000
  • Project category:
    Special fund project
Cell Research (细胞研究)
  • Award number:
    30824808
  • Year approved:
    2008
  • Award amount:
    CNY 240,000
  • Project category:
    Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
  • Award number:
    10774081
  • Year approved:
    2007
  • Award amount:
    CNY 450,000
  • Project category:
    General Program

Similar overseas grants

Collaborative Research: CSR: Medium: Scaling Secure Serverless Computing on Heterogeneous Datacenters
  • Award number:
    2312206
  • Fiscal year:
    2023
  • Award amount:
    $100,000
  • Project category:
    Continuing Grant
Collaborative Research: CSR: Medium: Architecting GPUs for Practical Homomorphic Encryption-based Computing
  • Award number:
    2312276
  • Fiscal year:
    2023
  • Award amount:
    $100,000
  • Project category:
    Continuing Grant
Collaborative Research: CSR: Medium: Fortuna: Characterizing and Harnessing Performance Variability in Accelerator-rich Clusters
  • Award number:
    2312689
  • Fiscal year:
    2023
  • Award amount:
    $100,000
  • Project category:
    Continuing Grant
Collaborative Research: CSR: Medium: Fortuna: Characterizing and Harnessing Performance Variability in Accelerator-rich Clusters
  • Award number:
    2401244
  • Fiscal year:
    2023
  • Award amount:
    $100,000
  • Project category:
    Continuing Grant
Collaborative Research: CSR: Small: Caphammer: A New Security Exploit in Energy Harvesting Systems and its Countermeasures
  • Award number:
    2314681
  • Fiscal year:
    2023
  • Award amount:
    $100,000
  • Project category:
    Continuing Grant
Collaborative Research: CSR: Small: Expediting Continual Online Learning on Edge Platforms through Software-Hardware Co-designs
  • Award number:
    2312157
  • Fiscal year:
    2023
  • Award amount:
    $100,000
  • Project category:
    Standard Grant
Collaborative Research: CSR: Small: Cross-layer learning-based Energy-Efficient and Resilient NoC design for Multicore Systems
  • Award number:
    2321224
  • Fiscal year:
    2023
  • Award amount:
    $100,000
  • Project category:
    Standard Grant
Collaborative Research: CSR: Medium: Scaling Secure Serverless Computing on Heterogeneous Datacenters
  • Award number:
    2312207
  • Fiscal year:
    2023
  • Award amount:
    $100,000
  • Project category:
    Continuing Grant
Collaborative Research: CSR: Medium: Adaptive Environmental Awareness for Collaborative Augmented Reality
  • Award number:
    2312760
  • Fiscal year:
    2023
  • Award amount:
    $100,000
  • Project category:
    Continuing Grant
Collaborative Research: CSR: Small: Caphammer: A New Security Exploit in Energy Harvesting Systems and its Countermeasures
  • Award number:
    2314680
  • Fiscal year:
    2023
  • Award amount:
    $100,000
  • Project category:
    Continuing Grant