Collaborative Research: CSR---SMA+AES: PROGNOSIS to Enhance the Runtime Health of Large Scale Parallel Systems
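Basic information
- Award number: 0509164
- Principal investigator:
- Amount: $80,000
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2005
- Funding country: United States
- Duration: 2005-08-01 to 2006-07-31
- Project status: Completed
- Source:
- Keywords:
Project abstract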
Large scale parallel systems are critical to our computational infrastructure to take on the challenges imposed by applications whose scale and demands exceed the capabilities of machines available in the market today. Pushing the limits of hardware and software technologies to extract the maximum performance, in turn, exacerbates other problems. Notable amongst these problems is the susceptibility to failures, which arises as a consequence of growing hardware transient errors, hardware device failures, software complexity, and the complex hardware/software inter-dependencies between the nodes of a parallel system. These failures can have substantial consequences on system performance, in addition to impacting the costs of maintenance/operation, thereby putting at risk the very motivation behind deploying these large scale systems.
This research is expected to make three broad contributions towards developing a runtime infrastructure, called PROGNOSIS, for failure data collection and online analysis. The first set of contributions will be on collecting and analyzing system events and failure data from an actual BlueGene/L system over an extended period of time. In addition to presenting the raw system events, the research will develop filtering techniques to remove unimportant information and identify stationary intervals, and will define the attributes to log and their logging frequency. The second set of contributions will be models for online analysis and prediction of evolving failure data by exploiting correlations between system events over time, across the nodes, and with respect to external factors such as imposed workload and operating temperature. The third set of contributions will be on demonstrating the uses of PROGNOSIS. Tools such as PROGNOSIS can help substantially in the development of self-healing systems, which several computer vendors have noted to be an important goal in the emerging area of Autonomic Computing.
Project outputs
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Yanyong Zhang
BiFocus: using radio-optical beacons for an augmented reality search application
- DOI: 10.1145/2462456.2465706
- Published: 2013
- Journal:
- Impact factor: 4.9
- Authors: A. Ashok; Chenren Xu; Tam N. Vu; M. Gruteser; R. Howard; Yanyong Zhang; N. Mandayam; Wenjia Yuan; Kristin J. Dana
- Corresponding author: Kristin J. Dana
LDP: A Local Diffusion Planner for Efficient Robot Navigation and Collision Avoidance
- DOI:
- Published: 2024
- Journal:
- Impact factor: 0
- Authors: Wenhao Yu; Jie Peng; Huanyu Yang; Junrui Zhang; Yifan Duan; Jianmin Ji; Yanyong Zhang
- Corresponding author: Yanyong Zhang
A Model of Passive Human Motion Recognition Using Two-Layer Wireless Links
- DOI: 10.1109/ithings.2014.48
- Published: 2014
- Journal:
- Impact factor: 0
- Authors: Minmin Gu; Ning An; Jinjun Liu; Yanyong Zhang
- Corresponding author: Yanyong Zhang
On the Cache-and-Forward Network Architecture
- DOI: 10.1109/icc.2009.5199249
- Published: 2009
- Journal:
- Impact factor: 0
- Authors: Lijun Dong; Hongbo Liu; Yanyong Zhang; S. Paul; D. Raychaudhuri
- Corresponding author: D. Raychaudhuri
Design Considerations for Applying ICN to IoT
- DOI:
- Published: 2019
- Journal:
- Impact factor: 0
- Authors: B. Ahlgren; Anders Lindgren; Yanyong Zhang; J. Burke; A. Azgin; L. Grieco; R. Ravindran
- Corresponding author: R. Ravindran
Other grants by Yanyong Zhang
NeTS: Small: Transmit Only: Green Communication for Dense Wireless Systems
- Award number: 1423020
- Fiscal year: 2014
- Amount: $80,000
- Project type: Standard Grant
CT - ISG: ROME: Robust Measurement in Sensor Networks
- Award number: 0831186
- Fiscal year: 2008
- Amount: $80,000
- Project type: Standard Grant
CAREER: PROSE: Providing Robustness in Systems of Embedded Sensors
- Award number: 0546072
- Fiscal year: 2006
- Amount: $80,000
- Project type: Continuing Grant
Collaborative Research: CSR-SMA+AES: Pro-Active Runtime Health Enhancement of Large-Scale Parallel Systems Using PROGNOSIS
- Award number: 0614976
- Fiscal year: 2006
- Amount: $80,000
- Project type: Continuing Grant
Similar National Natural Science Foundation of China (NSFC) grants
Research on Quantum Field Theory without a Lagrangian Description
- Award number: 24ZR1403900
- Approval year: 2024
- Amount: CNY 0
- Project type: Provincial/municipal project
Cell Research
- Award number: 31224802
- Approval year: 2012
- Amount: CNY 240,000
- Project type: Special fund project
Cell Research
- Award number: 31024804
- Approval year: 2010
- Amount: CNY 240,000
- Project type: Special fund project
Cell Research (细胞研究)
- Award number: 30824808
- Approval year: 2008
- Amount: CNY 240,000
- Project type: Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
- Award number: 10774081
- Approval year: 2007
- Amount: CNY 450,000
- Project type: General program
Similar overseas grants
Collaborative Research: CSR: Medium: Scaling Secure Serverless Computing on Heterogeneous Datacenters
- Award number: 2312206
- Fiscal year: 2023
- Amount: $80,000
- Project type: Continuing Grant
Collaborative Research: CSR: Medium: Architecting GPUs for Practical Homomorphic Encryption-based Computing
- Award number: 2312276
- Fiscal year: 2023
- Amount: $80,000
- Project type: Continuing Grant
Collaborative Research: CSR: Medium: Fortuna: Characterizing and Harnessing Performance Variability in Accelerator-rich Clusters
- Award number: 2312689
- Fiscal year: 2023
- Amount: $80,000
- Project type: Continuing Grant
Collaborative Research: CSR: Medium: Fortuna: Characterizing and Harnessing Performance Variability in Accelerator-rich Clusters
- Award number: 2401244
- Fiscal year: 2023
- Amount: $80,000
- Project type: Continuing Grant
Collaborative Research: CSR: Small: Caphammer: A New Security Exploit in Energy Harvesting Systems and its Countermeasures
- Award number: 2314681
- Fiscal year: 2023
- Amount: $80,000
- Project type: Continuing Grant
Collaborative Research: CSR: Small: Expediting Continual Online Learning on Edge Platforms through Software-Hardware Co-designs
- Award number: 2312157
- Fiscal year: 2023
- Amount: $80,000
- Project type: Standard Grant
Collaborative Research: CSR: Medium: Scaling Secure Serverless Computing on Heterogeneous Datacenters
- Award number: 2312207
- Fiscal year: 2023
- Amount: $80,000
- Project type: Continuing Grant
Collaborative Research: CSR: Medium: Adaptive Environmental Awareness for Collaborative Augmented Reality
- Award number: 2312760
- Fiscal year: 2023
- Amount: $80,000
- Project type: Continuing Grant
Collaborative Research: CSR: Small: Cross-layer learning-based Energy-Efficient and Resilient NoC design for Multicore Systems
- Award number: 2321224
- Fiscal year: 2023
- Amount: $80,000
- Project type: Standard Grant
Collaborative Research: CSR: Small: Caphammer: A New Security Exploit in Energy Harvesting Systems and its Countermeasures
- Award number: 2314680
- Fiscal year: 2023
- Amount: $80,000
- Project type: Continuing Grant