Collaborative Research: EAGER: Real-time Strategies and Synchronized Time Distribution Mechanisms for Enhanced Exascale Performance-Portability and Predictability
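Basic Information
- Award number: 2405142
- Principal investigator: Anthony Skjellum
- Amount: $74,500
- Institution country: United States
- Award type: Standard Grant
- Fiscal year: 2023
- Funding country: United States
- Duration: 2023-10-01 to 2024-05-31
- Status: Completed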
Project Abstract
Advances throughout science and engineering have for several decades been driven by High Performance Computing (HPC), with the pace of discovery accelerating in concert with continued innovation in computing capabilities. But now that semiconductor technology faces fundamental physical limits, even as large-scale systems reach warehouse scale, new approaches are becoming essential to using computing resources efficiently. In particular, given this divergence of scales, HPC systems have necessarily become more distributed and asynchronous (in the sense that their system clocks are not synchronized with one another), resulting in increasingly variable and unpredictable execution. While these effects are recognized as critical hindrances to HPC performance, their mechanisms are not yet fully understood. What is known, however, is that much of the HPC infrastructure exists to cope with inefficiency arising from asynchrony, variability, and unpredictability, which has produced a deep and complex hardware/software support stack. The project team's hypothesis is that while each element of the stack provides a local solution, it may also exacerbate the global problem: the accumulated complexity has produced more variability, not less, and has made its causes harder to determine. This project explores the possibility of reversing the trend of ever-increasing complexity by removing and simplifying support layers. The gains achievable by this strategy alone remain limited, however, as long as the underlying cause, execution asynchrony, is left unaddressed. The approach therefore begins by leveraging recently developed technology that keeps clocks extremely accurate even when they are distributed on a planetary scale. Such accurate, distributed clocks underpin a virtuous cycle: synchrony establishes baseline predictability, predictability in turn reduces variability, and each stage of the cycle enables a further reduction in the complexity of the support stack.
A benefit of this approach is that the individual steps are largely simple and can be applied directly to existing software systems. This one-year project aims to obtain early findings and practical demonstrations of the importance of synchrony and predictability for increasing HPC compute efficiency and thereby improving large-scale program execution. The project comprises five tasks. The first is to demonstrate the feasibility of accurate clock distribution by augmenting existing HPC network infrastructure. The second is to demonstrate how synchrony can establish a virtuous cycle that enables simplification of the software/system support stack. The third is to devise mechanisms to model, measure, and validate systems using the proposed methods. The fourth is to investigate the relative benefits of applying the synchrony-based virtuous cycle to various application classes. The fifth is to demonstrate the overall efficacy of the proposed approach through a case study involving a production application. Overall, the project seeks to determine whether the added synchronization provided by accurate clocks enables significant improvements in how efficiently HPC computations use computational resources.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project Outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Anthony Skjellum
Understanding GPU Triggering APIs for MPI+X Communication
- Published: 2024
- Authors: Patrick G. Bridges; Anthony Skjellum; E. Suggs; Derek Schafer; P. Bangalore
- Corresponding author: P. Bangalore
MitM attacks on intellectual property and integrity of additive manufacturing systems: A security analysis
- DOI: 10.1016/j.cose.2024.103810
- Published: 2024-05-01
- Impact factor: 5.400
- Authors: Hamza Alkofahi; Heba Alawneh; Anthony Skjellum
- Corresponding author: Anthony Skjellum
Other grants to Anthony Skjellum
SPX: Collaborative Research: Intelligent Communication Fabrics to Facilitate Extreme Scale Computing
- Award number: 2412182
- Fiscal year: 2023
- Amount: $74,500
- Award type: Standard Grant
Beginnings: Creating and Sustaining a Diverse Community of Expertise in Quantum Information Science (EQUIS) Across the Southeastern United States
- Award number: 2414461
- Fiscal year: 2023
- Amount: $74,500
- Award type: Cooperative Agreement
Collaborative Research: EAGER: Real-time Strategies and Synchronized Time Distribution Mechanisms for Enhanced Exascale Performance-Portability and Predictability
- Award number: 2151020
- Fiscal year: 2022
- Amount: $74,500
- Award type: Standard Grant
CC* Networking Infrastructure: Advancing High-speed Networking at UTC for Research and Education
- Award number: 1925598
- Fiscal year: 2019
- Amount: $74,500
- Award type: Standard Grant
SPX: Collaborative Research: Intelligent Communication Fabrics to Facilitate Extreme Scale Computing
- Award number: 1918987
- Fiscal year: 2019
- Amount: $74,500
- Award type: Standard Grant
Collaborative Research: Software Engineering Workforce Development in High Performance Computing for Digital Twins
- Award number: 1935628
- Fiscal year: 2019
- Amount: $74,500
- Award type: Standard Grant
CC* Compute: A Cost-Effective, 2,048 Core InfiniBand Cluster at UTC for Campus Research and Education
- Award number: 1925603
- Fiscal year: 2019
- Amount: $74,500
- Award type: Standard Grant
Collaborative Research: CICI: Regional: SouthEast SciEntific Cybersecurity for University Research (SouthEast SECURE)
- Award number: 1812404
- Fiscal year: 2017
- Amount: $74,500
- Award type: Standard Grant
SHF: Medium: Collaborative Research: Next-Generation Message Passing for Parallel Programming: Resiliency, Time-to-Solution, Performance-Portability, Scalability, and QoS
- Award number: 1822191
- Fiscal year: 2017
- Amount: $74,500
- Award type: Continuing Grant
SHF: Small: Collaborative Research: Coupling Computation and Communication in FPGA-Enhanced Clouds and Clusters
- Award number: 1821431
- Fiscal year: 2017
- Amount: $74,500
- Award type: Standard Grant
Similar NSFC (National Natural Science Foundation of China) grants
Research on Quantum Field Theory without a Lagrangian Description
- Award number: 24ZR1403900
- Year approved: 2024
- Amount: CNY 0
- Category: Provincial/municipal project
Cell Research
- Award number: 31224802
- Year approved: 2012
- Amount: CNY 240,000
- Category: Special fund project
Cell Research
- Award number: 31024804
- Year approved: 2010
- Amount: CNY 240,000
- Category: Special fund project
Cell Research
- Award number: 30824808
- Year approved: 2008
- Amount: CNY 240,000
- Category: Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
- Award number: 10774081
- Year approved: 2007
- Amount: CNY 450,000
- Category: General program
Similar overseas grants
Collaborative Research: EAGER: IMPRESS-U: Groundwater Resilience Assessment through iNtegrated Data Exploration for Ukraine (GRANDE-U)
- Award number: 2409395
- Fiscal year: 2024
- Amount: $74,500
- Award type: Standard Grant
EAGER/Collaborative Research: An LLM-Powered Framework for G-Code Comprehension and Retrieval
- Award number: 2347624
- Fiscal year: 2024
- Amount: $74,500
- Award type: Standard Grant
EAGER/Collaborative Research: Revealing the Physical Mechanisms Underlying the Extraordinary Stability of Flying Insects
- Award number: 2344215
- Fiscal year: 2024
- Amount: $74,500
- Award type: Standard Grant
Collaborative Research: EAGER: Designing Nanomaterials to Reveal the Mechanism of Single Nanoparticle Photoemission Intermittency
- Award number: 2345581
- Fiscal year: 2024
- Amount: $74,500
- Award type: Standard Grant
Collaborative Research: EAGER: Designing Nanomaterials to Reveal the Mechanism of Single Nanoparticle Photoemission Intermittency
- Award number: 2345582
- Fiscal year: 2024
- Amount: $74,500
- Award type: Standard Grant
Collaborative Research: EAGER: Designing Nanomaterials to Reveal the Mechanism of Single Nanoparticle Photoemission Intermittency
- Award number: 2345583
- Fiscal year: 2024
- Amount: $74,500
- Award type: Standard Grant
Collaborative Research: EAGER: The next crisis for coral reefs is how to study vanishing coral species; AUVs equipped with AI may be the only tool for the job
- Award number: 2333604
- Fiscal year: 2024
- Amount: $74,500
- Award type: Standard Grant
Collaborative Research: EAGER: Energy for persistent sensing of carbon dioxide under near shore waves.
- Award number: 2339062
- Fiscal year: 2024
- Amount: $74,500
- Award type: Standard Grant
Collaborative Research: EAGER: The next crisis for coral reefs is how to study vanishing coral species; AUVs equipped with AI may be the only tool for the job
- Award number: 2333603
- Fiscal year: 2024
- Amount: $74,500
- Award type: Standard Grant
EAGER/Collaborative Research: An LLM-Powered Framework for G-Code Comprehension and Retrieval
- Award number: 2347623
- Fiscal year: 2024
- Amount: $74,500
- Award type: Standard Grant