Collaborative Research: EAGER: Real-time Strategies and Synchronized Time Distribution Mechanisms for Enhanced Exascale Performance-Portability and Predictability

合作研究:EAGER:实时策略和同步时间分配机制,以增强百亿亿次性能-可移植性和可预测性

基本信息

  • 批准号:
    2151020
  • 负责人:
  • 金额:
    $ 7.45万
  • 依托单位:
  • 依托单位国家:
    美国
  • 项目类别:
    Standard Grant
  • 财政年份:
    2022
  • 资助国家:
    美国
  • 起止时间:
    2022-06-01 至 2023-12-31
  • 项目状态:
    已结题

项目摘要

Advances throughout science and engineering have for several decades been driven by High Performance Computing (HPC), with the pace of discovery accelerating in concert with continued innovation in computing capabilities. But as semiconductor technology now faces fundamental physical limits, even while large-scale systems are reaching warehouse scales, new approaches are becoming essential to achieving efficient use of computing resources. In particular, given this divergence of scales, HPC systems have necessarily become more distributed and asynchronous (in the sense that system clocks are asynchronous), resulting in increasingly variable and unpredictable execution. While these effects are recognized as critical hindrances to HPC performance, the mechanisms are not yet fully understood. What is known, however, is that much HPC infrastructure is tasked with dealing with inefficiency derived from asynchrony, variability, and unpredictability, leading to a deep and complex hardware/software support stack. The project team's hypothesis is that while each stack element provides a local solution, it may also exacerbate the global problem: that complexity has resulted in more variability, not less, and made determining its causes more difficult. This project explores the possibility of reversing the trend of ever-increasing complexity by removing and simplifying support layers. This strategy’s achievable gains remain limited, however, while the underlying cause, execution asynchrony, remains unaddressed. The approach begins by leveraging recently developed technology that enables clocks to remain extremely accurate even when distributed on a planetary scale. Such accurate, distributed clocks serve to underpin a virtuous cycle where synchrony establishes baseline predictability, which, in turn, reduces variability, and at each stage of the cycle enables reduction in the complexity of the support stack. A benefit of this approach is that the individual steps are largely simple and can be applied directly to existing software systems. This one-year project aims to obtain early findings and practical demonstrations for the importance of synchrony and predictability to increase HPC compute efficiency and thereby improve large-scale program execution. Five tasks are conducted. The first is to demonstrate the feasibility of accurate clock distribution by augmenting existing HPC network infrastructure. The second is to demonstrate the application of synchrony in the establishing a virtuous cycle enabling simplifications to the software/system support stack. The third is to devise mechanisms to model, measure, and validate systems using the proposed methods. The fourth is to investigate the relative benefits of applying the synchrony-based virtuous cycle with respect to various application classes. The fifth is to demonstrate the overall efficacy of the proposed approach through a case study involving a production application. Overall, the project works to determine whether added synchronization through accurate clocks enables significant improvements to HPC computations in terms of how efficiently they use computational resources.This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
几十年来,高性能计算(HPC)一直推动着科学和工程领域的进步,随着计算能力的不断创新,发现的步伐不断加快。但是,由于半导体技术现在面临着基本的物理限制,即使大规模系统正在达到仓库规模,新的方法对于实现计算资源的有效利用也变得至关重要。特别是,考虑到这种规模的差异,HPC系统必然变得更加分布式和异步(在系统时钟异步的意义上),导致越来越多的可变和不可预测的执行。虽然这些影响被认为是HPC性能的关键障碍,但其机制尚未完全了解。然而,众所周知的是,许多HPC基础设施的任务是处理由复杂性、可变性和不可预测性导致的低效率,从而导致深度和复杂的硬件/软件支持堆栈。项目团队的假设是,虽然每个堆栈元素提供了一个局部解决方案,但它也可能加剧全局问题:复杂性导致了更多的可变性,而不是更少,并使确定其原因变得更加困难。该项目探讨了通过删除和简化支持层来扭转日益复杂的趋势的可能性。然而,这一战略可实现的收益仍然有限,而执行不力这一根本原因仍然没有得到解决。该方法首先利用最近开发的技术,使时钟即使分布在行星尺度上也能保持非常准确。这种精确的分布式时钟用于支持良性循环,其中同步建立基线可预测性,这反过来又降低了可变性,并且在循环的每个阶段都能够降低支持堆栈的复杂性。这种方法的一个好处是,各个步骤非常简单,可以直接应用于现有的软件系统。这个为期一年的项目旨在获得同步和可预测性的重要性的早期发现和实际演示,以提高HPC计算效率,从而改善大规模程序执行。开展五项任务。第一个是通过增强现有HPC网络基础设施来证明精确时钟分配的可行性。第二个是展示同步在建立一个良性循环中的应用,从而简化软件/系统支持堆栈。第三是设计机制,使用所提出的方法来建模,测量和验证系统。第四是调查应用基于同步的良性循环相对于各种应用类的相对效益。第五个是通过一个涉及生产应用程序的案例研究来证明所提出的方法的总体有效性。总体而言,该项目旨在确定通过精确时钟增加的同步是否能够在高效利用计算资源方面显著改善HPC计算。该奖项反映了NSF的法定使命,并通过使用基金会的智力价值和更广泛的影响审查标准进行评估,被认为值得支持。

项目成果

期刊论文数量(1)
专著数量(0)
科研奖励数量(0)
会议论文数量(0)
专利数量(0)
The Viability of Using Online Prediction to Perform Extra Work while Executing BSP Applications
在执行 BSP 应用程序时使用在线预测执行额外工作的可行性
{{ item.title }}
{{ item.translation_title }}
  • DOI:
    {{ item.doi }}
  • 发表时间:
    {{ item.publish_year }}
  • 期刊:
  • 影响因子:
    {{ item.factor }}
  • 作者:
    {{ item.authors }}
  • 通讯作者:
    {{ item.author }}

数据更新时间:{{ journalArticles.updateTime }}

{{ item.title }}
  • 作者:
    {{ item.author }}

数据更新时间:{{ monograph.updateTime }}

{{ item.title }}
  • 作者:
    {{ item.author }}

数据更新时间:{{ sciAawards.updateTime }}

{{ item.title }}
  • 作者:
    {{ item.author }}

数据更新时间:{{ conferencePapers.updateTime }}

{{ item.title }}
  • 作者:
    {{ item.author }}

数据更新时间:{{ patent.updateTime }}

Anthony Skjellum其他文献

Understanding GPU Triggering APIs for MPI+X Communication
了解用于 MPI X 通信的 GPU 触发 API
  • DOI:
  • 发表时间:
    2024
  • 期刊:
  • 影响因子:
    0
  • 作者:
    Patrick G. Bridges;Anthony Skjellum;E. Suggs;Derek Schafer;P. Bangalore
  • 通讯作者:
    P. Bangalore
MitM attacks on intellectual property and integrity of additive manufacturing systems: A security analysis
针对增材制造系统的知识产权和完整性的中间人攻击:安全分析
  • DOI:
    10.1016/j.cose.2024.103810
  • 发表时间:
    2024-05-01
  • 期刊:
  • 影响因子:
    5.400
  • 作者:
    Hamza Alkofahi;Heba Alawneh;Anthony Skjellum
  • 通讯作者:
    Anthony Skjellum

Anthony Skjellum的其他文献

{{ item.title }}
{{ item.translation_title }}
  • DOI:
    {{ item.doi }}
  • 发表时间:
    {{ item.publish_year }}
  • 期刊:
  • 影响因子:
    {{ item.factor }}
  • 作者:
    {{ item.authors }}
  • 通讯作者:
    {{ item.author }}

{{ truncateString('Anthony Skjellum', 18)}}的其他基金

SPX: Collaborative Research: Intelligent Communication Fabrics to Facilitate Extreme Scale Computing
SPX:协作研究:促进超大规模计算的智能通信结构
  • 批准号:
    2412182
  • 财政年份:
    2023
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Standard Grant
Collaborative Research: EAGER: Real-time Strategies and Synchronized Time Distribution Mechanisms for Enhanced Exascale Performance-Portability and Predictability
合作研究:EAGER:实时策略和同步时间分配机制,以增强百亿亿次性能-可移植性和可预测性
  • 批准号:
    2405142
  • 财政年份:
    2023
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Standard Grant
Beginnings: Creating and Sustaining a Diverse Community of Expertise in Quantum Information Science (EQUIS) Across the Southeastern United States
起点:在美国东南部创建并维持一个多元化的量子信息科学 (EQUIS) 专业社区
  • 批准号:
    2414461
  • 财政年份:
    2023
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Cooperative Agreement
CC* Networking Infrastructure: Advancing High-speed Networking at UTC for Research and Education
CC* 网络基础设施:推进 UTC 的研究和教育高速网络
  • 批准号:
    1925598
  • 财政年份:
    2019
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Standard Grant
SPX: Collaborative Research: Intelligent Communication Fabrics to Facilitate Extreme Scale Computing
SPX:协作研究:促进超大规模计算的智能通信结构
  • 批准号:
    1918987
  • 财政年份:
    2019
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Standard Grant
Collaborative Research: Software Engineering Workforce Development in High Performance Computing for Digital Twins
协作研究:数字孪生高性能计算中的软件工程劳动力开发
  • 批准号:
    1935628
  • 财政年份:
    2019
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Standard Grant
CC* Compute: A Cost-Effective, 2,048 Core InfiniBand Cluster at UTC for Campus Research and Education
CC* 计算:UTC 的具有成本效益的 2,048 核心 InfiniBand 集群,用于校园研究和教育
  • 批准号:
    1925603
  • 财政年份:
    2019
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Standard Grant
Collaborative Research: CICI: Regional: SouthEast SciEntific Cybersecurity for University Research (SouthEast SECURE)
合作研究:CICI:区域:东南大学研究科学网络安全 (SouthEast SECURE)
  • 批准号:
    1812404
  • 财政年份:
    2017
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Standard Grant
SHF: Medium: Collaborative Research: Next-Generation Message Passing for Parallel Programming: Resiliency, Time-to-Solution, Performance-Portability, Scalability, and QoS
SHF:中:协作研究:并行编程的下一代消息传递:弹性、解决时间、性能可移植性、可扩展性和 QoS
  • 批准号:
    1822191
  • 财政年份:
    2017
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Continuing Grant
CICI: Data Provenance: Collaborative Research: Provenance Assurance Using Currency Primitives
CICI:数据来源:协作研究:使用货币基元的来源保证
  • 批准号:
    1821926
  • 财政年份:
    2017
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Standard Grant

相似国自然基金

Research on Quantum Field Theory without a Lagrangian Description
  • 批准号:
    24ZR1403900
  • 批准年份:
    2024
  • 资助金额:
    0.0 万元
  • 项目类别:
    省市级项目
Cell Research
  • 批准号:
    31224802
  • 批准年份:
    2012
  • 资助金额:
    24.0 万元
  • 项目类别:
    专项基金项目
Cell Research
  • 批准号:
    31024804
  • 批准年份:
    2010
  • 资助金额:
    24.0 万元
  • 项目类别:
    专项基金项目
Cell Research (细胞研究)
  • 批准号:
    30824808
  • 批准年份:
    2008
  • 资助金额:
    24.0 万元
  • 项目类别:
    专项基金项目
Research on the Rapid Growth Mechanism of KDP Crystal
  • 批准号:
    10774081
  • 批准年份:
    2007
  • 资助金额:
    45.0 万元
  • 项目类别:
    面上项目

相似海外基金

Collaborative Research: EAGER: The next crisis for coral reefs is how to study vanishing coral species; AUVs equipped with AI may be the only tool for the job
合作研究:EAGER:珊瑚礁的下一个危机是如何研究正在消失的珊瑚物种;
  • 批准号:
    2333604
  • 财政年份:
    2024
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Standard Grant
EAGER/Collaborative Research: An LLM-Powered Framework for G-Code Comprehension and Retrieval
EAGER/协作研究:LLM 支持的 G 代码理解和检索框架
  • 批准号:
    2347624
  • 财政年份:
    2024
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Standard Grant
EAGER/Collaborative Research: Revealing the Physical Mechanisms Underlying the Extraordinary Stability of Flying Insects
EAGER/合作研究:揭示飞行昆虫非凡稳定性的物理机制
  • 批准号:
    2344215
  • 财政年份:
    2024
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Standard Grant
Collaborative Research: EAGER: Designing Nanomaterials to Reveal the Mechanism of Single Nanoparticle Photoemission Intermittency
合作研究:EAGER:设计纳米材料揭示单纳米粒子光电发射间歇性机制
  • 批准号:
    2345581
  • 财政年份:
    2024
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Standard Grant
Collaborative Research: EAGER: Designing Nanomaterials to Reveal the Mechanism of Single Nanoparticle Photoemission Intermittency
合作研究:EAGER:设计纳米材料揭示单纳米粒子光电发射间歇性机制
  • 批准号:
    2345582
  • 财政年份:
    2024
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Standard Grant
Collaborative Research: EAGER: Designing Nanomaterials to Reveal the Mechanism of Single Nanoparticle Photoemission Intermittency
合作研究:EAGER:设计纳米材料揭示单纳米粒子光电发射间歇性机制
  • 批准号:
    2345583
  • 财政年份:
    2024
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Standard Grant
Collaborative Research: EAGER: Energy for persistent sensing of carbon dioxide under near shore waves.
合作研究:EAGER:近岸波浪下持续感知二氧化碳的能量。
  • 批准号:
    2339062
  • 财政年份:
    2024
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Standard Grant
Collaborative Research: EAGER: IMPRESS-U: Groundwater Resilience Assessment through iNtegrated Data Exploration for Ukraine (GRANDE-U)
合作研究:EAGER:IMPRESS-U:通过乌克兰综合数据探索进行地下水恢复力评估 (GRANDE-U)
  • 批准号:
    2409395
  • 财政年份:
    2024
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Standard Grant
Collaborative Research: EAGER: The next crisis for coral reefs is how to study vanishing coral species; AUVs equipped with AI may be the only tool for the job
合作研究:EAGER:珊瑚礁的下一个危机是如何研究正在消失的珊瑚物种;
  • 批准号:
    2333603
  • 财政年份:
    2024
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Standard Grant
EAGER/Collaborative Research: An LLM-Powered Framework for G-Code Comprehension and Retrieval
EAGER/协作研究:LLM 支持的 G 代码理解和检索框架
  • 批准号:
    2347623
  • 财政年份:
    2024
  • 资助金额:
    $ 7.45万
  • 项目类别:
    Standard Grant
{{ showInfoDetail.title }}

作者:{{ showInfoDetail.author }}

知道了