SHF: Medium: Collaborative Research: Next-Generation Message Passing for Parallel Programming: Resiliency, Time-to-Solution, Performance-Portability, Scalability, and QoS
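Basic Information
- Award number: 1822191
- Principal investigator:
- Amount: $523,700
- Host institution:
- Host institution country: United States
- Award type: Continuing Grant
- Fiscal year: 2017
- Funding country: United States
- Duration: 2017-10-01 to 2022-05-31
- Project status: Completed
- Source:
- Keywords:
Project Abstract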
Parallel programming based on MPI is being used with increasing frequency in academia and government (defense and non-defense uses), as well as in emerging applications such as scalable machine learning and big data analytics. Emerging supercomputer systems will have more faults, and MPI needs to be able to work around such faults rather than causing an entire application to fail. Collaborative, transformative message passing research for High Performance Computing (HPC) that is critical to performance-portable parallel programming in new and forthcoming scalable systems (with a strategy of "best practice-first, standardization-later") is being reduced to practice. A substantial subset of the Message Passing Interface (MPI-3/4) application programmer interface is being made fault tolerant through extensions with weak collective transactions that synchronize between parallel tasks. This research studies a novel model that localizes faults, provides tunable fault-free overhead, allows for multiple kinds of faults, enables hierarchical recovery, and is relevant to data-parallel programs. Fault modeling of underlying networks is being studied. Application developers control the granularity and fault-free overhead in this effort. Performance and scalability results of the middleware prototype are being demonstrated principally through compact applications that relate to real use cases of practical and academic interest. The impact of this work ranges from users of the largest supercomputers in government labs to practical clusters that run long-running, time-critical applications, and to space-based and other parallel processing in "hostile" environments where faults occur more frequently than in past years. The project is producing usable free software that will be widely shared in the community, as well as guidance on how better parallel programs can be written in academia, industry, and government.
The project also provides guidelines for how to update existing or legacy programs to use the new capabilities that are being reduced to practice.
Project Outcomes
Journal papers (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Design of a Portable Implementation of Partitioned Point-to-Point Communication Primitives
- DOI: 10.1145/3458744.3474046
- Published: 2021
- Journal:
- Impact factor: 0
- Authors: Worley, Andrew;Prema Soundararajan, Prema;Schafer, Derek;Bangalore, Purushotham;Grant, Ryan;Dosanjh, Matthew;Skjellum, Anthony;Ghafoor, Sheikh
- Corresponding author: Ghafoor, Sheikh
Other publications by Anthony Skjellum
Understanding GPU Triggering APIs for MPI+X Communication
- DOI:
- Published: 2024
- Journal:
- Impact factor: 0
- Authors: Patrick G. Bridges;Anthony Skjellum;E. Suggs;Derek Schafer;P. Bangalore
- Corresponding author: P. Bangalore
MitM attacks on intellectual property and integrity of additive manufacturing systems: A security analysis
- DOI: 10.1016/j.cose.2024.103810
- Published: 2024-05-01
- Journal:
- Impact factor: 5.400
- Authors: Hamza Alkofahi;Heba Alawneh;Anthony Skjellum
- Corresponding author: Anthony Skjellum
Other grants by Anthony Skjellum
SPX: Collaborative Research: Intelligent Communication Fabrics to Facilitate Extreme Scale Computing
- Award number: 2412182
- Fiscal year: 2023
- Amount: $523,700
- Award type: Standard Grant
Collaborative Research: EAGER: Real-time Strategies and Synchronized Time Distribution Mechanisms for Enhanced Exascale Performance-Portability and Predictability
- Award number: 2405142
- Fiscal year: 2023
- Amount: $523,700
- Award type: Standard Grant
Beginnings: Creating and Sustaining a Diverse Community of Expertise in Quantum Information Science (EQUIS) Across the Southeastern United States
- Award number: 2414461
- Fiscal year: 2023
- Amount: $523,700
- Award type: Cooperative Agreement
Collaborative Research: EAGER: Real-time Strategies and Synchronized Time Distribution Mechanisms for Enhanced Exascale Performance-Portability and Predictability
- Award number: 2151020
- Fiscal year: 2022
- Amount: $523,700
- Award type: Standard Grant
CC* Networking Infrastructure: Advancing High-speed Networking at UTC for Research and Education
- Award number: 1925598
- Fiscal year: 2019
- Amount: $523,700
- Award type: Standard Grant
SPX: Collaborative Research: Intelligent Communication Fabrics to Facilitate Extreme Scale Computing
- Award number: 1918987
- Fiscal year: 2019
- Amount: $523,700
- Award type: Standard Grant
Collaborative Research: Software Engineering Workforce Development in High Performance Computing for Digital Twins
- Award number: 1935628
- Fiscal year: 2019
- Amount: $523,700
- Award type: Standard Grant
CC* Compute: A Cost-Effective, 2,048 Core InfiniBand Cluster at UTC for Campus Research and Education
- Award number: 1925603
- Fiscal year: 2019
- Amount: $523,700
- Award type: Standard Grant
Collaborative Research: CICI: Regional: SouthEast SciEntific Cybersecurity for University Research (SouthEast SECURE)
- Award number: 1812404
- Fiscal year: 2017
- Amount: $523,700
- Award type: Standard Grant
CICI: Data Provenance: Collaborative Research: Provenance Assurance Using Currency Primitives
- Award number: 1821926
- Fiscal year: 2017
- Amount: $523,700
- Award type: Standard Grant
Similar overseas grants
Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
- Award number: 2403134
- Fiscal year: 2024
- Amount: $523,700
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Enabling Graphics Processing Unit Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
- Award number: 2402804
- Fiscal year: 2024
- Amount: $523,700
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Tiny Chiplets for Big AI: A Reconfigurable-On-Package System
- Award number: 2403408
- Fiscal year: 2024
- Amount: $523,700
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Toward Understandability and Interpretability for Neural Language Models of Source Code
- Award number: 2423813
- Fiscal year: 2024
- Amount: $523,700
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Enabling GPU Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
- Award number: 2402806
- Fiscal year: 2024
- Amount: $523,700
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
- Award number: 2403135
- Fiscal year: 2024
- Amount: $523,700
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Tiny Chiplets for Big AI: A Reconfigurable-On-Package System
- Award number: 2403409
- Fiscal year: 2024
- Amount: $523,700
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Enabling GPU Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
- Award number: 2402805
- Fiscal year: 2024
- Amount: $523,700
- Award type: Standard Grant
Collaborative Research: SHF: Medium: High-Performance, Verified Accelerator Programming
- Award number: 2313024
- Fiscal year: 2023
- Amount: $523,700
- Award type: Standard Grant
Collaborative Research: SHF: Medium: Verifying Deep Neural Networks with Spintronic Probabilistic Computers
- Award number: 2311295
- Fiscal year: 2023
- Amount: $523,700
- Award type: Continuing Grant