Collaborative Research: OAC: Core: Harvesting Idle Resources Safely and Timely for Large-scale AI Applications in High-Performance Computing Systems

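Basic information

  • Award number:
    2403398
  • Principal investigator:
  • Amount:
    $300,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2024
  • Funding country:
    United States
  • Project period:
    2024-07-01 to 2027-06-30
  • Project status:
    Ongoing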

Project abstract

Supercomputers, or high-performance computing (HPC) clusters, are instrumental in propelling scientific and engineering research by offering vast computational resources. These systems are increasingly crucial as artificial intelligence (AI) techniques become pervasive across various fields, including climate modeling, drug discovery, and physics simulations, significantly expanding the need for computational power and data management. However, the existing HPC infrastructures face challenges with extended job wait times and suboptimal resource use, primarily due to the escalating complexity of computations and the burgeoning demands for resources. Unlike traditional HPC tasks, AI algorithms and models exhibit distinct resource requirements, often resulting in either excess or insufficient resource allocation for AI tasks. This project aims to bridge the gap between HPC resource provisioning and AI application demands through an in-depth analysis of resource allocation and utilization within HPC environments running AI workloads. The goal is to identify strategies for minimizing resource waste and reducing the length of job queues by efficiently reallocating idle resources to accommodate large-scale AI tasks. By creating and disseminating datasets, models, algorithms, and system source code, this initiative will contribute valuable tools and insights to the research community. The findings will be broadly shared through research papers, technical reports, book chapters, course materials, and tutorials, enhancing the knowledge base in both HPC and AI fields and supporting the broader objectives of promoting scientific progress, improving national health, prosperity, and welfare, and contributing to national defense.
This project centers on advancing the efficiency and productivity of HPC systems by innovatively leveraging idle resources to expedite AI job processing and diminish waiting periods. The research is structured around three interconnected themes, each addressing critical aspects of resource utilization and AI performance enhancement within HPC environments. The initial theme undertakes a comprehensive analysis of idle resources in HPC systems, aiming to identify patterns and opportunities for resource optimization. Building on the insights gained, the second theme explores methodologies for the safe and timely harvesting of idle resources across various categories, ensuring that these resources can be reallocated without compromising system stability or performance. The third theme is dedicated to developing strategies that utilize these harvested resources to boost AI application outcomes significantly and, by extension, enhance the overall productivity of HPC operations. The project will implement a tangible HPC testbed equipped with real-world benchmarks and workloads alongside these thematic investigations. This testbed will serve as a platform for empirically validating developed algorithms and systems, facilitating a rigorous assessment of their effectiveness in improving HPC resource allocation and utilization.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Hao Wang

Oxidative stress increases the 17,20-lyase-catalyzing activity of adrenal P450c17 through p38α in the development of hyperandrogenism
  • DOI:
    10.1016/j.mce.2019.01.020
  • Publication date:
    2019
  • Journal:
  • Impact factor:
    4.1
  • Authors:
    Wenjiao Zhu; Bing Han; Mengxia Fan; Nan Wang; Hao Wang; Hui Zhu; Tong Cheng; Shuangxia Zhao; Huaidong Song; Jie Qiao
  • Corresponding author:
    Jie Qiao
Interacting Superprocesses with Discontinuous Spatial Motion and their Associated SPDEs
  • DOI:
  • Publication date:
    2009
  • Journal:
  • Impact factor:
    0
  • Authors:
    Zhen; Hao Wang; J. Xiong
  • Corresponding author:
    J. Xiong
State classification for a class of measure-valued branching diffusions in a Brownian medium
  • DOI:
  • Publication date:
    1997
  • Journal:
  • Impact factor:
    0
  • Authors:
    Hao Wang
  • Corresponding author:
    Hao Wang
Weighted 3D GS algorithm for image-quality improvement of multi-plane holographic display
  • DOI:
    10.3788/cjl201239.1009001
  • Publication date:
    2012
  • Journal:
  • Impact factor:
    0
  • Authors:
    Fang. Li; Y. Bi; Hao Wang; Minyuan Sun; Xinxin Kong
  • Corresponding author:
    Xinxin Kong
Investigations into the Rock Dynamic Response under Blasting Load by an Improved DDA Approach
  • DOI:
    10.1155/2021/8827022
  • Publication date:
    2021-02
  • Journal:
  • Impact factor:
    1.8
  • Authors:
    Biting Xie; Xiuli Zhang; Hao Wang; Yuyong Jiao; Fei Zheng
  • Corresponding author:
    Fei Zheng

Other grants by Hao Wang

RII Track-4:NSF: Federated Analytics Systems with Fine-grained Knowledge Comprehension: Achieving Accuracy with Privacy
  • Award number:
    2327480
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Standard Grant
Collaborative Research: SaTC: CORE: Small: Critical Learning Periods Augmented Robust Federated Learning
  • Award number:
    2315612
  • Fiscal year:
    2023
  • Award amount:
    $300,000
  • Project type:
    Standard Grant
CRII: OAC: High-Efficiency Serverless Computing Systems for Deep Learning: A Hybrid CPU/GPU Architecture
  • Award number:
    2153502
  • Fiscal year:
    2022
  • Award amount:
    $300,000
  • Project type:
    Standard Grant
RI: Small: Enabling Interpretable AI via Bayesian Deep Learning
  • Award number:
    2127918
  • Fiscal year:
    2021
  • Award amount:
    $300,000
  • Project type:
    Continuing Grant
US-China planning visit: Development of High Performance and Multifunctional Infrastructure Material
  • Award number:
    1338297
  • Fiscal year:
    2013
  • Award amount:
    $300,000
  • Project type:
    Standard Grant
SBIR Phase II: SAFE: Behavior-based Malware Detection and Prevention
  • Award number:
    0750299
  • Fiscal year:
    2008
  • Award amount:
    $300,000
  • Project type:
    Standard Grant
SBIR Phase I: SpiderWeb - Self-Healing Networks for Spyware Detection
  • Award number:
    0638170
  • Fiscal year:
    2007
  • Award amount:
    $300,000
  • Project type:
    Standard Grant
Constructibility and Large Cardinal Numbers
  • Award number:
    7902941
  • Fiscal year:
    1979
  • Award amount:
    $300,000
  • Project type:
    Standard Grant

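Similar NSFC (National Natural Science Foundation of China) grants

Research on Quantum Field Theory without a Lagrangian Description
  • Award number:
    24ZR1403900
  • Approval year:
    2024
  • Award amount:
    CNY 0
  • Project type:
    Provincial/municipal project
Cell Research
  • Award number:
    31224802
  • Approval year:
    2012
  • Award amount:
    CNY 240,000
  • Project type:
    Special fund project
Cell Research
  • Award number:
    31024804
  • Approval year:
    2010
  • Award amount:
    CNY 240,000
  • Project type:
    Special fund project
Cell Research
  • Award number:
    30824808
  • Approval year:
    2008
  • Award amount:
    CNY 240,000
  • Project type:
    Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
  • Award number:
    10774081
  • Approval year:
    2007
  • Award amount:
    CNY 450,000
  • Project type:
    General Program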

Similar overseas grants

Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
  • Award number:
    2403312
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Standard Grant
Collaborative Research: OAC CORE: Federated-Learning-Driven Traffic Event Management for Intelligent Transportation Systems
  • Award number:
    2414474
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
  • Award number:
    2402947
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
  • Award number:
    2403313
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: Large-Scale Spatial Machine Learning for 3D Surface Topology in Hydrological Applications
  • Award number:
    2414185
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
  • Award number:
    2402946
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403088
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403090
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Standard Grant
Collaborative Research: OAC: Core: Harvesting Idle Resources Safely and Timely for Large-scale AI Applications in High-Performance Computing Systems
  • Award number:
    2403399
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403089
  • Fiscal year:
    2024
  • Award amount:
    $300,000
  • Project type:
    Standard Grant