Collaborative Research: OAC: Core: Harvesting Idle Resources Safely and Timely for Large-scale AI Applications in High-Performance Computing Systems

Basic Information

Project Abstract

Supercomputers, or high-performance computing (HPC) clusters, are instrumental in propelling scientific and engineering research by offering vast computational resources. These systems are increasingly crucial as artificial intelligence (AI) techniques become pervasive across various fields, including climate modeling, drug discovery, and physics simulations, significantly expanding the need for computational power and data management. However, the existing HPC infrastructures face challenges with extended job wait times and suboptimal resource use, primarily due to the escalating complexity of computations and the burgeoning demands for resources. Unlike traditional HPC tasks, AI algorithms and models exhibit distinct resource requirements, often resulting in either excess or insufficient resource allocation for AI tasks. This project aims to bridge the gap between HPC resource provisioning and AI application demands through an in-depth analysis of resource allocation and utilization within HPC environments running AI workloads. The goal is to identify strategies for minimizing resource waste and reducing the length of job queues by efficiently reallocating idle resources to accommodate large-scale AI tasks. By creating and disseminating datasets, models, algorithms, and system source code, this initiative will contribute valuable tools and insights to the research community. The findings will be broadly shared through research papers, technical reports, book chapters, course materials, and tutorials, enhancing the knowledge base in both HPC and AI fields and supporting the broader objectives of promoting scientific progress, improving national health, prosperity, and welfare, and contributing to national defense. This project centers on advancing the efficiency and productivity of HPC systems by innovatively leveraging idle resources to expedite AI job processing and diminish waiting periods. 
The research is structured around three interconnected themes, each addressing critical aspects of resource utilization and AI performance enhancement within HPC environments. The initial theme undertakes a comprehensive analysis of idle resources in HPC systems, aiming to identify patterns and opportunities for resource optimization. Building on the insights gained, the second theme explores methodologies for the safe and timely harvesting of idle resources across various categories, ensuring that these resources can be reallocated without compromising system stability or performance. The third theme is dedicated to developing strategies that utilize these harvested resources to boost AI application outcomes significantly and, by extension, enhance the overall productivity of HPC operations. Alongside these thematic investigations, the project will implement a tangible HPC testbed equipped with real-world benchmarks and workloads. This testbed will serve as a platform for empirically validating developed algorithms and systems, facilitating a rigorous assessment of their effectiveness in improving HPC resource allocation and utilization. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Outcomes

Journal papers (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Seung-Jong Park

Quality changes in Pteridium aquilinum and the root of Platycodon grandiflorum frozen under different conditions
  • DOI:
    10.1016/j.ijrefrig.2014.04.004
  • Published:
    2014-07-01
  • Authors:
    Seung-Jong Park;Mohammad Al Mijan;Kyung Bin Song
  • Corresponding author:
    Kyung Bin Song
Energy-Aware Topology Control and Data Delivery in Wireless Sensor Networks

Other grants by Seung-Jong Park

IPA Agreement with Louisiana State University 1st year (Park 2021)
  • Award number:
    2120248
  • Fiscal year:
    2021
  • Funding amount:
    $300,000
  • Project category:
    Intergovernmental Personnel Award
SCC-Planning: Promoting Smart Technologies in Public Safety and Transportation to Improve Social and Economic Outcomes in a US EDA-Designated Critical Manufacturing Region
  • Award number:
    1737557
  • Fiscal year:
    2017
  • Funding amount:
    $300,000
  • Project category:
    Standard Grant
MRI: Acquisition of SuperMIC -- A Heterogeneous Computing Environment to Enable Transformation of Computational Research and Education in the State of Louisiana
  • Award number:
    1338051
  • Fiscal year:
    2013
  • Funding amount:
    $300,000
  • Project category:
    Standard Grant
CC-NIE Integration: Bridging, Transferring and Analyzing Big Data over 10Gbps Campus-Wide Software Defined Networks
  • Award number:
    1341008
  • Fiscal year:
    2013
  • Funding amount:
    $300,000
  • Project category:
    Standard Grant
MRI: CRON: Development of a Cyberinfrastructure Reconfigurable Optical Network for Large-Scale Scientific Discovery
  • Award number:
    0821741
  • Fiscal year:
    2008
  • Funding amount:
    $300,000
  • Project category:
    Standard Grant

Similar NSFC (National Natural Science Foundation of China) grants

Research on Quantum Field Theory without a Lagrangian Description
  • Award number:
    24ZR1403900
  • Approval year:
    2024
  • Funding amount:
    CNY 0
  • Project category:
    Provincial/municipal project
Cell Research
  • Award number:
    31224802
  • Approval year:
    2012
  • Funding amount:
    CNY 240,000
  • Project category:
    Special fund project
Cell Research
  • Award number:
    31024804
  • Approval year:
    2010
  • Funding amount:
    CNY 240,000
  • Project category:
    Special fund project
Cell Research (细胞研究)
  • Award number:
    30824808
  • Approval year:
    2008
  • Funding amount:
    CNY 240,000
  • Project category:
    Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
  • Award number:
    10774081
  • Approval year:
    2007
  • Funding amount:
    CNY 450,000
  • Project category:
    General Program

Similar overseas grants

Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
  • Award number:
    2403312
  • Fiscal year:
    2024
  • Funding amount:
    $300,000
  • Project category:
    Standard Grant
Collaborative Research: OAC CORE: Federated-Learning-Driven Traffic Event Management for Intelligent Transportation Systems
  • Award number:
    2414474
  • Fiscal year:
    2024
  • Funding amount:
    $300,000
  • Project category:
    Standard Grant
Collaborative Research: OAC Core: Large-Scale Spatial Machine Learning for 3D Surface Topology in Hydrological Applications
  • Award number:
    2414185
  • Fiscal year:
    2024
  • Funding amount:
    $300,000
  • Project category:
    Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
  • Award number:
    2402947
  • Fiscal year:
    2024
  • Funding amount:
    $300,000
  • Project category:
    Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
  • Award number:
    2403313
  • Fiscal year:
    2024
  • Funding amount:
    $300,000
  • Project category:
    Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
  • Award number:
    2402946
  • Fiscal year:
    2024
  • Funding amount:
    $300,000
  • Project category:
    Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403088
  • Fiscal year:
    2024
  • Funding amount:
    $300,000
  • Project category:
    Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403090
  • Fiscal year:
    2024
  • Funding amount:
    $300,000
  • Project category:
    Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403089
  • Fiscal year:
    2024
  • Funding amount:
    $300,000
  • Project category:
    Standard Grant
Collaborative Research: OAC: Core: Harvesting Idle Resources Safely and Timely for Large-scale AI Applications in High-Performance Computing Systems
  • Award number:
    2403398
  • Fiscal year:
    2024
  • Funding amount:
    $300,000
  • Project category:
    Standard Grant