Collaborative Research: OAC Core: Improving Utilization of High-Performance Computing Systems via Intelligent Co-scheduling
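Basic information
- Award number: 2103511
- Principal investigator:
- Amount: $250,300
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2021
- Funding country: United States
- Duration: 2021-09-01 to 2024-08-31
- Project status: Completed
- Source:
- Keywords:
Project abstract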
This project is aimed at increasing efficiency of high-performance computing systems by scheduling multiple jobs on the same set of nodes in a system, generally called co-scheduling. This is a break from current practice in which nodes are dedicated to one job at a time, which results in predictable execution time but inefficient use of system resources. To make this practical, the project will develop analyses to determine how to carry out co-scheduling such that overall system efficiency is improved while the performance impact on individual applications is minimized. In particular, the goal is to co-schedule jobs that can co-exist without contending for similar resources on the nodes.
The work done in this project will help achieve better efficiency on high-performance systems, which will impact application domains such as climate/weather, renewable energy, and national security. The work will be implemented and validated on systems at Lawrence Livermore and Sandia National Laboratories and then transitioned into software that will be used at these national laboratories. The project will also have an impact on education by integrating the techniques in this research into courses covering parallel and distributed computing at the PIs' institutions. In addition, the project will take place at two Hispanic-serving institutions, and the PIs have a history of advising under-represented students; the project will broaden participation in computing by recruiting Hispanic undergraduates to work on the project and sending them to national laboratories for internships.
The long-standing abstraction at high-end computing facilities is one of a submitted job being allocated a set of dedicated nodes. However, this makes systems much less efficient, as there are more per-node resources that will often be used inefficiently. In addition, the demand for high-end systems is increasing and dedicating nodes to jobs can increase job turnaround time and decrease overall system throughput. One way to address this problem is for supercomputer centers to break from the current common practice of assigning each job a private, isolated portion of a supercomputer.
The intellectual merit of the project is three-fold. First, novel profile analyses will be developed that will reveal the effects on jobs due to sharing nodes. Second, novel statistical projection techniques will be developed that predict scaling behavior of jobs that are utilizing shared nodes. Third, new job-level scheduling techniques will be designed that use the interference analysis and projections to choose a set of shared nodes that will lead to good job turnaround time and maximize system throughput.
The broader impact of the project is multifold. This project will help achieve better efficiency on high-performance systems, which will benefit a broad range of applications that includes climate/weather prediction, nuclear energy, and national security. Through a long-standing collaboration with both Lawrence Livermore and Sandia National Laboratories, the PIs will implement and validate the techniques on LLNL and SNL systems as well as transition the techniques into future resource managers at the national laboratories. In addition, both PIs will broaden participation in computing by recruiting Hispanic undergraduates to work on the project and sending them to national labs for internships.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
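The core idea in the abstract, co-scheduling jobs that do not contend for the same node resources, can be illustrated with a small sketch. This is not the project's method: the `JobProfile` fields, the `contention` score, the `pair_jobs` greedy strategy, the `max_contention` threshold, and all utilization numbers are hypothetical and chosen only to show the shape of the problem.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class JobProfile:
    """Hypothetical per-node utilization profile; each value is a fraction in [0, 1]."""
    name: str
    cpu: float
    memory_bandwidth: float
    network: float


def contention(a: JobProfile, b: JobProfile) -> float:
    """Illustrative score: combined demand above node capacity (1.0), summed over resources."""
    demands = [
        (a.cpu, b.cpu),
        (a.memory_bandwidth, b.memory_bandwidth),
        (a.network, b.network),
    ]
    return sum(max(0.0, x + y - 1.0) for x, y in demands)


def pair_jobs(jobs, max_contention=0.2):
    """Greedily pair jobs in order of increasing contention; unpaired jobs keep dedicated nodes."""
    candidates = sorted(combinations(jobs, 2), key=lambda pair: contention(*pair))
    placed = set()
    shared = []
    for a, b in candidates:
        if a.name in placed or b.name in placed:
            continue
        if contention(a, b) > max_contention:
            break  # candidates are sorted, so every remaining pair is at least as contended
        shared.append((a.name, b.name))
        placed.update((a.name, b.name))
    dedicated = [job.name for job in jobs if job.name not in placed]
    return shared, dedicated


# Made-up profiles: three compute-heavy jobs, one memory-heavy, one network-heavy.
jobs = [
    JobProfile("cpu_a", cpu=0.8, memory_bandwidth=0.1, network=0.1),
    JobProfile("mem_a", cpu=0.1, memory_bandwidth=0.8, network=0.1),
    JobProfile("cpu_b", cpu=0.7, memory_bandwidth=0.1, network=0.1),
    JobProfile("net_a", cpu=0.1, memory_bandwidth=0.1, network=0.7),
    JobProfile("cpu_c", cpu=0.95, memory_bandwidth=0.1, network=0.1),
]

shared, dedicated = pair_jobs(jobs)
print("shared nodes:", shared)
print("dedicated nodes:", dedicated)
```

In this toy setting the compute-heavy jobs are paired with the memory-heavy and network-heavy jobs, and the remaining compute-heavy job keeps dedicated nodes. A real scheduler would need measured interference profiles and scaling projections, as the abstract describes, rather than a fixed capacity threshold.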
Project outcomes
Journal articles (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Evaluating the Potential of Coscheduling on High-Performance Computing Systems
- DOI:
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Hall, Jason; Lathi, Arjun; Lowenthal, David K; Patki, Tapasya
- Corresponding author: Patki, Tapasya
Other publications by David Lowenthal
COMO CONHECEMOS O PASSADO
- DOI:
- Published: 1998
- Journal:
- Impact factor: 0
- Authors: David Lowenthal; Tradução Lúcia Haddad; Revisão técnica Mariana Maluf
- Corresponding author: Revisão técnica Mariana Maluf
Cardiac Response to Exercise in Health and Disease
- DOI: 10.1055/s-2007-1006312
- Published: 1993
- Journal:
- Impact factor: 0
- Authors: David Lowenthal; Michael Pollock
- Corresponding author: Michael Pollock
From harmony of the spheres to national anthem: Reflections on musical heritage
- DOI: 10.1007/s10708-006-0008-y
- Published: 2006-02-01
- Journal:
- Impact factor: 1.900
- Authors: David Lowenthal
- Corresponding author: David Lowenthal
A case report of Tubulo-Interstitial Nephritis with Uveitis (TINU syndrome) and follow-up for one year
- DOI: 10.1023/a:1025657713078
- Published: 2002-01-01
- Journal:
- Impact factor: 1.900
- Authors: Chadi Alkhalil; Fawad A. Tanvir; Abdurahman Ahmed; David Lowenthal
- Corresponding author: David Lowenthal
Social Origins of Dictatorship and Democracy: Lord and Peasant in the Making of the Modern World
- DOI: 10.2307/2575331
- Published: 1967
- Journal:
- Impact factor: 0
- Authors: David Lowenthal; Barrington. Moore
- Corresponding author: Barrington. Moore
Other grants held by David Lowenthal
Collaborative Research: SHF: Medium: Co-Optimizing Computation and Data Transformations for Sparse Tensors
- Award number: 2106621
- Fiscal year: 2022
- Funding amount: $250,300
- Project type: Continuing Grant
CSR: Rethinking System Software for Overprovisioned, High-Performance Computing Systems
- Award number: 1526015
- Fiscal year: 2015
- Funding amount: $250,300
- Project type: Standard Grant
CSR: Small:Conductor: A Run-Time System for Exascale Computing
- Award number: 1216829
- Fiscal year: 2012
- Funding amount: $250,300
- Project type: Standard Grant
CSR-PSCE, SM: MPI-PPA: Improving Efficiency of Large-Scale Clusters Through Statistical Performance Prediction
- Award number: 0936251
- Fiscal year: 2009
- Funding amount: $250,300
- Project type: Continuing Grant
CSR-PSCE, SM: MPI-PPA: Improving Efficiency of Large-Scale Clusters Through Statistical Performance Prediction
- Award number: 0834356
- Fiscal year: 2008
- Funding amount: $250,300
- Project type: Continuing Grant
Collaborative Research: Efficient Detection and Alleviation of Scalability Problems
- Award number: 0429285
- Fiscal year: 2004
- Funding amount: $250,300
- Project type: Standard Grant
SOFTWARE: Heterogeneous Cluster MPI: A System for Out-Of-Core, Heterogeneous Data Distribution
- Award number: 0234285
- Fiscal year: 2003
- Funding amount: $250,300
- Project type: Continuing Grant
Instrumentation Grant for Research in Parallel and Distributed Computing
- Award number: 9986032
- Fiscal year: 2000
- Funding amount: $250,300
- Project type: Standard Grant
Career: An Integrated Compiler/Run-Time System for Global Data Distribution
- Award number: 9733063
- Fiscal year: 1998
- Funding amount: $250,300
- Project type: Continuing Grant
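Similar NSFC (National Natural Science Foundation of China) grants
Research on Quantum Field Theory without a Lagrangian Description
- Award number: 24ZR1403900
- Approval year: 2024
- Funding amount: CNY 0
- Project type: Provincial/municipal project
Cell Research
- Award number: 31224802
- Approval year: 2012
- Funding amount: CNY 240,000
- Project type: Special fund project
Cell Research
- Award number: 31024804
- Approval year: 2010
- Funding amount: CNY 240,000
- Project type: Special fund project
Cell Research
- Award number: 30824808
- Approval year: 2008
- Funding amount: CNY 240,000
- Project type: Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
- Award number: 10774081
- Approval year: 2007
- Funding amount: CNY 450,000
- Project type: General Program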
Similar overseas grants
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
- Award number: 2403312
- Fiscal year: 2024
- Funding amount: $250,300
- Project type: Standard Grant
Collaborative Research: OAC CORE: Federated-Learning-Driven Traffic Event Management for Intelligent Transportation Systems
- Award number: 2414474
- Fiscal year: 2024
- Funding amount: $250,300
- Project type: Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
- Award number: 2402947
- Fiscal year: 2024
- Funding amount: $250,300
- Project type: Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
- Award number: 2403313
- Fiscal year: 2024
- Funding amount: $250,300
- Project type: Standard Grant
Collaborative Research: OAC Core: Large-Scale Spatial Machine Learning for 3D Surface Topology in Hydrological Applications
- Award number: 2414185
- Fiscal year: 2024
- Funding amount: $250,300
- Project type: Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
- Award number: 2402946
- Fiscal year: 2024
- Funding amount: $250,300
- Project type: Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
- Award number: 2403088
- Fiscal year: 2024
- Funding amount: $250,300
- Project type: Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
- Award number: 2403090
- Fiscal year: 2024
- Funding amount: $250,300
- Project type: Standard Grant
Collaborative Research: OAC: Core: Harvesting Idle Resources Safely and Timely for Large-scale AI Applications in High-Performance Computing Systems
- Award number: 2403399
- Fiscal year: 2024
- Funding amount: $250,300
- Project type: Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
- Award number: 2403089
- Fiscal year: 2024
- Funding amount: $250,300
- Project type: Standard Grant
