Optimal Scheduling of Parallelizable Jobs in Cloud Computing Environments

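Basic Information

  • Award number:
    1938909
  • Principal investigator:
  • Amount:
    $549,500
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2020
  • Funding country:
    United States
  • Project period:
    2020-01-01 to 2022-12-31
  • Project status:
    Completed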

Project Abstract

This award will contribute to the advancement of national prosperity and economic welfare by deriving methods to improve response times for processing jobs in cloud computing environments. Today, data centers of major companies and supercomputing centers are heavily occupied in processing machine learning jobs, where each job occupies multiple servers/cores in parallel. While there is a long history on scheduling for serial jobs, less is known about parallel job scheduling, and existing heuristics perform poorly in the cloud computing environment. This award will develop algorithms for efficiently allocating a finite number of servers across a set of parallelizable jobs. Optimal scheduling is difficult because an individual parallel job receives decreasing marginal benefit from each additional server that it is allocated. The PIs will develop new analytical methods to address this complex scheduling problem. The results of this research are highly relevant to industry and will inform state-of-the-art cloud scheduling systems. Algorithms, protocols, and experimental results will be disseminated via journal publications, online code and open access data repositories. As part of this award, the PIs will provide outreach to middle school girls to increase exposure and skills in mathematics and computing, as well as training of both undergraduates and PhD students in the areas of parallel computing, scheduling, and queueing.

This award will support research on optimally allocating a finite number of servers across parallel jobs, so as to minimize mean flow time, mean slowdown, and related metrics. Motivated by real-world benchmarks and measurements, parallel jobs are modeled via a concave speedup function which specifies the speedup benefit to the job as a function of the number of servers which are allocated to it. The research plan addresses a wide variety of situations, including the case where jobs have different speedup functions, where jobs have different priorities, where job sizes are not known a priori, and where jobs arrive over time. Deriving optimal scheduling strategies will require the development of new analytic techniques to vastly reduce the search space of possible allocations and reveal the structure of the optimal solution. Towards that goal, the award will develop a series of dimensionality reduction techniques, including a scale-free property, a size-invariant property, an online completion order property, a technique for trading off job sizes and different parallelization levels, and a parallel counterpart to the Gittins Index scheduling policy. Results will be validated first via stochastic simulation and then via trace-driven simulation using traces from industry and supercomputing centers.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
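
To make the allocation problem in the abstract concrete, here is a minimal sketch for two jobs sharing a fixed pool of servers. It is not the PIs' algorithm. The power-law speedup `s(k) = k**p`, the two-phase schedule (a fixed split until the first job completes, after which the survivor gets every server), and all function names are assumptions made for illustration only. The sketch brute-forces the integer split that minimizes total flow time.

```python
def speedup(k: float, p: float) -> float:
    """Assumed speedup: s(k) = k**p (concave for 0 < p < 1, linear for p = 1)."""
    return k ** p


def flow_times(x1: float, x2: float, a: int, n: int, p: float):
    """Completion times of two jobs with sizes x1 and x2 on n servers.

    Job 1 runs on `a` servers and job 2 on `n - a` servers until one of
    them finishes; the remaining job then runs on all n servers.
    """
    r1, r2 = speedup(a, p), speedup(n - a, p)
    # Time at which each job would finish at its first-phase rate.
    t1 = x1 / r1 if r1 > 0 else float("inf")
    t2 = x2 / r2 if r2 > 0 else float("inf")
    if t1 <= t2:
        # Job 1 finishes first; job 2 completes its leftover work on n servers.
        return t1, t1 + (x2 - r2 * t1) / speedup(n, p)
    # Job 2 finishes first; job 1 completes its leftover work on n servers.
    return t2 + (x1 - r1 * t2) / speedup(n, p), t2


def best_split(x1: float, x2: float, n: int, p: float):
    """Brute-force the integer split `a` that minimizes total flow time."""
    def total(a: int) -> float:
        return sum(flow_times(x1, x2, a, n, p))
    a = min(range(n + 1), key=total)
    return a, total(a)


if __name__ == "__main__":
    for p in (1.0, 0.5):
        a, cost = best_split(x1=1.0, x2=2.0, n=4, p=p)
        print(f"p={p}: give job 1 {a} of 4 servers, total flow time {cost:.3f}")
```

In this toy instance, linear speedup (`p = 1`) favors giving every server to the smaller job, while a concave speedup (`p = 0.5`) favors splitting the servers between the jobs. This is the diminishing-returns effect that the abstract identifies as the source of difficulty. A brute-force search like this does not scale beyond a handful of jobs and servers, which is why the research plan aims to reduce the search space analytically.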

Project Outcomes

Journal papers (32)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Optimal Scheduling in the Multiserver-job Model under Heavy Traffic
The Gittins Policy is Nearly Optimal in the M/G/k under Extremely General Conditions
The CacheLib Caching Engine: Design and Experiences at Scale
  • DOI:
  • Publication date:
    2020
  • Journal:
  • Impact factor:
    0
  • Authors:
    Benjamin Berg;Daniel S. Berger;Sara McAllister;Isaac Grosof;S. Gunasekar;Jimmy Lu;Michael Uhlar;Jim Carrig;Nathan Beckmann;Mor Harchol-Balter;G. Ganger
  • Corresponding author:
    Benjamin Berg;Daniel S. Berger;Sara McAllister;Isaac Grosof;S. Gunasekar;Jimmy Lu;Michael Uhlar;Jim Carrig;Nathan Beckmann;Mor Harchol-Balter;G. Ganger
Correction to: Multi-server queueing systems with multiple priority classes
  • DOI:
    10.1007/s11134-021-09710-1
  • Publication date:
    2021
  • Journal:
  • Impact factor:
    1.2
  • Authors:
    Harchol-Balter, Mor;Osogami, Takayuki;Scheller-Wolf, Alan;Wierman, Adam
  • Corresponding author:
    Wierman, Adam
Zero Queueing for Multi-Server Jobs

Other publications by Mor Harchol-Balter

Analysis of scheduling policies under correlated job sizes
  • DOI:
    10.1016/j.peva.2010.08.010
  • Publication date:
    2010-11-01
  • Journal:
  • Impact factor:
  • Authors:
    Varun Gupta;Michelle Burroughs;Mor Harchol-Balter
  • Corresponding author:
    Mor Harchol-Balter
Server farms with setup costs
  • DOI:
    10.1016/j.peva.2010.07.004
  • Publication date:
    2010-11-01
  • Journal:
  • Impact factor:
  • Authors:
    Anshul Gandhi;Mor Harchol-Balter;Ivo Adan
  • Corresponding author:
    Ivo Adan
Performance Modeling and Design of Computer Systems: Contents
  • DOI:
    10.1017/cbo9781139226424
  • Publication date:
    2013-02
  • Journal:
  • Impact factor:
    0
  • Authors:
    Mor Harchol-Balter
  • Corresponding author:
    Mor Harchol-Balter
Performance Modeling and Design of Computer Systems: Scheduling: SRPT and Fairness
  • DOI:
    10.1017/cbo9781139226424.041
  • Publication date:
    2013
  • Journal:
  • Impact factor:
    0
  • Authors:
    Mor Harchol-Balter
  • Corresponding author:
    Mor Harchol-Balter
Performance Modeling and Design of Computer Systems: Introduction to Queueing
  • DOI:
    10.1017/cbo9781139226424.002
  • Publication date:
    2013
  • Journal:
  • Impact factor:
    0
  • Authors:
    Mor Harchol-Balter
  • Corresponding author:
    Mor Harchol-Balter

Other grants of Mor Harchol-Balter

Collaborative Research: III: Small: High-Performance Scheduling for Modern Database Systems
  • Award number:
    2322973
  • Fiscal year:
    2024
  • Award amount:
    $549,500
  • Project type:
    Standard Grant
New Approaches to Multiserver Scheduling
  • Award number:
    2307008
  • Fiscal year:
    2023
  • Award amount:
    $549,500
  • Project type:
    Standard Grant
CSR: Medium: Collaborative Research: Foundations of Cache Network Operations for Content Delivery
  • Award number:
    1763701
  • Fiscal year:
    2018
  • Award amount:
    $549,500
  • Project type:
    Continuing Grant
Reducing Latency by Replicating Jobs
  • Award number:
    1538204
  • Fiscal year:
    2015
  • Award amount:
    $549,500
  • Project type:
    Standard Grant
Priority Pricing for Profit Maximization Given Strategic, Delay-Sensitive Customers with a Continuum of Types
  • Award number:
    1334194
  • Fiscal year:
    2013
  • Award amount:
    $549,500
  • Project type:
    Standard Grant
CSR: Student Travel Support for SIGMETRICS 2013
  • Award number:
    1300202
  • Fiscal year:
    2013
  • Award amount:
    $549,500
  • Project type:
    Standard Grant
CSR: Small: Simple Dynamic Traffic-Oblivious Power Management for Multi-Tier Web Clusters
  • Award number:
    1116282
  • Fiscal year:
    2011
  • Award amount:
    $549,500
  • Project type:
    Standard Grant
COLLABORATIVE RESEARCH: CSR---SMA: New Breakthrough in Analyzing Limited Resource Sharing Systems
  • Award number:
    0719106
  • Fiscal year:
    2007
  • Award amount:
    $549,500
  • Project type:
    Standard Grant
SMA/PDOS Collaborative Research: Design, Analysis, and Control of Adaptive Sharing Mechanisms
  • Award number:
    0615262
  • Fiscal year:
    2006
  • Award amount:
    $549,500
  • Project type:
    Continuing Grant
ITR: Improving the Performance of Web Servers under Overload
  • Award number:
    0313148
  • Fiscal year:
    2003
  • Award amount:
    $549,500
  • Project type:
    Standard Grant

Similar overseas grants

CAREER: Frequency-Constrained Energy Scheduling for Renewable-Dominated Low-Inertia Power Systems
  • Award number:
    2337598
  • Fiscal year:
    2024
  • Award amount:
    $549,500
  • Project type:
    Continuing Grant
Collaborative Research: III: Small: High-Performance Scheduling for Modern Database Systems
  • Award number:
    2322973
  • Fiscal year:
    2024
  • Award amount:
    $549,500
  • Project type:
    Standard Grant
Collaborative Research: III: Small: High-Performance Scheduling for Modern Database Systems
  • Award number:
    2322974
  • Fiscal year:
    2024
  • Award amount:
    $549,500
  • Project type:
    Standard Grant
Revolutionising Surgery Scheduling: an innovative AI-powered health-tech platform enhancing Operating Room efficiency, with an automated schedule unlocking the potential for an additional 10% or 350K surgeries annually in the UK.
  • Award number:
    10095646
  • Fiscal year:
    2024
  • Award amount:
    $549,500
  • Project type:
    Collaborative R&D
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403088
  • Fiscal year:
    2024
  • Award amount:
    $549,500
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403090
  • Fiscal year:
    2024
  • Award amount:
    $549,500
  • Project type:
    Standard Grant
CNS Core: Small: Core Scheduling Techniques and Programming Abstractions for Scalable Serverless Edge Computing Engine
  • Award number:
    2322919
  • Fiscal year:
    2024
  • Award amount:
    $549,500
  • Project type:
    Standard Grant
Differential Evolution Framework for Intelligent Charging Scheduling
  • Award number:
    DP240102317
  • Fiscal year:
    2024
  • Award amount:
    $549,500
  • Project type:
    Discovery Projects
Human Scheduling of Perceptual Tasks
  • Award number:
    DP240100979
  • Fiscal year:
    2024
  • Award amount:
    $549,500
  • Project type:
    Discovery Projects
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403089
  • Fiscal year:
    2024
  • Award amount:
    $549,500
  • Project type:
    Standard Grant