
Algorithmic Support for Massive Scale Distributed Systems

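Basic Information

Grant number:
EP/T01461X/1
Principal investigator:
Natalia Shakhlevich
Amount:
$1.2878 million
Host institution:
University of Leeds
Host institution country:
United Kingdom
Project category:
Research Grant
Fiscal year:
2020
Funding country:
United Kingdom
Duration:
2020 to (no data)
Project status:
Not concluded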

Source:
https://gtr.ukri.org/projects?ref=EP%2FT01461X%2F1
Keywords:
Algorithmic Support Massive Scale Distributed

Project Abstract

Resource scheduling in massive-scale distributed systems is the process of matching demand with supply. Demand is associated with requests for resources to execute workloads, such as jobs, tasks and applications. Typical resources in a distributed computing system include servers within a data centre cluster. A scheduler aims to achieve several goals, for example, to maximise system throughput, to minimise response time and to optimise energy usage. These goals may conflict (e.g. throughput versus latency), and the scheduler needs to make a suitable compromise, depending on the user's needs and objectives.

In a data centre system with hundreds of thousands of distributed servers, the massive scale is characterised by a number of factors that contribute to system complexity:
- the number of server nodes in the cluster, interconnections between resources and heterogeneity of resources (different types of CPU, memory and local storage);
- the number of concurrent jobs in the system and their arrival rate;
- heterogeneity of jobs (different requirements for CPU, memory and local storage; different patterns of resource usage; long-running jobs vs short-lived jobs; urgent jobs vs jobs with loose deadlines).

The key requirement for the system is scalability: the ability to sustain the required throughput level (such as operations per second) while keeping perceived response latencies at a level similar to that of a small or medium-sized system.

In our project, we aim to address the following challenges:
(a) scheduling at scale (to make prompt scheduling decisions at a rapid rate);
(b) resource utilisation at scale (to improve utilisation of resources while maintaining high quality of service);
(c) Quality-of-Service provision at scale (to satisfy the requirements of diverse workloads).

Existing scheduling algorithms developed for practical systems are often designed largely on the basis of empirical knowledge, experience and best effort. Because they lack a theoretical foundation, the performance of those algorithms cannot always be guaranteed. On the other hand, scheduling algorithms proposed by the theoretical community are usually based on oversimplified abstract system models. Theoretically sound algorithms, with guaranteed accuracy and time complexity, are often impractical because their system models do not reflect the complexity of real systems, and even minor adjustments of the models towards real systems make the algorithms no longer applicable.

In our project, theoretical and applied experts will combine efforts to conduct a joint interdisciplinary study, overcoming the shortcomings of isolated research. Overall, our project is 1) methodologically driven, attempting to extend the applicability of the most powerful techniques of mathematical optimisation; 2) application driven, where the challenges of massive-scale distributed systems prompt new developments in scheduling methodology; and 3) practice driven, where the research direction is based on the hands-on experience of distributed systems specialists.
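
As an illustration only, the idea of matching demand (jobs) with supply (heterogeneous servers) can be sketched with a simple greedy best-fit heuristic. This is not the project's method. The class names `Server` and `Job`, the function `best_fit`, the two resource dimensions and the "urgent jobs first" ordering are all assumptions made for this sketch. It is exactly the kind of empirical, best-effort rule the abstract describes, with no performance guarantee.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    cpu: float  # free CPU cores (hypothetical unit)
    mem: float  # free memory in GiB (hypothetical unit)


@dataclass
class Job:
    name: str
    cpu: float  # requested CPU cores
    mem: float  # requested memory in GiB
    urgent: bool = False


def best_fit(jobs, servers):
    """Greedy best-fit placement; mutates the free capacity of `servers`.

    Returns a dict mapping job name to server name, or None if no server fits.
    """
    placement = {}
    # Assumed policy: urgent jobs first, then larger requests first.
    ordered = sorted(jobs, key=lambda j: (not j.urgent, -(j.cpu + j.mem)))
    for job in ordered:
        feasible = [s for s in servers if s.cpu >= job.cpu and s.mem >= job.mem]
        if not feasible:
            placement[job.name] = None  # demand cannot be met right now
            continue
        # Pick the server with the least leftover capacity after placement.
        # Summing cores and GiB is a crude score, used only for illustration.
        target = min(feasible, key=lambda s: (s.cpu - job.cpu) + (s.mem - job.mem))
        target.cpu -= job.cpu
        target.mem -= job.mem
        placement[job.name] = target.name
    return placement


if __name__ == "__main__":
    servers = [Server("a", 4, 16), Server("b", 8, 32)]
    jobs = [Job("j1", 2, 4), Job("j2", 8, 16, urgent=True), Job("j3", 4, 16)]
    print(best_fit(jobs, servers))
```

Each decision scans every server, so the cost per job grows linearly with cluster size. That makes the sketch a small instance of the "scheduling at scale" challenge listed in (a): a rule this simple is easy to run on a handful of machines but would need a different design for hundreds of thousands of servers.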
