Efficient and reliable coded distributed computing

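Basic information

Grant number:
570977-2021
Principal investigator:
Ardakani, MasoudM
Amount:
approx. $36,400
Host institution:
University of Alberta
Host institution country:
Canada
Program:
Alliance Grants
Fiscal year:
2022
Funding country:
Canada
Start and end dates:
2022-01-01 to 2023-12-31
Project status:
Completed

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=751758
Keywords:
Efficient reliable coded distributed computing

Project abstract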

Many modern ICT applications work with data at scale and demand massive computations that cannot be performed on a single computer. This has led to the wide use of distributed computing, where a massive computational task is distributed among a large number of computing nodes in a communication network. In real life, some of these computing nodes fail to deliver their tasks because of software or hardware failures, because they are handling other tasks for the network, because they leave the network, and so on. These straggling nodes (typically around 5% of the processing nodes) result in unpredictable network performance and can significantly prolong job completion. Currently, redundancy in the form of repeating tasks is used to combat stragglers.

Error-correcting codes offer an opportunity to combat stragglers at a much lower cost, with reduced communication load, a higher success rate, and added security/privacy benefits. They also make it possible for the network to use a large number of very low-cost hardware devices to reliably finish a massive job in a short time. In this project, (i) we will design various low-complexity error-correction coding algorithms that are feasible for large-scale distributed computing, hence enabling the network to handle data at scale reliably; (ii) we will design task scheduling algorithms that optimally distribute and schedule the tasks in the network in order to minimize the completion time/cost, with guaranteed success.

We anticipate that this project will significantly improve cloud services by developing coded distributed computation and task allocation/scheduling algorithms that (i) reduce the completion time and communication costs, (ii) use the available resources very efficiently, (iii) have low implementation complexity, and (iv) provide added privacy/security. In addition, our algorithms can be used in real-life applications such as telepresence, telehealth, augmented reality, distributed database management systems, real-time process control, and more.
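
To make the contrast between repeating tasks and coding concrete, here is a minimal sketch of a textbook-style coded matrix-vector multiplication. It is not the project's algorithm; the function names (`matvec`, `encode`, `decode`), the toy matrix, and the choice of a simple three-worker, any-two-suffice parity code are all assumptions made for illustration. The matrix is split into two halves, a third worker receives their sum, and the full product can be rebuilt from whichever two workers finish first.

```python
from itertools import combinations


def matvec(rows, x):
    """Plain matrix-vector product on nested lists (one worker's job)."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]


def encode(a_top, a_bottom):
    """Build three coded tasks: the two halves plus their elementwise sum."""
    parity = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(a_top, a_bottom)]
    return {"top": a_top, "bottom": a_bottom, "parity": parity}


def decode(results):
    """Recover the full product from any two of the three worker results."""
    if "top" in results and "bottom" in results:
        top, bottom = results["top"], results["bottom"]
    elif "top" in results:
        # bottom half = parity result minus top result
        top = results["top"]
        bottom = [s - t for s, t in zip(results["parity"], top)]
    else:
        # top half = parity result minus bottom result
        bottom = results["bottom"]
        top = [s - b for s, b in zip(results["parity"], bottom)]
    return top + bottom


# Hypothetical toy data: a 4x2 matrix and a length-2 vector.
A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, -1]
tasks = encode(A[:2], A[2:])
expected = matvec(A, x)

# Whichever single worker straggles, the other two are enough.
for finished in combinations(tasks, 2):
    results = {name: matvec(tasks[name], x) for name in finished}
    assert decode(results) == expected
```

In this toy setting, tolerating one straggler by repetition would mean running each half twice (four tasks), while the coded version needs three tasks. Practical schemes would use more general codes and many more workers.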
