Research on Accelerating Distributed Machine Learning with In-Network Computation
Project Completion Report
Grant No.: 62002344
Program type: Young Scientists Fund (青年科学基金项目)
Funding amount: CNY 240,000
Principal investigator: 潘恒 (Heng Pan)
Discipline: Computer Networks
Year approved: 2020
Year concluded: 2023
Project status: Concluded
Project participants: 潘恒 (Heng Pan)
Abstract (translated from Chinese)
Distributed machine learning and artificial intelligence have been widely applied in fields such as computer vision, natural language processing, and network operations. As training clusters grow in scale and training tasks grow in complexity, network communication has become one of the key factors limiting distributed-training efficiency. Our preliminary measurement and analysis of Internet companies' training clusters show that the large volume of data exchanged between training nodes and the management of massive concurrent connections within each node are the main sources of the heavy communication overhead of distributed training. To address these problems, this project studies in-network-computation-based acceleration methods for distributed machine learning, exploring and exploiting the network's built-in computing capability to eliminate the network communication bottleneck in distributed training. The project proceeds along three lines: (1) first, study distributed-training workloads and performance bottlenecks, and characterize traffic features and load models, to provide theoretical and empirical guidance for subsequent communication optimization; (2) second, propose an approximate-computation-based gradient aggregation method to reduce the large volume of data exchanged over the network during training; (3) finally, design a connection-compression mechanism based on control/data separation to reduce per-node connection-management overhead. Together, these efforts aim to accelerate distributed machine-learning training.
Abstract (English)
Distributed machine learning and artificial intelligence have been widely applied in areas such as computer vision, natural language processing, and network operations and maintenance. As training clusters grow in scale and training tasks grow in complexity, network communication has become one of the key factors limiting distributed-training efficiency. Based on our prior measurement and analysis of Internet companies' training clusters, the volume of data exchanged between cluster nodes and the maintenance of massive concurrent connections within a single node are the main contributors to the heavy communication overhead. To this end, this project studies accelerating distributed machine learning with in-network computation, exploring how to exploit the endogenous computing capability of the network to eliminate the communication bottleneck in distributed training. The project consists of three parts: (1) First, it studies distributed-training workloads and performance bottlenecks, characterizing traffic features and load models to provide theoretical and empirical guidance for subsequent communication optimization. (2) Second, it proposes an approximate-computation-based gradient aggregation method to reduce the volume of data exchanged over the network during training. (3) Finally, it designs a connection-compression mechanism based on control/data separation to reduce connection-maintenance overhead within a single node. Through these studies, our goal is to accelerate distributed machine-learning training.
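The report itself contains no code or algorithmic detail for research item (2). As a rough illustration only, the sketch below shows one common way approximate gradient aggregation is made compatible with in-network computation: workers quantize floating-point gradients to fixed-point integers, the switch performs only integer addition (the operation programmable switch ASICs support), and workers dequantize the sum. All names (`SCALE`, `quantize`, `switch_aggregate`) are illustrative assumptions, not the project's actual design.

```python
import numpy as np

SCALE = 1 << 16  # assumed fixed-point scaling factor: 16 fractional bits


def quantize(grad: np.ndarray) -> np.ndarray:
    """Convert float gradients to fixed-point integers for in-switch summation."""
    return np.round(grad * SCALE).astype(np.int64)


def dequantize(total: np.ndarray, n_workers: int) -> np.ndarray:
    """Recover the averaged float gradient after integer aggregation."""
    return total.astype(np.float64) / (SCALE * n_workers)


def switch_aggregate(worker_grads):
    """Element-wise integer addition, standing in for the switch's aggregation."""
    acc = np.zeros_like(worker_grads[0])
    for g in worker_grads:
        acc += g
    return acc


# Example: 4 workers each contribute a small gradient vector.
rng = np.random.default_rng(0)
grads = [rng.standard_normal(8).astype(np.float32) for _ in range(4)]
agg = dequantize(switch_aggregate([quantize(g) for g in grads]), n_workers=4)
exact = np.mean(grads, axis=0)
assert np.allclose(agg, exact, atol=1e-4)  # quantization error is bounded and small
```

The quantization step is what makes the aggregation "approximate": the switch never sees floats, and the error introduced is bounded by the fixed-point resolution.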
Journal Papers
Monographs
Research Awards
Conference Papers
Patents
DOI: 10.1109/tpds.2022.3208425
Published: 2022-12
Journal: IEEE Transactions on Parallel and Distributed Systems
Impact factor: 5.3
Authors: Penglai Cui; H. Pan; Zhenyu Li; Penghao Zhang; Tianhao Miao; Jianer Zhou; Hongtao Guan; Gaogang Xie
Corresponding authors (as listed in the source record): Penglai Cui; H. Pan; Zhenyu Li; Penghao Zhang; Tianhao Miao; Jianer Zhou; Hongtao Guan; Gaogang Xie