
Graph Neural Network Inference on Multi-FPGA Clusters


Basic information

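Grant number: 2894270
Principal investigator:
Amount: --
Host institution: Imperial College London
Host institution country: United Kingdom
Project category: Studentship
Financial year: 2023
Funding country: United Kingdom
Start and end dates: 2023 to (no data)
Project status: Ongoing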

Source: https://gtr.ukri.org/projects?ref=studentship-2894270
Keywords: Graph Neural Network Inference Multi

Project abstract

Neural networks have been widely deployed to achieve state-of-the-art performance in tasks across various domains, such as image classification, machine translation, and text generation. Such models are typically executed on Graphics Processing Units (GPUs), which are widely commercially available and offer large performance improvements over general-purpose computers due to their deeply parallelized architecture.

With increasing complexity in cutting-edge models, GPUs have shown a performance limitation due to expensive data management mechanisms. In particular, low-latency applications such as high-energy physics or autonomous vehicles show the need for custom hardware to achieve sub-microsecond computation. Field-Programmable Gate Arrays (FPGAs) are a class of integrated circuit well suited to meeting these requirements due to their reconfigurable fabric, and have been shown to achieve up to 10x latency and throughput improvements over GPU counterparts, with orders of magnitude lower power consumption. Additionally, this reconfigurability gives FPGAs the flexibility to perform fine-grained optimizations in the network implementation.

In recent times, Graph Neural Networks (GNNs) have attracted great attention due to their classification performance on non-Euclidean data, such as in social networks, drug discovery and recommendation systems. FPGA acceleration proves particularly beneficial for GNNs given their irregular memory access patterns, which result from the sparse structure of graphs. These unique compute requirements have been addressed by several FPGA accelerators in the literature. Despite the benefits of inference on reconfigurable logic, high-end FPGAs are still limited by resource availability on-chip. This challenge can be addressed by FPGA clusters, which connect multiple devices through high-speed interconnects. This offers the ability to scale inference performance approximately linearly with the number of devices connected in the network. The approach has been explored in the literature to accelerate Convolutional Neural Networks (CNNs) through dedicated layer partitioning approaches.

Although this method has proved effective for CNN acceleration, GNNs offer an unexplored problem setting. GNNs have an inherently shallower structure than CNNs, since the number of layers corresponds to the number of hops across which node features propagate. As such, my research aims to demonstrate that GNN inference on FPGA clusters benefits most from partitioning along the graph dimension rather than the layer dimension.

Several graph partitioning approaches have been proposed in the literature; a naïve approach involves splitting the adjacency matrix into regular node intervals. Alternatively, dynamic sliding-window based approaches consider the graph data, leading to denser partitions and higher spatial locality. In real-time applications, the latency of this pre-processing step needs to be traded off against the added throughput in node feature transformations per layer. With any given partitioning scheme, a distributed node transformation engine requires careful consideration of data coherency, a classic problem in computer architecture. Distributing feature updates across several devices, each with dedicated memory components, creates the need for "residual" connections between devices so that messages can be computed. Various hardware optimisations could then be explored to limit the overhead of inter-device communication.

In conclusion, as the demand for efficient hardware acceleration grows beyond traditional GPUs, FPGAs present a compelling solution. However, scalability challenges in high-end FPGAs prompt the exploration of FPGA clusters. For GNNs, the proposal to shift from layer to graph partitioning in FPGA clusters shows promise, but refining partitioning strategies and addressing data coherency are critical for unlocking the full potential.
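
The naïve interval partitioning mentioned in the abstract, and the cross-device feature traffic it implies, can be illustrated with a minimal Python sketch. This is not the project's implementation: the function names (`interval_partition`, `cut_edges`, `remote_features_needed`) and the 8-node toy graph are hypothetical and chosen only for illustration, and the graph is assumed to be undirected, so each edge carries a message in both directions.

```python
def interval_partition(num_nodes, num_devices):
    """Assign each node to a device by splitting the node index range
    0..num_nodes-1 into contiguous, near-equal intervals.
    Returns a list where owner[i] is the device holding node i."""
    base, extra = divmod(num_nodes, num_devices)
    owner = []
    for device in range(num_devices):
        # The first `extra` devices take one additional node each.
        size = base + (1 if device < extra else 0)
        owner.extend([device] * size)
    return owner


def cut_edges(edges, owner):
    """Return the edges whose two endpoints are held by different devices."""
    return [(u, v) for u, v in edges if owner[u] != owner[v]]


def remote_features_needed(edges, owner):
    """For each device, collect the remote nodes whose features it must
    receive before it can aggregate messages for its own nodes.
    Edges are treated as undirected (assumption for this sketch)."""
    needed = {}
    for u, v in edges:
        if owner[u] != owner[v]:
            needed.setdefault(owner[v], set()).add(u)
            needed.setdefault(owner[u], set()).add(v)
    return needed


# Hypothetical toy graph: an 8-node ring plus one chord (2, 6).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5),
         (5, 6), (6, 7), (0, 7), (2, 6)]

owner = interval_partition(num_nodes=8, num_devices=2)
cut = cut_edges(edges, owner)
needed = remote_features_needed(edges, owner)

print("owner per node:", owner)
print("edges crossing devices:", cut)
print("remote features per device:", needed)
```

Every edge in `cut` corresponds to a node feature that has to travel over the interconnect once per layer, which is the communication overhead that a data-aware partitioning scheme would try to reduce relative to plain intervals.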
