In this work, we propose and evaluate a Network-on-Chip (NoC) augmented with lightweight processing elements to provide a lean dataflow-style system. We show that contemporary NoC routers frequently experience long periods of idle time, with less than 10% link utilization in HPC applications. By repurposing the temporal and spatial slack of the NoC, the proposed platform, SnackNoC, computes linear algebra kernels efficiently within the communication layer at minimal additional resource cost. SnackNoC 'snack' application kernels are programmed with a producer-consumer data model that uses the NoC slack to store and transmit intermediate data between processing elements. SnackNoC is demonstrated in a multi-program environment that continually executes linear algebra kernels on the NoC while chip multiprocessor (CMP) applications run on the processor cores. Linear algebra kernels are computed up to 14.2x faster on SnackNoC than on an Intel Haswell-EP x86 processing core. The cost of executing 'snack' kernels in parallel with the CMP applications is a minimal runtime impact of 0.01% to 0.83%, due to the higher link utilization, and an uncore area overhead of 1.1%.
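To make the producer-consumer dataflow model concrete, the following host-side C sketch illustrates the idea described above: each processing element consumes a partial result arriving over an otherwise-idle NoC link, adds its local contribution to a linear algebra kernel, and forwards the intermediate value downstream. This is a minimal illustration only, not the authors' programming interface; the names pe_stage and slack_buf, and the use of plain structs to stand in for intermediate data parked in NoC buffers, are assumptions made for exposition.

/* Minimal sketch (not SnackNoC's actual API): a pipelined dot product in
 * which each "processing element" stage consumes a partial sum from an
 * upstream buffer, adds its local chunk, and produces for the next stage.
 * slack_buf stands in for intermediate data held in idle NoC buffers.      */
#include <stdio.h>

#define N 8                                  /* vector chunk handled per PE */

typedef struct { double value; } slack_buf;  /* hypothetical slack storage  */

/* One PE stage: accumulate a local chunk of the dot product onto the
 * partial sum received from the upstream stage.                            */
static void pe_stage(const double *a, const double *b, int n,
                     const slack_buf *in, slack_buf *out)
{
    double acc = in->value;                  /* consume upstream partial sum */
    for (int i = 0; i < n; ++i)
        acc += a[i] * b[i];
    out->value = acc;                        /* produce for downstream PE    */
}

int main(void)
{
    double a[2 * N], b[2 * N];
    for (int i = 0; i < 2 * N; ++i) { a[i] = i; b[i] = 1.0; }

    slack_buf s0 = { 0.0 }, s1, s2;
    pe_stage(a,     b,     N, &s0, &s1);     /* PE 0: first half of vectors  */
    pe_stage(a + N, b + N, N, &s1, &s2);     /* PE 1: second half            */

    printf("dot product = %f\n", s2.value);  /* 0+1+...+15 = 120             */
    return 0;
}

In the actual system, the staging between pe_stage calls would be realized by the NoC itself: intermediate values travel over links and sit in router buffers during their idle cycles rather than returning to the cores' caches.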