We present techniques for supernodal sparse Cholesky factorization on a hybrid platform consisting of a multicore CPU and a GPU. The techniques are the subtree algorithm, pipelining, and multithreading. The subtree algorithm [15] minimizes PCIe transfers by storing an entire branch of the elimination tree (a tree data structure that describes the workflow of the factorization) in GPU memory, and it also reduces total kernel launch time by launching BLAS kernels in batches. The pipelining technique overlaps the execution of GPU kernels with PCIe data transfers. The multithreading technique [17] creates multiple threads for both the CPU and the GPU to exploit the concurrency in the elimination tree. Our experimental results on a platform consisting of an Intel multicore processor and an Nvidia GPU indicate that applying these techniques yields a significant improvement in performance and energy consumption over CHOLMOD (SuiteSparse 4.5.3), a sparse Cholesky factorization package.
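To make the role of the elimination tree concrete, the following minimal sketch builds one from the nonzero pattern of a small symmetric matrix and groups its nodes into branches that do not depend on each other. It uses the standard elimination-tree construction with path compression, not the subtree algorithm of [15]. The function names `elimination_tree` and `independent_subtrees`, the input layout, and the 5-by-5 example pattern are assumptions made for illustration only.

```python
def elimination_tree(n, lower_pattern):
    """Return the parent array of the elimination tree (-1 marks a root).

    lower_pattern[k] lists the column indices j < k with A[k, j] != 0,
    i.e. the strictly lower-triangular pattern of row k (hypothetical layout).
    """
    parent = [-1] * n
    ancestor = [-1] * n  # path-compressed shortcut towards the current root
    for k in range(n):
        for i in lower_pattern[k]:
            # Walk from i up to the root of its current subtree, redirecting
            # every visited node to k along the way.
            while i != -1 and i < k:
                nxt = ancestor[i]
                ancestor[i] = k
                if nxt == -1:
                    parent[i] = k  # i had no parent yet, so k becomes its parent
                i = nxt
    return parent


def independent_subtrees(parent):
    """Group non-root nodes by the child of a root they descend from.

    Each group is a branch with no dependencies on the other groups, so the
    groups could in principle be factorized concurrently.
    """
    groups = {}
    for v in range(len(parent)):
        if parent[v] == -1:
            continue  # roots are shared by all branches below them
        top = v
        while parent[parent[top]] != -1:
            top = parent[top]
        groups.setdefault(top, []).append(v)
    return groups


# Illustrative 5x5 pattern: A[2,0], A[3,1], A[4,2], A[4,3] are nonzero.
pattern = [[], [], [0], [1], [2, 3]]
parent = elimination_tree(5, pattern)
branches = independent_subtrees(parent)
```

For this pattern the tree has a single root (node 4) with two branches, 0 → 2 and 1 → 3. Neither branch depends on the other, which is the kind of concurrency the multithreading technique targets, and a branch small enough to fit in GPU memory is the kind of unit the subtree algorithm keeps resident on the device.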