We present techniques for supernodal sparse Cholesky factorization on a hybrid platform consisting of a multicore CPU and a GPU. The techniques are the subtree algorithm, pipelining, and multithreading. The subtree algorithm [15] minimizes PCIe transfers by storing an entire branch of the elimination tree (a tree data structure that describes the workflow of the factorization) in GPU memory, and it reduces total kernel launch time by launching BLAS kernels in batches. The pipelining technique overlaps the execution of GPU kernels with PCIe data transfers. The multithreading technique [17] creates multiple threads for both the CPU and the GPU to exploit the concurrency available in the elimination tree. Our experimental results on a platform with an Intel multicore processor and an Nvidia GPU indicate that applying these techniques yields a significant improvement in performance and energy consumption over CHOLMOD (SuiteSparse 4.5.3), a sparse Cholesky factorization package.
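To make the role of the elimination tree concrete, the following is a minimal sketch of how such a tree can be computed from the nonzero pattern of a symmetric matrix, using the classical ancestor-compression approach. It is an illustration only, not the implementation used in this work or in CHOLMOD; the function name `elimination_tree` and the input layout `lower_pattern` are assumptions made for this example.

```python
def elimination_tree(n, lower_pattern):
    """Return parent[] of the elimination tree of an n-by-n symmetric matrix.

    lower_pattern[i] lists the column indices j < i with A[i, j] != 0
    (hypothetical input layout chosen for this sketch).
    parent[j] == -1 marks a root.
    """
    parent = [-1] * n
    ancestor = [-1] * n  # compressed paths toward the current root
    for i in range(n):
        for j in lower_pattern[i]:
            # Climb from j toward its current root, redirecting every
            # visited node to i so that later climbs are shorter.
            while j != -1 and j < i:
                next_j = ancestor[j]
                ancestor[j] = i
                if next_j == -1:
                    parent[j] = i  # j was a root; it now hangs under i
                j = next_j
    return parent


# 4-by-4 example: A[2,0], A[3,1] and A[3,2] are the off-diagonal nonzeros.
tree = elimination_tree(4, [[], [], [0], [1, 2]])
print(tree)
```

In this small example, columns 0 and 1 are leaves of different branches, so they have no dependency on each other and can be factorized concurrently. A branch such as the one containing columns 0 and 2 is the kind of unit that a subtree-based scheme could keep resident in GPU memory.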