SHF: Small: Empirical Autotuning of Parallel Computation for Scalable Hybrid Systems

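Basic information

Award number:
1527706
Principal investigator:
Jack Dongarra
Amount:
Approx. $450,000
Institution:
University of Tennessee Knoxville
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2015
Funding country:
United States
Period:
2015-07-15 to 2019-06-30
Status:
Completed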

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1527706&HistoricalAwards=false
Keywords:
SHF Small Empirical Autotuning Parallel

Abstract

Today, scientific and engineering computing is synonymous with parallel computing, and applications such as climate modeling, drug design, aircraft design, etc. utilize very large supercomputer installations, with power consumption measured in megawatts, and the cost of electricity measured in millions of dollars. At the same time, every parallel application requires some level of tuning to ensure that the software is mapped appropriately to the hardware. Otherwise, suboptimal performance can lead to lost cycles, kilowatt-hours, and, ultimately, dollars. Tuning the application by making repeated runs is also a wasteful option at very large scale.

The DARE project addresses this problem by tuning the application through modeling and simulation of its behavior at very large scale, rather than actually running it. Therefore, resources required for tuning are marginal compared to those consumed in production runs. DARE is based on the observation that the same approach that replaces a wind tunnel with a computer simulation of the airfoil can be applied to the software itself. Two aspects of today's high-end computing landscape make the DARE work unique: 1) the prevalence of hardware accelerators, such as Graphics Processing Units and Xeon Phi co-processors, and 2) adoption of task-based, dynamic, work scheduling systems as an alternative to traditional, lock-step parallel programming models.

In particular, DARE combines three components into a refinement loop: a hardware analysis component, a kernel modeling component, and a workload simulation component. The role of the hardware analysis component is to extract the basic hardware information, such as processing power and data link speed. The role of the kernel modeling component is to provide performance models of the serial kernels that constitute the building blocks of the parallel program. Finally, the role of the simulation component is to simulate large-scale parallel workloads.

The hardware analysis component gathers the basic knowledge about the system, such as the number of CPU sockets per shared memory node, the number of CPU cores in each socket, the cache hierarchy, existence of hyper-threading, number of NUMA nodes and proximity of CPUs to NUMA nodes, number of GPU accelerators or Xeon Phi co-processors and capacities of their device memories, and the topology and bandwidth of data links, both within each node (buses) and between nodes (network switches). Part of this knowledge can be gathered by using appropriate query APIs, such as hwloc, netloc, PAPI, and those provided in the CUDA SDK, OpenCL SDK, and Xeon Phi SDK. Synthetic tests can be used for parameters that cannot be established in this manner.

Kernels are essentially the serial building blocks of parallel problems. Although kernels are usually characterized by serial control flow, most of the time they already rely on a high degree of data parallelism. Today's CPUs get most of their performance from SIMD parallelism, and GPUs get their performance from massive SIMT parallelism. The role of the kernel modeling component is two-fold: 1) to tune kernels for maximum performance at a given granularity, and 2) to provide the kernel performance model as a function of granularity, which is changing to accommodate parallel execution.

DARE turns to a stochastic time-stepping simulation in order to predict the performance of a dynamic runtime scheduler for two fundamental reasons: 1) building good performance models on the basis of benchmarking actual parallel runs requires a significant number of runs with significant problem sizes, which is simply too time-consuming, and 2) the impact of many tuning parameters is too complex to be modeled by sparsely sampling the tuning space and fitting simple curves or surfaces to the sample points. The answer to the problem is to replace the run with a time-stepping simulation, where a given task-based scheduler is used for assigning tasks to cores, but instead of invoking actual kernel tasks, control is passed to a progress tracking simulation system, which relies on kernel performance models to simulate the execution of the tasks and produce a virtual trace of the simulated execution. The performance advantage is twofold: 1) simulating a single run is much faster than actually making that run, and 2) many simulations can be run in parallel, allowing for fast sweeps through a large parameter search space.

DARE replaces the standard waterfall autotuning process with a process that is incremental and iterative in nature. The power of the DARE approach lies in the mutual refinement loop, where each of the three phases is capable of massively pruning the search space for the other two. As a result, very high quality models can be built for a particular workload, since time is being spent refining the model for the conditions that actually apply, rather than sampling the search space in areas never touched at runtime.
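
The idea of replacing kernel invocations with performance-model lookups can be illustrated with a minimal Python sketch. It is not DARE's code: it uses a deterministic, event-driven list scheduler rather than the stochastic time-stepping simulation the abstract describes. The names `Task`, `kernel_time`, `simulate`, and `sweep`, the textbook flop counts, and the single assumed per-core rate are all hypothetical choices made for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    kernel: str
    deps: tuple = ()


def kernel_time(kernel, tile, gflops):
    """Hypothetical kernel model: assumed flop count for a tile-by-tile
    block divided by an assumed sustained rate (GFLOP/s) of one core."""
    flops = {"potrf": tile**3 / 3, "trsm": tile**3, "gemm": 2 * tile**3}[kernel]
    return flops / (gflops * 1e9)


def simulate(tasks, cores, tile, gflops):
    """Schedule a task graph on identical cores without running any kernel.
    Returns (makespan, trace); trace rows are (task, core, start, finish)."""
    remaining = {t.name: len(t.deps) for t in tasks}
    children = {t.name: [] for t in tasks}
    for t in tasks:
        for d in t.deps:
            children[d].append(t)
    ready = [t for t in tasks if not t.deps]
    free = list(range(cores))
    running = []  # min-heap of (finish_time, seq, core, task)
    clock, seq, trace = 0.0, 0, []
    while ready or running:
        # Start as many ready tasks as there are idle cores.
        while ready and free:
            task, core = ready.pop(0), free.pop(0)
            finish = clock + kernel_time(task.kernel, tile, gflops)
            heapq.heappush(running, (finish, seq, core, task))
            trace.append((task.name, core, clock, finish))
            seq += 1
        # Advance virtual time to the next task completion.
        clock, _, core, done = heapq.heappop(running)
        free.append(core)
        for child in children[done.name]:
            remaining[child.name] -= 1
            if remaining[child.name] == 0:
                ready.append(child)
    return clock, trace


def sweep(tasks, core_counts, tile, gflops):
    """Sweep one tuning parameter (core count) using only simulated runs."""
    return {c: simulate(tasks, c, tile, gflops)[0] for c in core_counts}


# A tiny illustrative task graph: one factorization, two solves, one update.
workload = [
    Task("potrf", "potrf"),
    Task("trsm1", "trsm", ("potrf",)),
    Task("trsm2", "trsm", ("potrf",)),
    Task("gemm", "gemm", ("trsm1", "trsm2")),
]
makespans = sweep(workload, core_counts=(1, 2, 4), tile=100, gflops=1.0)
```

In this sketch `makespans` maps each core count to a predicted run time. Because no kernel is executed, each entry costs only a handful of arithmetic operations, which is the property that makes wide parameter sweeps cheap.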

Project outputs

Journal articles (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Other publications by Jack Dongarra

The co-evolution of computational physics and high-performance computing

DOI:
10.1038/s42254-024-00750-z
Published:
2024-08-23
Journal:
Nature Reviews Physics
Impact factor:
39.500
Authors:
Jack Dongarra;David Keyes
Corresponding author:
David Keyes

hipMAGMA v1.0

DOI:
Published:
2020
Journal:
Impact factor:
0
Authors:
Cade Brown;Ahmad Abdelfattah;Stanimire Tomov;Jack Dongarra
Corresponding author:
Jack Dongarra

The eigenvalue problem for Hermitian matrices with time reversal symmetry

DOI:
10.1016/0024-3795(84)90068-5
Published:
1984
Journal:
Linear Algebra and its Applications
Impact factor:
1.1
Authors:
Jack Dongarra;J. R. Gabriel;D. D. Koelling;James Hardy Wilkinson
Corresponding author:
James Hardy Wilkinson

Analyzing Performance of BiCGStab with Hierarchical Matrix on GPU clusters

DOI:
Published:
2018
Journal:
Impact factor:
0
Authors:
Ichitaro Yamazaki;Ahmad Abdelfattah;Akihiro Ida;Satoshi Ohshima;Stanimire Tomov;Rio Yokota;Jack Dongarra
Corresponding author:
Jack Dongarra

Self-healing network for scalable fault-tolerant runtime environments

DOI:
10.1016/j.future.2009.04.001
Published:
2010-03-01
Journal:
Future Generation Computer Systems
Impact factor:
Authors:
Thara Angskun;Graham Fagg;George Bosilca;Jelena Pješivac-Grbović;Jack Dongarra
Corresponding author:
Jack Dongarra