SDCI HPC Improvement: IPM - A Performance Monitoring Environment for Petascale High-Performance Computing Systems

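Basic Information

Award Number:
0721397
Principal Investigator:
Laura Carrington
Amount:
--
Institution:
University of California-San Diego
Institution Country:
United States
Award Type:
Standard Grant
Fiscal Year:
2007
Funding Country:
United States
Duration:
2007-09-01 to 2012-08-31
Status:
Completed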

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=0721397&HistoricalAwards=false
Keywords:
SDCI HPC Improvement IPM Performance

Project Abstract

Project Summary: The goal of this project is to improve the utilization of HPC machines at NSF centers and elsewhere via a lightweight performance profiling tool that can identify performance bottlenecks in full-scale applications during production runs. Investments by NSF in Tier1 and Tier2 computers, as well as the ever-growing popularity of smaller clusters in university and industrial settings, offer tremendous opportunities for new scientific discoveries using computational science. Yet experience suggests that many users do not make effective use of these machines, often relying on algorithms, programming tools, or libraries that encounter removable performance bottlenecks. High concurrency, complex processor architectures, fragile compiler optimizations, low-degree network topologies, deep memory hierarchies, load imbalance, and unpredictable performance due to OS noise are among the architectural features that make performance bottlenecks simultaneously easy to encounter and hard to find.

We propose research and development to provide users and system administrators with a tool for identifying performance bottlenecks in production. We will extend and deploy our Integrated Performance Monitoring (IPM) tool for identifying communication bottlenecks, memory system bottlenecks, load imbalances, and other performance problems on systems ranging from small clusters to the petascale. We developed IPM as an ultra-lightweight performance profiling system. The current version is in use at NSF, DOE, and DOD HPC centers. IPM has unique features that make it effective for ongoing monitoring of application performance by system administrators as well as application scientists. The key features of IPM include: a performance profiling strategy that is highly scalable and perturbs performance by less than 5%; integration with a performance database that allows for easy and immediate comparisons across application runs and users; and an easy to recompilation.

Via further development we will provide: 1) a tool for capturing a program's performance data, with special emphasis on low overhead and scalability for up to millions of processors; 2) easy-to-understand application profiles that capture communication volumes and patterns, processor and memory system counter information, and topology-aware counters from network adapters and switches; 3) a database backend for workload characterization and architecture analytics; 4) support for community-driven enhancements through our portable, extensible, open source software.

Intellectual Merit: We will extend IPM's breadth by making it run on more and larger machines and by including additional important performance information. We will thereby enable domain scientists to pinpoint performance issues affecting their applications running on machines with deep memory hierarchies, complex network topologies, and hierarchical parallelism. We will help scientists quickly answer questions such as, "What are the factors affecting the performance of my scientific application?" In addition, our infrastructure will answer fundamental questions about the benefits of architectural features such as one-sided communication, high-degree networks, memory system structures, and processor accelerators. It will also support application performance analysis across petascale systems, automatic and manual performance tuning, and "in situ" analysis of algorithm scalability using a full machine and real input data.

Broader Impact: Through our scalable, portable, and extensible approach we will bring transparency to performance analysis with low overhead. We will enable all HPC stakeholders to assess and improve both applications and architectures, educate users on performance features, and ensure that parallel machines are used productively to answer basic questions in science and engineering. In addition, by providing a close working relationship between domain scientists, NSF centers, and HPC vendors, this project will educate students who are trained in the many facets that impact HPC software and hardware design. A byproduct will be increased understanding of how to optimally use the current and upcoming NSF HPC Tier1 and Tier2 systems portfolio.
