
CAREER: Self-tuning Parallel Software and Systems

Basic Information

Award number:
2047120
Principal investigator:
Abhinav Bhatele
Amount:
$550,700
Host institution:
University of Maryland, College Park
Host institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2021
Funding country:
United States
Period:
2021-03-01 to 2026-02-28
Project status:
Not yet concluded

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2047120&HistoricalAwards=false
Keywords:
CAREER Self tuning Parallel Software

Project Abstract

Recent advances in machine learning (ML) approaches are driving scientific discovery across many disciplines. This presents a unique opportunity in the parallel computing community to remove the human and associated guesswork in the performance engineering loop, and instead, use data-driven ML models for performance modeling, forecasting and tuning. Analytics of data about software performance and operational efficiency of the parallel systems can be used to identify performance anomalies and their root causes. This can transform the process of optimizing the performance of parallel software and operational efficiency of parallel systems. By using data-driven statistical modeling based on machine learning, the impact of human errors in the process can be minimized, and parallel software and systems can become truly self-tuning. This work is leveraging and contributing to the growing body of work on ML for Systems, and brings its benefits to extreme-scale parallel software and systems. The project is also engaging high school students, training undergraduate and graduate students in parallel computing and preparing them for a career in HPC to address a significant shortage of computer and computational scientists in HPC, both in the industry and national laboratories.

The project is applying statistical and ML algorithms to analyze performance data, and using the trained models and insights to enable the self-tuning of performance of parallel software and systems. This work is developing a holistic methodology for accomplishing the following tasks: (1) analyze large volumes of software and system data collected over time, (2) apply machine learning to model application and system behavior, and (3) use these models to guide application, runtime and system optimization decisions that impact future executions. This holistic approach of data-driven self-tuning can significantly improve the performance and portability of parallel software, and operational efficiency of HPC and data center systems even as codes and systems evolve. Better performance of individual jobs leads to faster science results and increased job throughput.

This work is making advances in three key areas. First, development of ML-based mechanisms to model the performance of parallel software and use of such models to automatically optimize their performance by selecting high-performance configurations. Second, the development of automated methods to analyze large-scale longitudinal monitoring data for analysis of parallel systems, and develop mechanisms to use trained ML models to automatically tune the operation of parallel systems. And finally, the first two thrusts can be used to automatically tune the performance of parallel codes as they are ported to new or future architectures by using techniques such as transfer learning. This project is leading to the development of a suite of techniques and frameworks to analyze performance-related data being gathered at different levels (job, system and facility) and to make decisions for optimizing various operational efficiency related metrics.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
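
The first thrust in the abstract (model the performance of parallel software from collected data, then select a high-performance configuration) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the run history is invented, the tunable parameters (`threads`, `tile`) and the function names are hypothetical, and k-nearest-neighbor regression stands in for whatever models the project actually uses, which the abstract does not specify.

```python
import math

# Hypothetical history of past runs: (threads, tile) -> measured runtime in seconds.
# These values are made up for illustration; they are not results from the project.
HISTORY = [
    ((1, 16), 40.0),
    ((2, 16), 22.0),
    ((4, 16), 13.0),
    ((8, 16), 9.0),
    ((16, 16), 9.5),
    ((4, 64), 10.0),
    ((8, 64), 6.5),
    ((16, 64), 7.0),
]


def features(config):
    """Log-scale both parameters so that each doubling is one unit of distance."""
    threads, tile = config
    return (math.log2(threads), math.log2(tile))


def predict_runtime(config, history, k=3):
    """Predict runtime as the inverse-distance-weighted mean of the k nearest past runs."""
    x = features(config)
    nearest = sorted(
        (math.dist(x, features(c)), runtime) for c, runtime in history
    )[:k]
    # A configuration that was already measured returns its measured runtime.
    if nearest[0][0] == 0.0:
        return nearest[0][1]
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * r for w, (_, r) in zip(weights, nearest)) / sum(weights)


def select_config(candidates, history, k=3):
    """Pick the candidate configuration with the lowest predicted runtime."""
    return min(candidates, key=lambda c: predict_runtime(c, history, k))


if __name__ == "__main__":
    candidates = [(2, 32), (8, 32), (16, 32)]
    best = select_config(candidates, HISTORY)
    print("suggested configuration:", best)
```

In a self-tuning loop of the kind the abstract describes, the runtime measured for the suggested configuration would be appended to the history so that later predictions reflect it.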

Project Outcomes

Journal papers (1)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Resource Utilization Aware Job Scheduling to Mitigate Performance Variability

DOI:
10.1109/ipdps53621.2022.00040
Published:
2022
Journal:
IEEE
Impact factor:
0
Authors:
Nichols, Daniel; Marathe, Aniruddha; Shoga, Kathleen; Gamblin, Todd; Bhatele, Abhinav
Corresponding author:
Bhatele, Abhinav