CAREER: Self-tuning Parallel Software and Systems

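Basic information

  • Award number:
    2047120
  • Principal investigator:
  • Amount:
    $550,700
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Continuing Grant
  • Fiscal year:
    2021
  • Funding country:
    United States
  • Period:
    2021-03-01 to 2026-02-28
  • Project status:
    Ongoing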

Project abstract

Recent advances in machine learning (ML) are driving scientific discovery across many disciplines. This presents a unique opportunity for the parallel computing community to remove the human, and the associated guesswork, from the performance engineering loop and instead use data-driven ML models for performance modeling, forecasting and tuning. Analytics of data about software performance and the operational efficiency of parallel systems can be used to identify performance anomalies and their root causes. This can transform the process of optimizing the performance of parallel software and the operational efficiency of parallel systems. By using data-driven statistical modeling based on machine learning, the impact of human errors in the process can be minimized, and parallel software and systems can become truly self-tuning. This work is leveraging and contributing to the growing body of work on ML for Systems, and brings its benefits to extreme-scale parallel software and systems. The project is also engaging high school students, and training undergraduate and graduate students in parallel computing and preparing them for a career in HPC, to address a significant shortage of computer and computational scientists in HPC in both industry and the national laboratories. The project is applying statistical and ML algorithms to analyze performance data, and using the trained models and insights to enable the self-tuning of performance of parallel software and systems. This work is developing a holistic methodology for accomplishing the following tasks: (1) analyze large volumes of software and system data collected over time, (2) apply machine learning to model application and system behavior, and (3) use these models to guide application, runtime and system optimization decisions that impact future executions.
This holistic approach of data-driven self-tuning can significantly improve the performance and portability of parallel software, and the operational efficiency of HPC and data center systems, even as codes and systems evolve. Better performance of individual jobs leads to faster science results and increased job throughput. This work is making advances in three key areas. First, the development of ML-based mechanisms to model the performance of parallel software, and the use of such models to automatically optimize that performance by selecting high-performance configurations. Second, the development of automated methods to analyze large-scale longitudinal monitoring data from parallel systems, and of mechanisms that use trained ML models to automatically tune the operation of those systems. Finally, the first two thrusts can be used to automatically tune the performance of parallel codes as they are ported to new or future architectures, using techniques such as transfer learning. This project is leading to the development of a suite of techniques and frameworks to analyze performance-related data gathered at different levels (job, system and facility) and to make decisions for optimizing various metrics related to operational efficiency. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project outputs

Journal articles (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Resource Utilization Aware Job Scheduling to Mitigate Performance Variability
  • DOI:
    10.1109/ipdps53621.2022.00040
  • Publication year:
    2022
  • Journal:
  • Impact factor:
    0
  • Authors:
    Nichols, Daniel; Marathe, Aniruddha; Shoga, Kathleen; Gamblin, Todd; Bhatele, Abhinav
  • Corresponding author:
    Bhatele, Abhinav

Other publications by Abhinav Bhatele

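関数データに対する半教師付き判別問題について
On the semi-supervised discrimination problem for functional data
  • DOI:
  • Publication year:
    2016
  • Journal:
  • Impact factor:
    0
  • Authors:
    Takatsugu Ono; Yuta Kakibuka; Nikhil Jain; Abhinav Bhatele; Shinobu Miwa; Koji Inoue; Yasuhiro Ogasahara; 寺田吉壱
  • Corresponding author:
    寺田吉壱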
Extending A Network Simulator for Power/Performance Prediction of Large Scale Interconnection Networks
  • DOI:
  • Publication year:
    2018
  • Journal:
  • Impact factor:
    0
  • Authors:
    Takatsugu Ono; Yuta Kakibuka; Nikhil Jain; Abhinav Bhatele; Shinobu Miwa; Koji Inoue
  • Corresponding author:
    Koji Inoue

Other grants of Abhinav Bhatele

Travel: Student Support for IEEE Cluster 2023 Conference
  • Award number:
    2323232
  • Fiscal year:
    2023
  • Funding amount:
    $550,700
  • Project type:
    Standard Grant

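Similar grants from the National Natural Science Foundation of China (NSFC)

Self-homeomorphisms of fibered knots, Floer homology, and the 4-dimensional genus
  • Award number:
    12301086
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund
Role and mechanism of self-DNA-mediated abnormal differentiation of CD4+ tissue-resident memory T cells (Trm) in the pathogenesis of lupus nephritis
  • Award number:
    82371813
  • Approval year:
    2023
  • Funding amount:
    CNY 500,000
  • Project type:
    General Program
"Taking responsibility for one's own health": how health management apps influence user behavior, from an accountability perspective
  • Award number:
    72302199
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund
Mechanism of self-DNA-induced disease resistance responses in postharvest peach fruit, based on integrated receptor recognition and transport
  • Award number:
    32302161
  • Approval year:
    2023
  • Funding amount:
    CNY 300,000
  • Project type:
    Young Scientists Fund
Experimental study of self-testing of multipartite quantum states based on generalized measurements
  • Award number:
    12104186
  • Approval year:
    2021
  • Funding amount:
    CNY 240,000
  • Project type:
    Young Scientists Fund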

Similar overseas grants

Regulation of T cell ligand discrimination by tuning the phosphorylation kinetics of Zap70 substrates
  • Award number:
    10405414
  • Fiscal year:
    2021
  • Funding amount:
    $550,700
  • Project type:
Regulation of T cell ligand discrimination by tuning the phosphorylation kinetics of Zap70 substrates
  • Award number:
    9720665
  • Fiscal year:
    2021
  • Funding amount:
    $550,700
  • Project type:
Alternative splicing regulation by extracellular matrix mechanics: a self-tuning tool to control cell microenvironmental adaptation and tumor progression
  • Award number:
    9224733
  • Fiscal year:
    2017
  • Funding amount:
    $550,700
  • Project type:
CAREER: A Self-Tuning Cache Architecture for Multi-Core Systems
  • Award number:
    0953447
  • Fiscal year:
    2010
  • Funding amount:
    $550,700
  • Project type:
    Continuing Grant
CAREER: A Framework for Dynamic Self-Tuning of General Purpose Programs
  • Award number:
    0347260
  • Fiscal year:
    2004
  • Funding amount:
    $550,700
  • Project type:
    Continuing Grant