CAREER: Self-tuning Parallel Software and Systems
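Basic information
- Grant number: 2047120
- Principal investigator:
- Amount: $550,700
- Host institution:
- Host institution country: United States
- Project type: Continuing Grant
- Fiscal year: 2021
- Funding country: United States
- Start and end dates: 2021-03-01 to 2026-02-28
- Project status: Ongoing
- Source:
- Keywords:
Project abstract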
Recent advances in machine learning (ML) are driving scientific discovery across many disciplines. This presents a unique opportunity for the parallel computing community to remove human guesswork from the performance engineering loop and instead use data-driven ML models for performance modeling, forecasting and tuning. Analytics of data about software performance and the operational efficiency of parallel systems can be used to identify performance anomalies and their root causes. This can transform the process of optimizing the performance of parallel software and the operational efficiency of parallel systems. By using data-driven statistical modeling based on machine learning, the impact of human error in the process can be minimized, and parallel software and systems can become truly self-tuning. This work is leveraging and contributing to the growing body of work on ML for Systems, and brings its benefits to extreme-scale parallel software and systems. The project is also engaging high school students, training undergraduate and graduate students in parallel computing, and preparing them for a career in HPC, to address a significant shortage of computer and computational scientists in HPC in both industry and the national laboratories. The project is applying statistical and ML algorithms to analyze performance data, and using the trained models and insights to enable the self-tuning of performance of parallel software and systems. This work is developing a holistic methodology for accomplishing the following tasks: (1) analyze large volumes of software and system data collected over time, (2) apply machine learning to model application and system behavior, and (3) use these models to guide application, runtime and system optimization decisions that impact future executions.
This holistic approach of data-driven self-tuning can significantly improve the performance and portability of parallel software, and the operational efficiency of HPC and data center systems, even as codes and systems evolve. Better performance of individual jobs leads to faster science results and increased job throughput. This work is making advances in three key areas. First, it is developing ML-based mechanisms to model the performance of parallel software and using such models to automatically optimize that performance by selecting high-performance configurations. Second, it is developing automated methods to analyze large-scale longitudinal monitoring data from parallel systems, along with mechanisms that use trained ML models to automatically tune the operation of those systems. Third, the first two thrusts can be used to automatically tune the performance of parallel codes as they are ported to new or future architectures, by using techniques such as transfer learning. This project is leading to the development of a suite of techniques and frameworks to analyze performance-related data gathered at different levels (job, system and facility) and to make decisions that optimize various metrics related to operational efficiency. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
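The first thrust described above (model performance from collected data, then select a high-performance configuration) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the project: the synthetic measurements, the function names, the single tunable parameter (thread count), and the choice of a simple least-squares model of the form a/threads + b*threads + c in place of whatever ML models the project actually uses.

```python
import numpy as np


def features(threads):
    """Hypothetical feature map: parallel-work term, overhead term, constant."""
    t = np.asarray(threads, dtype=float)
    return np.column_stack([1.0 / t, t, np.ones_like(t)])


def fit_runtime_model(threads, runtimes):
    """Fit runtime ~ a/threads + b*threads + c by ordinary least squares."""
    targets = np.asarray(runtimes, dtype=float)
    coeffs, *_ = np.linalg.lstsq(features(threads), targets, rcond=None)
    return coeffs


def predict_runtime(coeffs, threads):
    """Predict runtimes for the given thread counts using the fitted model."""
    return features(threads) @ coeffs


def select_configuration(coeffs, candidates):
    """Return the candidate thread count with the lowest predicted runtime."""
    predicted = predict_runtime(coeffs, candidates)
    return int(np.asarray(candidates)[np.argmin(predicted)])


# Synthetic "past executions" (illustrative numbers, not real measurements).
observed_threads = [1, 2, 4, 8, 64]
observed_runtimes = [100.0 / t + 0.5 * t + 2.0 for t in observed_threads]

# Step 1-2: learn a performance model from the collected data.
coeffs = fit_runtime_model(observed_threads, observed_runtimes)

# Step 3: use the model to choose a configuration for a future execution,
# including thread counts that were never measured.
best_threads = select_configuration(coeffs, [1, 2, 4, 8, 16, 32, 64])
print(best_threads)
```

The point of the sketch is the loop structure rather than the model: measurements from earlier runs feed a predictor, and the predictor, not a person, ranks the candidate configurations for the next run.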
Project outputs
Journal papers (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Resource Utilization Aware Job Scheduling to Mitigate Performance Variability
- DOI: 10.1109/ipdps53621.2022.00040
- Year published: 2022
- Journal:
- Impact factor: 0
- Authors: Nichols, Daniel; Marathe, Aniruddha; Shoga, Kathleen; Gamblin, Todd; Bhatele, Abhinav
- Corresponding author: Bhatele, Abhinav
Other publications by Abhinav Bhatele
関数データに対する半教師付き判別問題について
On the semi-supervised discrimination problem for functional data
- DOI:
- Year published: 2016
- Journal:
- Impact factor: 0
- Authors: Takatsugu Ono; Yuta Kakibuka; Nikhil Jain; Abhinav Bhatele; Shinobu Miwa; Koji Inoue; Yasuhiro Ogasahara; 寺田吉壱
- Corresponding author: 寺田吉壱
Extending A Network Simulator for Power/Performance Prediction of Large Scale Interconnection Networks
- DOI:
- Year published: 2018
- Journal:
- Impact factor: 0
- Authors: Takatsugu Ono; Yuta Kakibuka; Nikhil Jain; Abhinav Bhatele; Shinobu Miwa; Koji Inoue
- Corresponding author: Koji Inoue
Other grants held by Abhinav Bhatele
Travel: Student Support for IEEE Cluster 2023 Conference
- Grant number: 2323232
- Fiscal year: 2023
- Funding amount: $550,700
- Project type: Standard Grant
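Similar NSFC (National Natural Science Foundation of China) grants
Role and mechanism of self-DNA-mediated abnormal differentiation of CD4+ tissue-resident memory T cells (Trm) in the pathogenesis of lupus nephritis
- Grant number: 82371813
- Year approved: 2023
- Funding amount: CNY 500,000
- Project type: General Program
Mechanism of self-DNA-induced disease resistance responses in postharvest peach fruit, based on integrated receptor recognition and transport
- Grant number: 32302161
- Year approved: 2023
- Funding amount: CNY 300,000
- Project type: Young Scientists Fund
Experimental study of self-testing of multipartite quantum states based on generalized measurements
- Grant number:
- Year approved: 2021
- Funding amount: CNY 300,000
- Project type: Young Scientists Fund
Rigidity of self-shrinkers and related problems
- Grant number:
- Year approved: 2019
- Funding amount: CNY 100,000
- Project type: Provincial/municipal project
Targeted MR imaging of tumor vasculature using a highly sensitive MR molecular probe built from self-peptide and Fe5C2
- Grant number: 81501521
- Year approved: 2015
- Funding amount: CNY 180,000
- Project type: Young Scientists Fund
Structure of noncompact self-shrinkers in mean curvature flow
- Grant number: 11301190
- Year approved: 2013
- Funding amount: CNY 220,000
- Project type: Young Scientists Fund
Study of the self-shrinker problem for mean curvature flow in 2-dimensional pseudo-Euclidean space
- Grant number: 11126152
- Year approved: 2011
- Funding amount: CNY 30,000
- Project type: Tianyuan Fund for Mathematics
Self-directed assembly of crystalline bridged polysilsesquioxanes and their luminescent properties
- Grant number: 21171046
- Year approved: 2011
- Funding amount: CNY 550,000
- Project type: General Program
Role and mechanism of the bundling protein Fascin1 in the "self-seeding" process of lung cancer
- Grant number: 81001041
- Year approved: 2010
- Funding amount: CNY 220,000
- Project type: Young Scientists Fund
Study of self-subunit swapping, a novel post-translational regulation system for industrial nitrile hydratase
- Grant number: 31070711
- Year approved: 2010
- Funding amount: CNY 350,000
- Project type: General Program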
Similar overseas grants
Adaptive optimization: parameter-free self-tuning algorithms beyond smoothness and convexity
- Grant number: 24K20737
- Fiscal year: 2024
- Funding amount: $550,700
- Project type: Grant-in-Aid for Early-Career Scientists
Early-stage embryo as an active self-tuning soft material
- Grant number: EP/W023806/1
- Fiscal year: 2022
- Funding amount: $550,700
- Project type: Research Grant
Self-Tuning Controllers via Deep Reinforcement Learning
- Grant number: 546972-2020
- Fiscal year: 2022
- Funding amount: $550,700
- Project type: Alexander Graham Bell Canada Graduate Scholarships - Doctoral
Early-stage embryo as an active self-tuning soft material
- Grant number: EP/W023849/1
- Fiscal year: 2022
- Funding amount: $550,700
- Project type: Research Grant
Early-stage embryo as an active self-tuning soft material
- Grant number: EP/W023946/1
- Fiscal year: 2022
- Funding amount: $550,700
- Project type: Research Grant
Parameter-Free Stochastic Gradient Descent: Fast, Self-Tuning Algorithms for Training Deep Neural Networks
- Grant number: 547242-2020
- Fiscal year: 2022
- Funding amount: $550,700
- Project type: Postgraduate Scholarships - Doctoral
A Self-Tuning Liquid Metal Coil Conforming to Movement for High-Resolution Brachial Plexus MRI
- Grant number: 10453862
- Fiscal year: 2022
- Funding amount: $550,700
- Project type:
A Self-Tuning Liquid Metal Coil Conforming to Movement for High-Resolution Brachial Plexus MRI
- Grant number: 10621375
- Fiscal year: 2022
- Funding amount: $550,700
- Project type:
Parameter-Free Stochastic Gradient Descent: Fast, Self-Tuning Algorithms for Training Deep Neural Networks
- Grant number: 547242-2020
- Fiscal year: 2021
- Funding amount: $550,700
- Project type: Postgraduate Scholarships - Doctoral
The self-tuning brain: cellular and circuit mechanisms of behavioral resilience
- Grant number: 10405344
- Fiscal year: 2021
- Funding amount: $550,700
- Project type: