Collaborative Research: CSR: Medium: Fortuna: Characterizing and Harnessing Performance Variability in Accelerator-rich Clusters
合作研究:CSR:Medium:Fortuna:表征和利用富含加速器的集群中的性能变异性
基本信息
- 批准号:2312688
- 负责人:
- 金额:$ 66.69万
- 依托单位:
- 依托单位国家:美国
- 项目类别:Continuing Grant
- 财政年份:2023
- 资助国家:美国
- 起止时间:2023-10-01 至 2026-09-30
- 项目状态:未结题
- 来源:
- 关键词:
项目摘要
Large computing clusters, including data centers and supercomputers, are used for a variety of applications including scientific computations and machine learning. Modern compute clusters typically use specialized accelerator hardware to speed up computations. Operators of accelerator-rich clusters aim to have high resource utilization across all users of the cluster. However, these systems are often under-utilized due to performance variability across accelerators; that is, application performance varies across accelerators even when the same application is run on the same type of accelerator. This proposal will develop Fortuna, a set of tools that can be used by cluster operators and researchers to characterize and harness variability across accelerators. First, Fortuna will use new methodologies to characterize how much performance variability exists across a wide range of accelerator hardware. Second, Fortuna will identify which applications are more likely to suffer from performance variability. Finally, Fortuna will include new scheduling mechanisms that can use variability measurements and knowledge about applications to improve utilization.Broader impacts of the proposed research include open-source implementations of algorithms and tools, which will be applicable to many large-scale clusters and lay the groundwork for wider industry adoption. The project will also create course modules on system design principles with heterogeneous hardware and software, based on the tools developed as a part of the proposal. This will teach the next generation of students how to design hardware and software to improve utilization of future systems.This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
大型计算集群,包括数据中心和超级计算机,用于各种应用,包括科学计算和机器学习。现代计算集群通常使用专门的加速器硬件来加速计算。富加速器集群的运营商的目标是在集群的所有用户中实现高资源利用率。然而,由于加速器之间的性能差异,这些系统往往没有得到充分利用;也就是说,即使在相同类型的加速器上运行相同的应用程序,应用程序性能在不同的加速器上也是不同的。该提案将开发Fortuna,这是一套工具,集群运营商和研究人员可以使用它来描述和利用加速器之间的可变性。首先,Fortuna将使用新的方法来描述各种加速器硬件之间存在的性能差异。其次,Fortuna将确定哪些应用程序更容易受到性能变化的影响。最后,Fortuna将包括新的调度机制,可以使用可变性测量和有关应用程序的知识来提高利用率。拟议研究的更广泛影响包括算法和工具的开源实现,这将适用于许多大规模集群,并为更广泛的行业采用奠定基础。本计划亦会以建议书所开发的工具为基础,以异质硬体及软体为基础,建立有关系统设计原则的课程模块。这将教会下一代学生如何设计硬件和软件,以提高未来系统的利用率。该奖项反映了美国国家科学基金会的法定使命,并通过使用基金会的知识价值和更广泛的影响审查标准进行评估,被认为值得支持。
项目成果
期刊论文数量(0)
专著数量(0)
科研奖励数量(0)
会议论文数量(0)
专利数量(0)
数据更新时间:{{ journalArticles.updateTime }}
{{
item.title }}
{{ item.translation_title }}
- DOI:
{{ item.doi }} - 发表时间:
{{ item.publish_year }} - 期刊:
- 影响因子:{{ item.factor }}
- 作者:
{{ item.authors }} - 通讯作者:
{{ item.author }}
数据更新时间:{{ journalArticles.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ monograph.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ sciAawards.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ conferencePapers.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ patent.updateTime }}
Shivaram Venkataraman其他文献
CHAI: Clustered Head Attention for Efficient LLM Inference
CHAI:用于高效 LLM 推理的集群头注意力
- DOI:
- 发表时间:
2024 - 期刊:
- 影响因子:0
- 作者:
Saurabh Agarwal;Bilge Acun;Basil Homer;Mostafa Elhoushi;Yejin Lee;Shivaram Venkataraman;Dimitris Papailiopoulos;Carole - 通讯作者:
Carole
Shivaram Venkataraman的其他文献
{{
item.title }}
{{ item.translation_title }}
- DOI:
{{ item.doi }} - 发表时间:
{{ item.publish_year }} - 期刊:
- 影响因子:{{ item.factor }}
- 作者:
{{ item.authors }} - 通讯作者:
{{ item.author }}
{{ truncateString('Shivaram Venkataraman', 18)}}的其他基金
CAREER: Resource Efficient Systems for Machine Learning on Structured Data
职业:结构化数据机器学习的资源高效系统
- 批准号:
2237306 - 财政年份:2023
- 资助金额:
$ 66.69万 - 项目类别:
Continuing Grant
Collaborative Research: Frameworks: Diamond: Democratizing Large Neural Network Model Training for Science
合作研究:框架:钻石:科学大型神经网络模型训练的民主化
- 批准号:
2311767 - 财政年份:2023
- 资助金额:
$ 66.69万 - 项目类别:
Standard Grant
III: Small: A New Machine Learning Approach for Improved Entity Identification
III:小:改进实体识别的新机器学习方法
- 批准号:
1815538 - 财政年份:2018
- 资助金额:
$ 66.69万 - 项目类别:
Standard Grant
相似国自然基金
Research on Quantum Field Theory without a Lagrangian Description
- 批准号:24ZR1403900
- 批准年份:2024
- 资助金额:0.0 万元
- 项目类别:省市级项目
Cell Research
- 批准号:31224802
- 批准年份:2012
- 资助金额:24.0 万元
- 项目类别:专项基金项目
Cell Research
- 批准号:31024804
- 批准年份:2010
- 资助金额:24.0 万元
- 项目类别:专项基金项目
Cell Research (细胞研究)
- 批准号:30824808
- 批准年份:2008
- 资助金额:24.0 万元
- 项目类别:专项基金项目
Research on the Rapid Growth Mechanism of KDP Crystal
- 批准号:10774081
- 批准年份:2007
- 资助金额:45.0 万元
- 项目类别:面上项目
相似海外基金
Collaborative Research: CSR: Medium: Scaling Secure Serverless Computing on Heterogeneous Datacenters
协作研究:CSR:中:在异构数据中心上扩展安全无服务器计算
- 批准号:
2312206 - 财政年份:2023
- 资助金额:
$ 66.69万 - 项目类别:
Continuing Grant
Collaborative Research: CSR: Medium: Architecting GPUs for Practical Homomorphic Encryption-based Computing
协作研究:CSR:中:为实用的同态加密计算构建 GPU
- 批准号:
2312276 - 财政年份:2023
- 资助金额:
$ 66.69万 - 项目类别:
Continuing Grant
Collaborative Research: CSR: Medium: Fortuna: Characterizing and Harnessing Performance Variability in Accelerator-rich Clusters
合作研究:CSR:Medium:Fortuna:表征和利用富含加速器的集群中的性能变异性
- 批准号:
2312689 - 财政年份:2023
- 资助金额:
$ 66.69万 - 项目类别:
Continuing Grant
Collaborative Research: CSR: Medium: Fortuna: Characterizing and Harnessing Performance Variability in Accelerator-rich Clusters
合作研究:CSR:Medium:Fortuna:表征和利用富含加速器的集群中的性能变异性
- 批准号:
2401244 - 财政年份:2023
- 资助金额:
$ 66.69万 - 项目类别:
Continuing Grant
Collaborative Research: CSR: Small: Expediting Continual Online Learning on Edge Platforms through Software-Hardware Co-designs
协作研究:企业社会责任:小型:通过软硬件协同设计加快边缘平台上的持续在线学习
- 批准号:
2312157 - 财政年份:2023
- 资助金额:
$ 66.69万 - 项目类别:
Standard Grant
Collaborative Research: CSR: Small: Caphammer: A New Security Exploit in Energy Harvesting Systems and its Countermeasures
合作研究:CSR:小型:Caphammer:能量收集系统的新安全漏洞及其对策
- 批准号:
2314681 - 财政年份:2023
- 资助金额:
$ 66.69万 - 项目类别:
Continuing Grant
Collaborative Research: CSR: Small: Cross-layer learning-based Energy-Efficient and Resilient NoC design for Multicore Systems
协作研究:CSR:小型:基于跨层学习的多核系统节能和弹性 NoC 设计
- 批准号:
2321224 - 财政年份:2023
- 资助金额:
$ 66.69万 - 项目类别:
Standard Grant
Collaborative Research: CSR: Medium: Scaling Secure Serverless Computing on Heterogeneous Datacenters
协作研究:CSR:中:在异构数据中心上扩展安全无服务器计算
- 批准号:
2312207 - 财政年份:2023
- 资助金额:
$ 66.69万 - 项目类别:
Continuing Grant
Collaborative Research: CSR: Medium: Adaptive Environmental Awareness for Collaborative Augmented Reality
协作研究:企业社会责任:媒介:协作增强现实的自适应环境意识
- 批准号:
2312760 - 财政年份:2023
- 资助金额:
$ 66.69万 - 项目类别:
Continuing Grant
Collaborative Research: CSR: Small: Caphammer: A New Security Exploit in Energy Harvesting Systems and its Countermeasures
合作研究:CSR:小型:Caphammer:能量收集系统的新安全漏洞及其对策
- 批准号:
2314680 - 财政年份:2023
- 资助金额:
$ 66.69万 - 项目类别:
Continuing Grant