Collaborative Research: OAC Core: Simulation-driven runtime resource management for distributed workflow applications
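Basic information
- Award number: 2106147
- Principal investigator:
- Amount: $220,000
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2021
- Funding country: United States
- Period: 2021-10-01 to 2025-09-30
- Project status: Ongoing
- Source:
- Keywords: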
Project abstract
Many scientific breakthroughs in domains such as health, climate modeling, particle physics, seismology, etc., can only be achieved by performing complex processing of vast amounts of data. This processing is automated by software systems that use the compute, storage, and network hardware provided by the cyberinfrastructure. In addition to automation, a key objective of these systems is the efficient use of the resources as measured by cost and energy usage, while making the processing as fast as possible or as needed. To this end, these systems must make decisions regarding which resources should be used to do what and when. Many such systems are used in production today and make such decisions. Yet making good, let alone best, decisions is still an open research challenge. Theoretical research has proposed solutions that are difficult to put into practice, and practical solutions are known to not make good decisions, or at least not consistently so. However, both theory and practice follow the same basic philosophy: make decisions by reasoning about known information on what needs to be computed and on what hardware resources are available. This philosophy has shown its limits, so this project adopts a radically different approach. The key idea is to repeatedly execute fast, computationally inexpensive simulations of the application execution in order to evaluate large sets of potential resource management decisions and automatically select the most desirable ones. The benefits of this approach will be demonstrated for several software systems used to support scientific applications that are critical for the development and sustainability of society.
Software systems are used to run scientific applications on advanced cyberinfrastructure. These systems automate application execution and make resource management decisions along several axes, including selecting and provisioning (virtualized) hardware, picking application configuration options, and scheduling application activities in time and space. Their objective is to optimize both application performance and a set of resource usage efficiency metrics that include monetary and energy costs. Consequently, the resource management decision space is enormous, and making good decisions is a steep challenge that has been the subject of countless efforts, both from theoreticians and practitioners. However, the challenge is far from being solved: theoreticians produce solutions that are rarely used by practitioners, and conversely practitioners implement solutions that may be highly sub-optimal because they are not informed by theory. This project resolves this disconnect by obviating the need for developing effective resource management strategies. The key idea is to use online simulations to search the resource management decision space rapidly at runtime. Large numbers of fast simulations of the application's execution are executed throughout that very execution, so as to evaluate many potential resource management options and automatically select desirable ones. This approach thus shifts the overall problem from the design of complex resource management algorithms to the enumeration of many resource management decisions. The transformation of resource management practice in cyberinfrastructure systems not only renders the resource management problem tractable but also unlocks previously out-of-reach resource management decisions. The benefits of this transformation will be demonstrated for a critical class of production systems and applications, specifically Workflow Management Systems and the scientific applications they support.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
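The key idea in the abstract (enumerate many candidate resource management decisions, simulate each one cheaply, keep the most desirable) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the project: the toy simulator `simulate_makespan`, the three ordering policies, the `price` and `cost_weight` parameters, and the scoring formula. A real system would replace the toy simulator with a proper simulation of the workflow and platform.

```python
import heapq
from itertools import product


def simulate_makespan(durations, num_workers):
    """Toy simulator (assumption): independent tasks are dispatched in the
    given order to whichever identical worker becomes free first."""
    free_times = [0.0] * num_workers  # min-heap of worker availability times
    heapq.heapify(free_times)
    for d in durations:
        start = heapq.heappop(free_times)      # earliest available worker
        heapq.heappush(free_times, start + d)  # worker is busy until start + d
    return max(free_times)


# Hypothetical task-ordering policies to enumerate.
ORDERINGS = {
    "fifo": lambda ds: list(ds),
    "shortest_first": lambda ds: sorted(ds),
    "longest_first": lambda ds: sorted(ds, reverse=True),
}


def select_decision(durations, worker_options, price=1.0, cost_weight=0.0):
    """Simulate every (worker count, ordering) pair and keep the lowest score."""
    best = None
    for n, (name, order) in product(worker_options, ORDERINGS.items()):
        makespan = simulate_makespan(order(durations), n)
        cost = n * makespan * price  # assumption: workers are billed for the whole run
        score = makespan + cost_weight * cost
        if best is None or score < best["score"]:
            best = {"workers": n, "ordering": name,
                    "makespan": makespan, "cost": cost, "score": score}
    return best


tasks = [3, 1, 4, 1, 5, 9, 2, 6]
print(select_decision(tasks, worker_options=[1, 2]))
print(select_decision(tasks, worker_options=[1, 2], cost_weight=100.0))
```

With `cost_weight=0.0` the search only minimizes completion time and selects two workers with the longest-first ordering; with a heavy cost weight it falls back to a single worker. No scheduling heuristic is designed by hand here: the decision emerges from enumerating options and scoring their simulated outcomes.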
Project outcomes
Journal articles (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
On the Feasibility of Simulation-driven Portfolio Scheduling for Cyberinfrastructure Runtime Systems
- DOI:
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: Casanova, H.; Wong, Y. C.; Pottier, L.; Ferreira da Silva, R.
- Corresponding author: Ferreira da Silva, R.
Other publications by Ewa Deelman
Mapping Abstract Complex Workflows onto Grid Environments
- DOI: 10.1023/a:1024000426962
- Published: 2003-01-01
- Journal:
- Impact factor: 2.900
- Authors: Ewa Deelman; James Blythe; Yolanda Gil; Carl Kesselman; Gaurang Mehta; Karan Vahi; Kent Blackburn; Albert Lazzarini; Adam Arbree; Richard Cavanaugh; Scott Koranda
- Corresponding author: Scott Koranda
Advancing Anomaly Detection in Computational Workflows with Active Learning
- DOI: 10.48550/arxiv.2405.06133
- Published: 2024
- Journal:
- Impact factor: 0
- Authors: Krishnan Raghavan; George Papadimitriou; Hongwei Jin; A. Mandal; Mariam Kiran; Prasanna Balaprakash; Ewa Deelman
- Corresponding author: Ewa Deelman
A terminology for scientific workflow systems
- DOI: 10.1016/j.future.2025.107974
- Published: 2026-01-01
- Journal:
- Impact factor: 6.100
- Authors: Frédéric Suter; Tainã Coleman; İlkay Altintaş; Rosa M. Badia; Bartosz Balis; Kyle Chard; Iacopo Colonnelli; Ewa Deelman; Paolo Di Tommaso; Thomas Fahringer; Carole Goble; Shantenu Jha; Daniel S. Katz; Johannes Köster; Ulf Leser; Kshitij Mehta; Hilary Oliver; J.-Luc Peterson; Giovanni Pizzi; Loïc Pottier; Rafael Ferreira da Silva
- Corresponding author: Rafael Ferreira da Silva
Broadening Student Engagement To Build the Next Generation of Cyberinfrastructure Professionals
- DOI: 10.1145/3569951.3597567
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Angela Murillo; Don Brower; Sarowar Hossain; K. Kee; A. Mandal; J. Nabrzyski; Erik Scott; Nicole K. Virdone; Rodney Ewing; Ewa Deelman
- Corresponding author: Ewa Deelman
How is Artificial Intelligence Changing Science?
- DOI: 10.1109/e-science58273.2023.10254913
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Ewa Deelman
- Corresponding author: Ewa Deelman
Other grants by Ewa Deelman
Collaborative Research: CyberTraining: Implementation: Medium: CyberInfrastructure Training and Education for Synchrotron X-Ray Science (X-CITE)
- Award number: 2320375
- Fiscal year: 2023
- Amount: $220,000
- Project type: Standard Grant
Collaborative Research: SHF: Small: Model-driven Design and Optimization of Dataflows for Scientific Applications
- Award number: 2331153
- Fiscal year: 2023
- Amount: $220,000
- Project type: Standard Grant
CI CoE: CI Compass: An NSF Cyberinfrastructure (CI) Center of Excellence for Navigating the Major Facilities Data Lifecycle
- Award number: 2127548
- Fiscal year: 2021
- Amount: $220,000
- Project type: Standard Grant
Collaborative Research: Elements: Simulation-driven Evaluation of Cyberinfrastructure Systems
- Award number: 2103508
- Fiscal year: 2021
- Amount: $220,000
- Project type: Standard Grant
Collaborative Research: EAGER: VisDict - Visual Dictionaries for Enhancing the Communication between Domain Scientists and Scientific Workflow Providers
- Award number: 2100636
- Fiscal year: 2021
- Amount: $220,000
- Project type: Standard Grant
Collaborative Research: EAGER: Advancing Reproducibility in Multi-Messenger Astrophysics
- Award number: 2041901
- Fiscal year: 2020
- Amount: $220,000
- Project type: Standard Grant
Collaborative Research: EAGER: Leveraging Advanced Cyberinfrastructure and Developing Organizational Resilience for NSF Large Facilities in the Pandemic Era
- Award number: 2042054
- Fiscal year: 2020
- Amount: $220,000
- Project type: Standard Grant
Collaborative Research: PPoSS: Planning: Performance Scalability, Trust, and Reproducibility: A Community Roadmap to Robust Science in High-throughput Applications
- Award number: 2028930
- Fiscal year: 2020
- Amount: $220,000
- Project type: Standard Grant
2019 NSF Workshop on Connecting Large Facilities and Cyberinfrastructure
- Award number: 1933353
- Fiscal year: 2019
- Amount: $220,000
- Project type: Standard Grant
Pilot Study for a Cyberinfrastructure Center of Excellence
- Award number: 1842042
- Fiscal year: 2018
- Amount: $220,000
- Project type: Continuing Grant
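Similar NSFC grants
Research on Quantum Field Theory without a Lagrangian Description
- Award number: 24ZR1403900
- Approval year: 2024
- Amount: CNY 0
- Project type: Provincial/municipal project
Cell Research
- Award number: 31224802
- Approval year: 2012
- Amount: CNY 240,000
- Project type: Special fund project
Cell Research
- Award number: 31024804
- Approval year: 2010
- Amount: CNY 240,000
- Project type: Special fund project
Cell Research
- Award number: 30824808
- Approval year: 2008
- Amount: CNY 240,000
- Project type: Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
- Award number: 10774081
- Approval year: 2007
- Amount: CNY 450,000
- Project type: General Program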
Similar overseas grants
Collaborative Research: OAC CORE: Federated-Learning-Driven Traffic Event Management for Intelligent Transportation Systems
- Award number: 2414474
- Fiscal year: 2024
- Amount: $220,000
- Project type: Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
- Award number: 2403312
- Fiscal year: 2024
- Amount: $220,000
- Project type: Standard Grant
Collaborative Research: OAC Core: Large-Scale Spatial Machine Learning for 3D Surface Topology in Hydrological Applications
- Award number: 2414185
- Fiscal year: 2024
- Amount: $220,000
- Project type: Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
- Award number: 2402947
- Fiscal year: 2024
- Amount: $220,000
- Project type: Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
- Award number: 2403313
- Fiscal year: 2024
- Amount: $220,000
- Project type: Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
- Award number: 2402946
- Fiscal year: 2024
- Amount: $220,000
- Project type: Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
- Award number: 2403088
- Fiscal year: 2024
- Amount: $220,000
- Project type: Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
- Award number: 2403090
- Fiscal year: 2024
- Amount: $220,000
- Project type: Standard Grant
Collaborative Research: OAC: Core: Harvesting Idle Resources Safely and Timely for Large-scale AI Applications in High-Performance Computing Systems
- Award number: 2403399
- Fiscal year: 2024
- Amount: $220,000
- Project type: Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
- Award number: 2403089
- Fiscal year: 2024
- Amount: $220,000
- Project type: Standard Grant