Collaborative Research: OAC Core: Simulation-driven runtime resource management for distributed workflow applications

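Basic information

  • Award number:
    2106147
  • Principal investigator:
  • Amount:
    $220,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2021
  • Funding country:
    United States
  • Project period:
    2021-10-01 to 2025-09-30
  • Project status:
    Ongoing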

Project abstract

Many scientific breakthroughs in domains such as health, climate modeling, particle physics, and seismology can only be achieved by performing complex processing of vast amounts of data. This processing is automated by software systems that use the compute, storage, and network hardware provided by the cyberinfrastructure. In addition to automation, a key objective of these systems is the efficient use of resources, as measured by cost and energy usage, while making the processing as fast as possible or as fast as needed. To this end, these systems must decide which resources should be used to do what, and when. Many such systems are used in production today and make such decisions. Yet making good, let alone best, decisions is still an open research challenge. Theoretical research has proposed solutions that are difficult to put into practice, and practical solutions are known not to make good decisions, or at least not consistently so. However, both theory and practice follow the same basic philosophy: make decisions by reasoning about known information on what needs to be computed and on what hardware resources are available. This philosophy has shown its limits, so this project adopts a radically different approach. The key idea is to repeatedly execute fast, computationally inexpensive simulations of the application execution in order to evaluate large sets of potential resource management decisions and automatically select the most desirable ones. The benefits of this approach will be demonstrated for several software systems used to support scientific applications that are critical for the development and sustainability of society.
Software systems are used to run scientific applications on advanced cyberinfrastructure. These systems automate application execution and make resource management decisions along several axes, including selecting and provisioning (virtualized) hardware, picking application configuration options, and scheduling application activities in time and space. Their objective is to optimize both application performance and a set of resource usage efficiency metrics that include monetary and energy costs. Consequently, the resource management decision space is enormous, and making good decisions is a steep challenge that has been the subject of countless efforts by both theoreticians and practitioners. However, the challenge is far from being solved: theoreticians produce solutions that are rarely used by practitioners, and conversely practitioners implement solutions that may be highly sub-optimal because they are not informed by theory. This project resolves this disconnect by obviating the need for developing effective resource management strategies. The key idea is to use online simulations to search the resource management decision space rapidly at runtime. Large numbers of fast simulations of the application's execution are executed throughout that very execution, so as to evaluate many potential resource management options and automatically select desirable ones. This approach thus shifts the overall problem from the design of complex resource management algorithms to the enumeration of many resource management decisions. The transformation of resource management practice in cyberinfrastructure systems not only renders the resource management problem tractable but also unlocks previously out-of-reach resource management decisions. The benefits of this transformation will be demonstrated for a critical class of production systems and applications, specifically Workflow Management Systems and the scientific applications they support.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
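The key idea in the abstract, evaluating many candidate decisions with fast simulations and keeping the most desirable one, can be illustrated with a minimal Python sketch. This is not the project's implementation. The toy simulator (greedy list scheduling of independent tasks), the function names `simulate` and `select_decision`, and the scoring rule (makespan plus a weighted resource cost) are all assumptions made for illustration. A real system would presumably repeat such a search at multiple points during the application's execution.

```python
import heapq
from itertools import product


def simulate(task_durations, n_workers, order):
    """Toy simulator (hypothetical): greedy list scheduling of independent
    tasks on identical workers. Returns the simulated makespan."""
    durations = sorted(task_durations, reverse=(order == "longest_first"))
    # Each heap entry is the time at which one worker becomes free.
    free_at = [0.0] * n_workers
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + d)
    return max(free_at)


def select_decision(task_durations, worker_options, order_options,
                    price_per_worker_time=1.0, cost_weight=0.1):
    """Enumerate candidate decisions, simulate each one, and return the
    candidate with the lowest score. The score is an assumed objective:
    makespan + cost_weight * (workers * makespan * price)."""
    best = None
    for n_workers, order in product(worker_options, order_options):
        makespan = simulate(task_durations, n_workers, order)
        cost = n_workers * makespan * price_per_worker_time
        score = makespan + cost_weight * cost
        if best is None or score < best["score"]:
            best = {"score": score, "n_workers": n_workers,
                    "order": order, "makespan": makespan, "cost": cost}
    return best


if __name__ == "__main__":
    tasks = [4, 3, 3, 2, 2, 2]  # made-up task durations
    choice = select_decision(tasks, [1, 2, 3],
                             ["shortest_first", "longest_first"])
    print(choice)
```

The sketch contains no scheduling heuristic beyond the enumeration itself: the quality of the chosen decision depends only on how many candidates are simulated and on how faithful the simulator is, which is the shift the abstract describes.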

Project outcomes

Journal articles (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
On the Feasibility of Simulation-driven Portfolio Scheduling for Cyberinfrastructure Runtime Systems

Other publications by Ewa Deelman

Mapping Abstract Complex Workflows onto Grid Environments
  • DOI:
    10.1023/a:1024000426962
  • Published:
    2003-01-01
  • Journal:
  • Impact factor:
    2.900
  • Authors:
    Ewa Deelman;James Blythe;Yolanda Gil;Carl Kesselman;Gaurang Mehta;Karan Vahi;Kent Blackburn;Albert Lazzarini;Adam Arbree;Richard Cavanaugh;Scott Koranda
  • Corresponding author:
    Scott Koranda
Advancing Anomaly Detection in Computational Workflows with Active Learning
  • DOI:
    10.48550/arxiv.2405.06133
  • Published:
    2024
  • Journal:
  • Impact factor:
    0
  • Authors:
    Krishnan Raghavan;George Papadimitriou;Hongwei Jin;A. Mandal;Mariam Kiran;Prasanna Balaprakash;Ewa Deelman
  • Corresponding author:
    Ewa Deelman
A terminology for scientific workflow systems
  • DOI:
    10.1016/j.future.2025.107974
  • Published:
    2026-01-01
  • Journal:
  • Impact factor:
    6.100
  • Authors:
    Frédéric Suter;Tainã Coleman;İlkay Altintaş;Rosa M. Badia;Bartosz Balis;Kyle Chard;Iacopo Colonnelli;Ewa Deelman;Paolo Di Tommaso;Thomas Fahringer;Carole Goble;Shantenu Jha;Daniel S. Katz;Johannes Köster;Ulf Leser;Kshitij Mehta;Hilary Oliver;J.-Luc Peterson;Giovanni Pizzi;Loïc Pottier;Rafael Ferreira da Silva
  • Corresponding author:
    Rafael Ferreira da Silva
Broadening Student Engagement To Build the Next Generation of Cyberinfrastructure Professionals
  • DOI:
    10.1145/3569951.3597567
  • Published:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Angela Murillo;Don Brower;Sarowar Hossain;K. Kee;A. Mandal;J. Nabrzyski;Erik Scott;Nicole K. Virdone;Rodney Ewing;Ewa Deelman
  • Corresponding author:
    Ewa Deelman
How is Artificial Intelligence Changing Science?

Other grants held by Ewa Deelman

Collaborative Research: CyberTraining: Implementation: Medium: CyberInfrastructure Training and Education for Synchrotron X-Ray Science (X-CITE)
  • Award number:
    2320375
  • Fiscal year:
    2023
  • Award amount:
    $220,000
  • Project type:
    Standard Grant
Collaborative Research: SHF: Small: Model-driven Design and Optimization of Dataflows for Scientific Applications
  • Award number:
    2331153
  • Fiscal year:
    2023
  • Award amount:
    $220,000
  • Project type:
    Standard Grant
CI CoE: CI Compass: An NSF Cyberinfrastructure (CI) Center of Excellence for Navigating the Major Facilities Data Lifecycle
  • Award number:
    2127548
  • Fiscal year:
    2021
  • Award amount:
    $220,000
  • Project type:
    Standard Grant
Collaborative Research: Elements: Simulation-driven Evaluation of Cyberinfrastructure Systems
  • Award number:
    2103508
  • Fiscal year:
    2021
  • Award amount:
    $220,000
  • Project type:
    Standard Grant
Collaborative Research: EAGER: VisDict - Visual Dictionaries for Enhancing the Communication between Domain Scientists and Scientific Workflow Providers
  • Award number:
    2100636
  • Fiscal year:
    2021
  • Award amount:
    $220,000
  • Project type:
    Standard Grant
Collaborative Research: EAGER: Advancing Reproducibility in Multi-Messenger Astrophysics
  • Award number:
    2041901
  • Fiscal year:
    2020
  • Award amount:
    $220,000
  • Project type:
    Standard Grant
Collaborative Research: EAGER: Leveraging Advanced Cyberinfrastructure and Developing Organizational Resilience for NSF Large Facilities in the Pandemic Era
  • Award number:
    2042054
  • Fiscal year:
    2020
  • Award amount:
    $220,000
  • Project type:
    Standard Grant
Collaborative Research: PPoSS: Planning: Performance Scalability, Trust, and Reproducibility: A Community Roadmap to Robust Science in High-throughput Applications
  • Award number:
    2028930
  • Fiscal year:
    2020
  • Award amount:
    $220,000
  • Project type:
    Standard Grant
2019 NSF Workshop on Connecting Large Facilities and Cyberinfrastructure
  • Award number:
    1933353
  • Fiscal year:
    2019
  • Award amount:
    $220,000
  • Project type:
    Standard Grant
Pilot Study for a Cyberinfrastructure Center of Excellence
  • Award number:
    1842042
  • Fiscal year:
    2018
  • Award amount:
    $220,000
  • Project type:
    Continuing Grant

Similar grants from the National Natural Science Foundation of China (NSFC)

Research on Quantum Field Theory without a Lagrangian Description
  • Award number:
    24ZR1403900
  • Approval year:
    2024
  • Award amount:
    CNY 0
  • Project type:
    Provincial/municipal project
Cell Research
  • Award number:
    31224802
  • Approval year:
    2012
  • Award amount:
    CNY 240,000
  • Project type:
    Special fund project
Cell Research
  • Award number:
    31024804
  • Approval year:
    2010
  • Award amount:
    CNY 240,000
  • Project type:
    Special fund project
Cell Research
  • Award number:
    30824808
  • Approval year:
    2008
  • Award amount:
    CNY 240,000
  • Project type:
    Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
  • Award number:
    10774081
  • Approval year:
    2007
  • Award amount:
    CNY 450,000
  • Project type:
    General Program

Similar overseas grants

Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
  • Award number:
    2403312
  • Fiscal year:
    2024
  • Award amount:
    $220,000
  • Project type:
    Standard Grant
Collaborative Research: OAC CORE: Federated-Learning-Driven Traffic Event Management for Intelligent Transportation Systems
  • Award number:
    2414474
  • Fiscal year:
    2024
  • Award amount:
    $220,000
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
  • Award number:
    2402947
  • Fiscal year:
    2024
  • Award amount:
    $220,000
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
  • Award number:
    2403313
  • Fiscal year:
    2024
  • Award amount:
    $220,000
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: Large-Scale Spatial Machine Learning for 3D Surface Topology in Hydrological Applications
  • Award number:
    2414185
  • Fiscal year:
    2024
  • Award amount:
    $220,000
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
  • Award number:
    2402946
  • Fiscal year:
    2024
  • Award amount:
    $220,000
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403088
  • Fiscal year:
    2024
  • Award amount:
    $220,000
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403090
  • Fiscal year:
    2024
  • Award amount:
    $220,000
  • Project type:
    Standard Grant
Collaborative Research: OAC: Core: Harvesting Idle Resources Safely and Timely for Large-scale AI Applications in High-Performance Computing Systems
  • Award number:
    2403399
  • Fiscal year:
    2024
  • Award amount:
    $220,000
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403089
  • Fiscal year:
    2024
  • Award amount:
    $220,000
  • Project type:
    Standard Grant