Collaborative Research: OAC Core: Simulation-driven runtime resource management for distributed workflow applications

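Basic information

  • Award number:
    2106059
  • Principal investigator:
  • Amount:
    $280,000
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Standard Grant
  • Fiscal year:
    2021
  • Funding country:
    United States
  • Period:
    2021-10-01 to 2024-09-30
  • Project status:
    Completed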

Project abstract

Many scientific breakthroughs in domains such as health, climate modeling, particle physics, and seismology can only be achieved by performing complex processing of vast amounts of data. This processing is automated by software systems that use the compute, storage, and network hardware provided by the cyberinfrastructure. In addition to automation, a key objective of these systems is the efficient use of resources, as measured by cost and energy usage, while making the processing as fast as possible or as fast as needed. To this end, these systems must decide which resources should be used to do what, and when. Many such systems are used in production today and make such decisions. Yet making good, let alone best, decisions is still an open research challenge. Theoretical research has proposed solutions that are difficult to put into practice, and practical solutions are known to not make good decisions, or at least not consistently so. However, both theory and practice follow the same basic philosophy: make decisions by reasoning about known information on what needs to be computed and on what hardware resources are available. This philosophy has shown its limits, so this project adopts a radically different approach. The key idea is to repeatedly execute fast, computationally inexpensive simulations of the application execution in order to evaluate large sets of potential resource management decisions and automatically select the most desirable ones. The benefits of this approach will be demonstrated for several software systems used to support scientific applications that are critical for the development and sustainability of society.

Software systems are used to run scientific applications on advanced cyberinfrastructure. These systems automate application execution and make resource management decisions along several axes, including selecting and provisioning (virtualized) hardware, picking application configuration options, and scheduling application activities in time and space. Their objective is to optimize both application performance and a set of resource usage efficiency metrics that include monetary and energy costs. Consequently, the resource management decision space is enormous, and making good decisions is a steep challenge that has been the subject of countless efforts, both from theoreticians and practitioners. However, the challenge is far from being solved: theoreticians produce solutions that are rarely used by practitioners, and conversely practitioners implement solutions that may be highly sub-optimal because they are not informed by theory. This project resolves this disconnect by obviating the need for developing effective resource management strategies. The key idea is to use online simulations to search the resource management decision space rapidly at runtime. Large numbers of fast simulations of the application's execution are executed throughout that very execution, so as to evaluate many potential resource management options and automatically select desirable ones. This approach thus shifts the overall problem from the design of complex resource management algorithms to the enumeration of many resource management decisions. The transformation of resource management practice in cyberinfrastructure systems not only renders the resource management problem tractable but also unlocks previously out-of-reach resource management decisions. The benefits of this transformation will be demonstrated for a critical class of production systems and applications, specifically Workflow Management Systems and the scientific applications they support.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
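
The key idea in the abstract, enumerating candidate resource management decisions and ranking them with fast simulations, can be illustrated with a minimal Python sketch. Everything in it is a hypothetical illustration rather than the project's software: the names `Decision`, `simulate`, and `select_decision`, the toy list-scheduling simulator, the cost model (every worker is paid for the whole run), and the `cost_weight` trade-off are all assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Decision:
    """Hypothetical resource management decision: how many workers, which task order."""
    num_workers: int
    policy: str  # "fifo" or "longest_first"


def simulate(task_durations, decision):
    """Toy simulator (an assumption, not a real simulation framework).

    Places each task on the worker that becomes free earliest and
    returns (makespan, cost), where cost = workers * makespan.
    """
    tasks = list(task_durations)
    if decision.policy == "longest_first":
        tasks.sort(reverse=True)
    finish = [0.0] * decision.num_workers
    for duration in tasks:
        earliest = min(range(decision.num_workers), key=finish.__getitem__)
        finish[earliest] += duration
    makespan = max(finish)
    return makespan, decision.num_workers * makespan


def select_decision(task_durations, candidates, cost_weight=0.1):
    """Simulate every candidate and return the one with the lowest score."""
    def score(decision):
        makespan, cost = simulate(task_durations, decision)
        return makespan + cost_weight * cost
    return min(candidates, key=score)


# Enumerate the (small) decision space instead of designing a clever algorithm.
candidates = [Decision(n, p) for n, p in product([1, 2, 4], ["fifo", "longest_first"])]
remaining_tasks = [1.0, 1.0, 1.0, 1.0, 4.0]
best = select_decision(remaining_tasks, candidates)
print(best)
```

In an online setting, a runtime system would presumably call something like `select_decision` again whenever the set of remaining tasks or available resources changes, so that decisions track the actual execution. The sketch only shows a single selection step.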

Project outcomes

Journal articles (2)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
WfCommons: Data Collection and Runtime Experiments using Multiple Workflow Systems
On the Feasibility of Simulation-driven Portfolio Scheduling for Cyberinfrastructure Runtime Systems

Other publications by Henri Casanova

High-Bandwidth Low-Latency Approximate Interconnection Networks
  • DOI:
  • Year published:
    2017
  • Journal:
  • Impact factor:
    0
  • Authors:
    Daichi Fujiki;Kiyo Ishii;Ikki Fujiwara;Hiroki Matsutani;Hideharu Amano;Henri Casanova;Michihiro Koibuchi
  • Corresponding author:
    Michihiro Koibuchi
Discussion on Approximate Interconnection Networks
  • DOI:
  • Year published:
    2017
  • Journal:
  • Impact factor:
    0
  • Authors:
    Nguyen T. Truong;Henri Casanova;鯉渕 道紘
  • Corresponding author:
    鯉渕 道紘
一般化ガンマクラスタリングについて
On generalized gamma clustering
  • DOI:
  • Year published:
    2017
  • Journal:
  • Impact factor:
    0
  • Authors:
    Ikki Fujiwara;Michihiro Koibuchi. Tomoya Ozaki;Hiroki Matsutani;Henri Casanova;稲垣貴大・結縁祥治;野津昭文,大前勝弘,江口真透
  • Corresponding author:
    野津昭文,大前勝弘,江口真透
FPGAアクセラレータと高位合成系を用いた瞳検出手法の実装
Implementation of a pupil detection method using an FPGA accelerator and a high-level synthesis system
  • DOI:
  • Year published:
    2013
  • Journal:
  • Impact factor:
    0
  • Authors:
    鯉渕道紘;松谷宏紀;天野英晴;D.Frank Hsu;Henri Casanova;土肥慶亮,柴田裕一郎,小栗清
  • Corresponding author:
    土肥慶亮,柴田裕一郎,小栗清
Androidアプリケーションの並行実行における予期しない消費電力増加の検出
Detecting unexpected power consumption increases in concurrent execution of Android applications
  • DOI:
  • Year published:
    2017
  • Journal:
  • Impact factor:
    0
  • Authors:
    Ikki Fujiwara;Michihiro Koibuchi. Tomoya Ozaki;Hiroki Matsutani;Henri Casanova;稲垣貴大・結縁祥治
  • Corresponding author:
    稲垣貴大・結縁祥治

Other grants held by Henri Casanova

Collaborative Research: Elements: Simulation-driven Evaluation of Cyberinfrastructure Systems
  • Award number:
    2103489
  • Fiscal year:
    2021
  • Amount:
    $280,000
  • Award type:
    Standard Grant
CCRI: Planning: Collaborative Research: Infrastructure for Enabling Systematic Development and Research of Scientific Workflow Management Systems
  • Award number:
    2016610
  • Fiscal year:
    2020
  • Amount:
    $280,000
  • Award type:
    Standard Grant
Collaborative Research: CyberTraining: Implementation: Small: Integrating core CI literacy and skills into university curricula via simulation-driven activities
  • Award number:
    1923621
  • Fiscal year:
    2019
  • Amount:
    $280,000
  • Award type:
    Standard Grant
Collaborative Research: SI2-SSE: WRENCH: A Simulation Workbench for Scientific Worflow Users, Developers, and Researchers
  • Award number:
    1642369
  • Fiscal year:
    2017
  • Amount:
    $280,000
  • Award type:
    Standard Grant
Collaborative Research: II-New: Distributed Research Testbed (DiRT)
  • Award number:
    0855245
  • Fiscal year:
    2009
  • Amount:
    $280,000
  • Award type:
    Standard Grant
Collaborative Research: CSR-PDOS: Designing Large-Scale Distributed Systems for Realistic Failure Models
  • Award number:
    0546688
  • Fiscal year:
    2005
  • Amount:
    $280,000
  • Award type:
    Standard Grant

Similar NSFC grants

Research on Quantum Field Theory without a Lagrangian Description
  • Award number:
    24ZR1403900
  • Year approved:
    2024
  • Amount:
    CNY 0
  • Award type:
    Provincial/municipal project
Cell Research
  • Award number:
    31224802
  • Year approved:
    2012
  • Amount:
    CNY 240,000
  • Award type:
    Special fund project
Cell Research
  • Award number:
    31024804
  • Year approved:
    2010
  • Amount:
    CNY 240,000
  • Award type:
    Special fund project
Cell Research
  • Award number:
    30824808
  • Year approved:
    2008
  • Amount:
    CNY 240,000
  • Award type:
    Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
  • Award number:
    10774081
  • Year approved:
    2007
  • Amount:
    CNY 450,000
  • Award type:
    General Program

Similar overseas grants

Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
  • Award number:
    2403312
  • Fiscal year:
    2024
  • Amount:
    $280,000
  • Award type:
    Standard Grant
Collaborative Research: OAC CORE: Federated-Learning-Driven Traffic Event Management for Intelligent Transportation Systems
  • Award number:
    2414474
  • Fiscal year:
    2024
  • Amount:
    $280,000
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
  • Award number:
    2402947
  • Fiscal year:
    2024
  • Amount:
    $280,000
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: Distributed Graph Learning Cyberinfrastructure for Large-scale Spatiotemporal Prediction
  • Award number:
    2403313
  • Fiscal year:
    2024
  • Amount:
    $280,000
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: Large-Scale Spatial Machine Learning for 3D Surface Topology in Hydrological Applications
  • Award number:
    2414185
  • Fiscal year:
    2024
  • Amount:
    $280,000
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: Learning AI Surrogate of Large-Scale Spatiotemporal Simulations for Coastal Circulation
  • Award number:
    2402946
  • Fiscal year:
    2024
  • Amount:
    $280,000
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403088
  • Fiscal year:
    2024
  • Amount:
    $280,000
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403090
  • Fiscal year:
    2024
  • Amount:
    $280,000
  • Award type:
    Standard Grant
Collaborative Research: OAC: Core: Harvesting Idle Resources Safely and Timely for Large-scale AI Applications in High-Performance Computing Systems
  • Award number:
    2403399
  • Fiscal year:
    2024
  • Amount:
    $280,000
  • Award type:
    Standard Grant
Collaborative Research: OAC Core: CropDL - Scheduling and Checkpoint/Restart Support for Deep Learning Applications on HPC Clusters
  • Award number:
    2403089
  • Fiscal year:
    2024
  • Amount:
    $280,000
  • Award type:
    Standard Grant