SHF: Medium: Collaborative Research: ANACIN-X: Analysis and modeling of Nondeterminism and Associated Costs in eXtreme scale applications

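Basic information

  • Grant number:
    1900765
  • Principal investigator:
  • Amount:
    $316,200
  • Host institution:
  • Host institution country:
    United States
  • Grant type:
    Continuing Grant
  • Fiscal year:
    2019
  • Funding country:
    United States
  • Project period:
    2019-08-01 to 2025-07-31
  • Project status:
    Ongoing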

Project abstract

Nondeterminism (i.e., the property of a scientific application to exhibit different numerical results and execution patterns across multiple executions) is an increasingly entrenched property of high performance computing (HPC) applications as the scientific community moves its simulations onto larger and highly heterogeneous computing systems. Nondeterminism can drastically increase the cost of three activities: achieving scientific reproducibility, in terms of developer time and computational resources; debugging applications when moving from a smaller to a larger scale or from one platform to another; and ensuring fault tolerance when executions may need to recover from a system fault. These three challenges can ultimately compromise the amount and quality of scientific discovery through computer simulations. Tools for addressing aspects of the nondeterminism problem have emerged, including record-and-replay (R&R) techniques, which monitor and record changes in program state over one execution of an application (the recorded execution) and reproduce those changes, and thus the behavior of the application, during a subsequent execution (the replayed execution). However, these tools impose overheads on the underlying application and thus present HPC users with the problem of balancing tool utility against tool overhead. HPC users may opt not to use a tool at all rather than deal with unpredictable overheads. This project supports HPC users by modeling the relationship between application nondeterminism and variability in tool overhead, and uses this knowledge to identify hot spots in terms of tool cost as well as regions in executions that trigger nondeterministic behaviors in the applications. The aim of the project is to model nondeterministic executions by determining points of nondeterminism (motifs) in executions of HPC applications, and to apply the motif modeling to R&R techniques in order to study the cost that certain motifs impose on those techniques. The outcome of this project impacts four communities: application developers, through the identification and management of sources of unintended nondeterminism; the HPC research community working on fault tolerance, resilience, and reproducibility at exascale; data center administrators who use evaluation tools for and with application developers; and educators and trainers in resource-constrained environments, who can promote HPC without needing access to high-end, expensive computers.
This project advances the study of nondeterministic HPC applications by studying the recording costs of record-and-replay (R&R) tools and by defining strategies so that these tools can scale to the exascale domain. In addition to the more commonly studied factors of time and memory overhead, the project integrates power usage in the modeling. The project relies on graph theory to develop expressive and scalable graph-based representations of the dependencies between events in a program, and develops algorithms to identify motifs in the graph that indicate points of nondeterminism. These motifs are applied to quantify the associated costs of nondeterminism, including developing metrics to measure dissimilarities between different executions, modeling the costs of recording executions, and assessing the overhead of recordings. Based on these motifs, work on this project generates 'fingerprints' (i.e., a holistic characterization of how and where nondeterminism manifests during the application executions) of real-world HPC applications, including (1) N-body problems (e.g., simulating particle, atomic, and planetary interactions); (2) graph analytics (e.g., the Graph500 benchmark); (3) bioinformatics (e.g., mpiBLAST); and (4) task-based data analysis applications (e.g., WordCount, Join, and Octree Clustering on top of MapReduce-over-MPI frameworks). The fingerprints illuminate previously overlooked similarities between the nondeterminism that manifests across multiple classes of applications and allow users to probe the relationship between process communication patterns, the motifs of the actual resulting executions, and the regions of those executions in which tool overhead accumulates for nondeterministic HPC applications.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
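
As a rough illustration of the graph-based idea described in the abstract, the following Python sketch compares two runs of a message-passing program by the edges that link each receive event to the sender it matched. The trace format, the function names, and the choice of Jaccard distance as the dissimilarity metric are all assumptions made for this example; they are not taken from the project or its tools.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical trace format (an assumption for this sketch): for each
# receiving rank, the ordered list of sender ranks whose messages were
# matched by that rank's successive receives.
Trace = Dict[int, List[int]]
Event = Tuple[int, int]      # (receiver rank, receive index)
Edge = Tuple[Event, int]     # (receive event, matched sender rank)


def event_edges(trace: Trace) -> Set[Edge]:
    """Turn a trace into a set of 'receive event matched sender' edges."""
    return {
        ((receiver, index), sender)
        for receiver, senders in trace.items()
        for index, sender in enumerate(senders)
    }


def dissimilarity(run_a: Trace, run_b: Trace) -> float:
    """Jaccard distance between the edge sets of two runs (0.0 = identical)."""
    a, b = event_edges(run_a), event_edges(run_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def divergent_receives(run_a: Trace, run_b: Trace) -> List[Event]:
    """Receive events whose matched sender differs between the two runs."""
    a, b = event_edges(run_a), event_edges(run_b)
    return sorted({event for event, _ in a ^ b})


# Two made-up runs: rank 0 matches its first two messages in a different order.
run_1: Trace = {0: [1, 2, 3], 1: [0]}
run_2: Trace = {0: [2, 1, 3], 1: [0]}

print(round(dissimilarity(run_1, run_2), 3))   # 0.667
print(divergent_receives(run_1, run_2))        # [(0, 0), (0, 1)]
```

In this toy case the two runs differ only in the order in which rank 0 matched its first two messages, so the sketch reports two divergent receive events. A real analysis would need a much richer event graph than this flat edge set.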

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants by Sanjukta Bhowmick

Collaborative Research: CCRI: Planning: A Multilayer Network (MLN) Community Infrastructure for Data, Interaction, Visualization, and softwarE (MLN-DIVE)
  • Grant number:
    2120414
  • Fiscal year:
    2021
  • Funding amount:
    $316,200
  • Grant type:
    Standard Grant
Collaborative Research: Framework Implementations: CSSI: CANDY: Cyberinfrastructure for Accelerating Innovation in Network Dynamics
  • Grant number:
    2104076
  • Fiscal year:
    2021
  • Funding amount:
    $316,200
  • Grant type:
    Standard Grant
Collaborative Research: SHF: Medium: NetSplicer: Scalable Decoupling-based Algorithms for Multilayer Network Analysis
  • Grant number:
    1956373
  • Fiscal year:
    2020
  • Funding amount:
    $316,200
  • Grant type:
    Standard Grant
XPS: EXPL: FP: Collaborative Research: SPANDAN: Scalable Parallel Algorithms for Network Dynamics Analysis
  • Grant number:
    1924486
  • Fiscal year:
    2018
  • Funding amount:
    $316,200
  • Grant type:
    Standard Grant
SPX: Collaborative Research: SANDY: Sparsification-Based Approach for Analyzing Network Dynamics
  • Grant number:
    1916084
  • Fiscal year:
    2018
  • Funding amount:
    $316,200
  • Grant type:
    Continuing Grant
SPX: Collaborative Research: SANDY: Sparsification-Based Approach for Analyzing Network Dynamics
  • Grant number:
    1725566
  • Fiscal year:
    2017
  • Funding amount:
    $316,200
  • Grant type:
    Continuing Grant
XPS: EXPL: FP: Collaborative Research: SPANDAN: Scalable Parallel Algorithms for Network Dynamics Analysis
  • Grant number:
    1533881
  • Fiscal year:
    2015
  • Funding amount:
    $316,200
  • Grant type:
    Standard Grant

Similar overseas grants

Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
  • Grant number:
    2403134
  • Fiscal year:
    2024
  • Funding amount:
    $316,200
  • Grant type:
    Standard Grant
Collaborative Research: SHF: Medium: Enabling Graphics Processing Unit Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
  • Grant number:
    2402804
  • Fiscal year:
    2024
  • Funding amount:
    $316,200
  • Grant type:
    Standard Grant
Collaborative Research: SHF: Medium: Tiny Chiplets for Big AI: A Reconfigurable-On-Package System
  • Grant number:
    2403408
  • Fiscal year:
    2024
  • Funding amount:
    $316,200
  • Grant type:
    Standard Grant
Collaborative Research: SHF: Medium: Toward Understandability and Interpretability for Neural Language Models of Source Code
  • Grant number:
    2423813
  • Fiscal year:
    2024
  • Funding amount:
    $316,200
  • Grant type:
    Standard Grant
Collaborative Research: SHF: Medium: Enabling GPU Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
  • Grant number:
    2402806
  • Fiscal year:
    2024
  • Funding amount:
    $316,200
  • Grant type:
    Standard Grant
Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
  • Grant number:
    2403135
  • Fiscal year:
    2024
  • Funding amount:
    $316,200
  • Grant type:
    Standard Grant
Collaborative Research: SHF: Medium: Tiny Chiplets for Big AI: A Reconfigurable-On-Package System
  • Grant number:
    2403409
  • Fiscal year:
    2024
  • Funding amount:
    $316,200
  • Grant type:
    Standard Grant
Collaborative Research: SHF: Medium: Enabling GPU Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
  • Grant number:
    2402805
  • Fiscal year:
    2024
  • Funding amount:
    $316,200
  • Grant type:
    Standard Grant
Collaborative Research: SHF: Medium: High-Performance, Verified Accelerator Programming
  • Grant number:
    2313024
  • Fiscal year:
    2023
  • Funding amount:
    $316,200
  • Grant type:
    Standard Grant
Collaborative Research: SHF: Medium: Verifying Deep Neural Networks with Spintronic Probabilistic Computers
  • Grant number:
    2311295
  • Fiscal year:
    2023
  • Funding amount:
    $316,200
  • Grant type:
    Continuing Grant