SHF: Medium: Collaborative Research: ANACIN-X: Analysis and modeling of Nondeterminism and Associated Costs in eXtreme scale applications

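Basic information

  • Award number:
    1900888
  • Principal investigator:
  • Amount:
    $915,700
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Continuing Grant
  • Fiscal year:
    2019
  • Funding country:
    United States
  • Project period:
    2019-08-01 to 2025-07-31
  • Project status:
    Not yet concluded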

Project abstract

Nondeterminism (i.e., the tendency of a scientific application to exhibit different numerical results and execution patterns across multiple executions) is an increasingly entrenched property of high performance computing (HPC) applications as the scientific community moves its simulations to larger and highly heterogeneous computing systems. Nondeterminism can drastically increase the cost, in developer time and computational resources, of three tasks: achieving scientific reproducibility, debugging applications when moving from a smaller to a larger scale or from one platform to another, and ensuring fault tolerance when executions may need to recover from a system fault. These three challenges can ultimately compromise the amount and quality of scientific discovery through computer simulations. Tools for addressing aspects of the nondeterminism problem have emerged, including record-and-replay (R&R) techniques that monitor and record changes in program state over one execution of an application (the recorded execution) and reproduce those changes, and thus the behavior of the application, during a subsequent execution (the replayed execution). However, these tools impose overheads on the underlying application and thus present HPC users with the problem of balancing tool utility against tool overhead. HPC users may opt not to use a tool at all rather than deal with unpredictable overheads. This project supports HPC users by modeling the relationship between application nondeterminism and variability in tool overhead, and uses this knowledge to identify hot spots of tool cost as well as regions of executions that trigger nondeterministic behavior in the applications. The aim of the project is to model nondeterministic executions by determining points (motifs) of nondeterminism in executions of HPC applications, and to apply the motif modeling to R&R techniques in order to study the cost that certain motifs impose on them.
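As a small illustration of how numerical results can differ from run to run, the sketch below assumes one common mechanism that the abstract does not itself name: floating-point addition is not associative, so a reduction that adds contributions in message-arrival order can return different totals. The values and the function name `reduce_in_order` are hypothetical and chosen only for illustration.

```python
from itertools import permutations

def reduce_in_order(values):
    """Sum values left to right, as a receiver adding messages on arrival would."""
    total = 0.0
    for value in values:
        total += value
    return total

# Hypothetical contributions from four processes (illustrative values only).
contributions = [1e16, 1.0, -1e16, 1.0]

# Each permutation stands in for one possible arrival order of the messages.
results = {reduce_in_order(order) for order in permutations(contributions)}

# More than one distinct total appears, although the inputs never change.
print(sorted(results))
```

Enumerating permutations instead of running real processes keeps the example deterministic while still showing that the arrival order alone changes the result.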
The outcome of this project impacts four communities: application developers, through the identification and management of sources of unintended nondeterminism; the HPC research community working on fault tolerance, resilience, and reproducibility at exascale; data center administrators who use evaluation tools for and with application developers; and educators and trainers in resource-constrained environments, who can promote HPC without needing access to high-end, expensive computers.
This project advances the study of nondeterministic HPC applications by studying the recording costs of record-and-replay (R&R) tools and by defining strategies so that these tools can scale to the exascale domain. In addition to the more commonly studied factors of time and memory overhead, the project integrates power usage into the modeling. The project relies on graph theory to develop expressive and scalable graph-based representations of the dependencies between events in a program, and develops algorithms to identify motifs in the graph that indicate points of nondeterminism. These motifs are applied to quantify the associated costs of nondeterminism, including developing metrics to measure dissimilarities between different executions, modeling the costs of recording executions, and assessing the overhead of recordings. Based on these motifs, work on this project generates "fingerprints" (i.e., a holistic characterization of how and where nondeterminism manifests during application executions) of real-world HPC applications, including (1) N-body problems (e.g., simulating particle, atomic, and planetary interactions); (2) graph analytics (e.g., the Graph500 benchmark); (3) bioinformatics (e.g., mpiBLAST); and (4) task-based data analysis applications (e.g., WordCount, Join, and Octree Clustering on top of MapReduce-over-MPI frameworks).
The fingerprints illuminate previously overlooked similarities between the nondeterminism that manifests across multiple classes of applications, and allow users to probe the relationship between process communication patterns, the motifs of the actual resulting executions, and the regions of those executions in which tool overhead accumulates for nondeterministic HPC applications.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
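The graph-based comparison of executions described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: each execution is modeled as a directed graph of communication events, node labels are refined Weisfeiler-Lehman style from their predecessors, and the dissimilarity between two executions is the distance induced by the resulting label-count kernel. The function names, event labels, and the three-process scenario are all hypothetical.

```python
from collections import Counter
from math import sqrt

def wl_features(labels, edges, iterations=2):
    """Count Weisfeiler-Lehman-style labels of a directed event graph.

    labels: dict mapping node -> initial label (e.g. event type and rank)
    edges:  list of (src, dst) dependencies between events
    """
    preds = {node: [] for node in labels}
    for src, dst in edges:
        preds[dst].append(src)
    current = dict(labels)
    features = Counter(current.values())
    for _ in range(iterations):
        refined = {}
        for node, label in current.items():
            # Refine each label with the sorted labels of its predecessors.
            incoming = ",".join(sorted(current[p] for p in preds[node]))
            refined[node] = f"{label}<{incoming}>"
        current = refined
        features.update(current.values())
    return features

def kernel(f, g):
    """Inner product of two label-count vectors."""
    return sum(count * g[label] for label, count in f.items())

def kernel_distance(graph_a, graph_b):
    """Distance between two executions induced by the label-count kernel."""
    f, g = wl_features(*graph_a), wl_features(*graph_b)
    return sqrt(kernel(f, f) + kernel(g, g) - 2 * kernel(f, g))

# Hypothetical scenario: rank 0 posts two receives; ranks 1 and 2 each send once.
labels = {"s1": "send@1", "s2": "send@2", "r_a": "recv@0", "r_b": "recv@0"}
program_order = [("r_a", "r_b")]
run_a = (labels, program_order + [("s1", "r_a"), ("s2", "r_b")])  # rank 1 matched first
run_b = (labels, program_order + [("s2", "r_a"), ("s1", "r_b")])  # rank 2 matched first

print(kernel_distance(run_a, run_a))  # identical executions
print(kernel_distance(run_a, run_b))  # executions with a different message order
```

The distance is zero for identical executions and grows as the matched message order diverges, which is the kind of dissimilarity metric the abstract refers to.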

Project outcomes

Journal articles (5)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
A Research-Based Course Module to Study Non-determinism in High Performance Applications
ANACIN-X: A software framework for studying non-determinism in MPI applications
  • DOI:
    10.1016/j.simpa.2021.100151
  • Publication year:
    2021
  • Journal:
  • Impact factor:
    0
  • Authors:
    Bell, Patrick; Suarez, Kae; Chapp, Dylan; Tan, Nigel; Bhowmick, Sanjukta; Taufer, Michela
  • Corresponding author:
    Taufer, Michela
A Survey of Graph Comparison Methods with Applications to Nondeterminism in High-Performance Computing
Identifying Degree and Sources of Non-Determinism in MPI Applications Via Graph Kernels

Other publications by Michela Taufer

Enhancing Scientific Research with FAIR Digital Objects in the National Science Data Fabric
  • DOI:
  • Publication year:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Michela Taufer; Heberth Martinez; Jakob Luettgau; Lauren Whitnah; G. Scorzelli; P. Newell; Aashish Panta; P. Bremer; Douglas Fils; Christine R. Kirkpatrick; V. Pascucci; Kathryn Mohror; J. Shalf
  • Corresponding author:
    J. Shalf
Integrating FAIR Digital Objects (FDOs) into the National Science Data Fabric (NSDF) to Revolutionize Dataflows for Scientific Discovery
  • DOI:
  • Publication year:
  • Journal:
  • Impact factor:
    0
  • Authors:
    Michela Taufer; Heberth Martinez; Jakob Luettgau; Lauren Whitnah; Giorgio Scorzelli; Pania Newel; Aashish Panta; Timo Bremer; Doug Fils; Christine R. Kirkpatrick; Nina McCurdy; V. Pascucci
  • Corresponding author:

Other grants by Michela Taufer

EAGER: A Comprehensive Approach for Generating, Sharing, Searching, and Using High-Resolution Terrain Parameters
  • Award number:
    2334945
  • Fiscal year:
    2023
  • Award amount:
    $915,700
  • Award type:
    Standard Grant
Collaborative Research: SHF: Small: Model-driven Design and Optimization of Dataflows for Scientific Applications
  • Award number:
    2331152
  • Fiscal year:
    2023
  • Award amount:
    $915,700
  • Award type:
    Standard Grant
SHF: Small: Methods, Workflows, and Data Commons for Reducing Training Costs in Neural Architecture Search on High-Performance Computing Platforms
  • Award number:
    2223704
  • Fiscal year:
    2022
  • Award amount:
    $915,700
  • Award type:
    Standard Grant
Collaborative Research: Elements: SENSORY: Software Ecosystem for kNowledge diScOveRY - a data-driven framework for soil moisture applications
  • Award number:
    2103845
  • Fiscal year:
    2021
  • Award amount:
    $915,700
  • Award type:
    Standard Grant
Collaborative Research: PPoSS: Planning: Performance Scalability, Trust, and Reproducibility: A Community Roadmap to Robust Science in High-throughput Applications
  • Award number:
    2028923
  • Fiscal year:
    2020
  • Award amount:
    $915,700
  • Award type:
    Standard Grant
Collaborative Research: EAGER: Advancing Reproducibility in Multi-Messenger Astrophysics
  • Award number:
    2041977
  • Fiscal year:
    2020
  • Award amount:
    $915,700
  • Award type:
    Standard Grant
Collaborative: EAGER: Exploring and Advancing the State of the Art in Robust Science in Gravitational Wave Physics
  • Award number:
    1841399
  • Fiscal year:
    2018
  • Award amount:
    $915,700
  • Award type:
    Standard Grant
Collaborative: EAGER: Exploring and Advancing the State of the Art in Robust Science in Gravitational Wave Physics
  • Award number:
    1823372
  • Fiscal year:
    2018
  • Award amount:
    $915,700
  • Award type:
    Standard Grant
SHF: Medium: Collaborative Research: A comprehensive methodology to pursue reproducible accuracy in ensemble scientific simulations on multi- and many-core platforms
  • Award number:
    1841552
  • Fiscal year:
    2018
  • Award amount:
    $915,700
  • Award type:
    Standard Grant
BIGDATA: IA: Collaborative Research: In Situ Data Analytics for Next Generation Molecular Dynamics Workflows
  • Award number:
    1841758
  • Fiscal year:
    2018
  • Award amount:
    $915,700
  • Award type:
    Standard Grant

Similar overseas grants

Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
  • Award number:
    2403134
  • Fiscal year:
    2024
  • Award amount:
    $915,700
  • Award type:
    Standard Grant
Collaborative Research: SHF: Medium: Enabling Graphics Processing Unit Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
  • Award number:
    2402804
  • Fiscal year:
    2024
  • Award amount:
    $915,700
  • Award type:
    Standard Grant
Collaborative Research: SHF: Medium: Tiny Chiplets for Big AI: A Reconfigurable-On-Package System
  • Award number:
    2403408
  • Fiscal year:
    2024
  • Award amount:
    $915,700
  • Award type:
    Standard Grant
Collaborative Research: SHF: Medium: Toward Understandability and Interpretability for Neural Language Models of Source Code
  • Award number:
    2423813
  • Fiscal year:
    2024
  • Award amount:
    $915,700
  • Award type:
    Standard Grant
Collaborative Research: SHF: Medium: Enabling GPU Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
  • Award number:
    2402806
  • Fiscal year:
    2024
  • Award amount:
    $915,700
  • Award type:
    Standard Grant
Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
  • Award number:
    2403135
  • Fiscal year:
    2024
  • Award amount:
    $915,700
  • Award type:
    Standard Grant
Collaborative Research: SHF: Medium: Tiny Chiplets for Big AI: A Reconfigurable-On-Package System
  • Award number:
    2403409
  • Fiscal year:
    2024
  • Award amount:
    $915,700
  • Award type:
    Standard Grant
Collaborative Research: SHF: Medium: Enabling GPU Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
  • Award number:
    2402805
  • Fiscal year:
    2024
  • Award amount:
    $915,700
  • Award type:
    Standard Grant
Collaborative Research: SHF: Medium: High-Performance, Verified Accelerator Programming
  • Award number:
    2313024
  • Fiscal year:
    2023
  • Award amount:
    $915,700
  • Award type:
    Standard Grant
Collaborative Research: SHF: Medium: Verifying Deep Neural Networks with Spintronic Probabilistic Computers
  • Award number:
    2311295
  • Fiscal year:
    2023
  • Award amount:
    $915,700
  • Award type:
    Continuing Grant