
OBSERVABILITY COMPENSATION PARADIGM: LEVERAGING ADAPTIVE EXECUTION TRACING AND ANALYSIS

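Basic information

Grant number:
RGPIN-2021-04285
Principal investigator:
EzzatiJivan, Naser
Amount:
$17,500
Host institution:
Brock University
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2021
Funding country:
Canada
Start and end dates:
2021-01-01 to 2022-12-31
Project status:
Completed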

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=742959
Keywords:
OBSERVABILITY COMPENSATION PARADIGM LEVERAGING ADAPTIVE

Project abstract

Distributed systems are increasingly adopted across industry sectors and by the general public, in the form of cloud services, IoT devices, and smart vehicles. In these systems, a simple task such as asking about the weather or conducting an online financial transaction can involve several parallel modules running on different nodes, and repeated instantiations of the same task may execute over completely different sets of nodes. To verify execution correctness or diagnose the root cause of a runtime issue such as a delay, several nodes and execution layers must be observed and traced, potentially incurring large CPU, memory, and storage overhead. Processing and analyzing the resulting trace data is challenging in itself, because the collected data must be correlated, modeled, analyzed, and understood as a whole, especially for live production systems. There is therefore a need for new methodologies that balance the scope and resolution of tracing, and the level of observability it achieves, against the overhead it incurs on the system as a whole.
The long-term objective of the proposed research is to present a new paradigm that enables adaptive tailoring of software system observability according to the objectives at hand. The first short-term objective is to devise algorithms and strategies for dynamic, adaptive data collection and for dynamic optimization of the amount, speed, and resolution of the collected data, based on the learned profile or the current behavior of the system under investigation. This will ensure that the proposed methods collect just enough data around runtime operations and problems to perform correctness validation and problem analysis. The second objective is to develop incremental, machine-learning-based trace analysis and processing modules that efficiently model and understand runtime system behavior and provide input to the adaptive tracing system for iterative, online adjustment of trace collection. This will ensure that the desired level of observability is achieved while operational performance stays within an acceptable range despite the tracing overhead. The third objective is to develop methods and algorithms that correlate changes in performance with the extracted runtime behavioral models, for use in root cause analysis of performance misbehavior.
The significance and originality of this work lie in addressing fundamental challenges that software developers and administrators currently face in observing their systems. Many Canadian companies, from IT and phone providers to finance, media, transportation, and energy, have already moved to cloud, IoT, and edge services and would benefit from the proposed methods and results through increased software observability and efficiency and decreased maintenance costs.
