Test FLARE (Test Flakiness Automated Reproduction and Explanation)

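Basic information

Grant number:
EP/X024539/1
Principal investigator:
Philip McMinn
Amount:
$693,500
Host institution:
University of Sheffield
Host institution country:
United Kingdom
Project category:
Research Grant
Fiscal year:
2023
Funding country:
United Kingdom
Start and end dates:
2023 to (no data)
Project status:
Not concluded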

Source:
https://gtr.ukri.org/projects?ref=EP%2FX024539%2F1
Keywords:
Test FLARE Flakiness Automated Reproduction

Project abstract

The cost of software failures is a huge burden to the worldwide economy that was estimated to be at least £1.3 trillion in 2017. Consequently, software testing, a vital defence against failures, contributes to a large proportion of software development effort and cost. Flaky tests are a particular strain on resources allocated to software development, because they intermittently pass and fail without changes to tests or project code, with often maddening, non-obvious causes. Flaky tests are tests that fundamentally do not always tell the truth: they can fail when code is working, and pass when it isn't. Because developers can no longer trust the results of their tests, they are unable to gain confidence that software is working correctly, potentially exposing end-users to the consequences of software failures. Flaky tests are a common occurrence in industry, significantly disrupting software development, even for companies with the greatest amount of resources to tackle them, such as Microsoft, Facebook, and Google.

A test can produce different pass/fail (i.e., flaky) outcomes because of differing, unpredicted ways that the execution environment in which it runs interacts with its behaviour and/or the code that it tests. For instance, a machine may be experiencing a heavy concurrent task load, causing it to execute tests slowly, sometimes triggering timeouts in the code under test, and sometimes not. Or, network access is erratic on the testing infrastructure, meaning the availability of network resources may be compromised. Or, a program under test's logic is time and date dependent. These are just a few real examples of the different ways in which tests can be flaky. For some environmental conditions, the test passes, but in an alternative context, the same test fails.

To remove flaky test behaviour, a developer has to modify test code or the code that it tests to control for aspects of its execution environment; i.e., the potential sources of its intermittent behaviour. But to accurately assess the differences in code execution behaviour and the places in the code that need to be changed, a developer must be able to reliably reproduce the differing pass/fail test outcomes. However, this not only involves recreating the environmental conditions that lead to the flaky behaviour, but also figuring out exactly what the environmental conditions were that caused the flakiness in the first place. Solving these issues and reproducing flaky tests manually can be extremely challenging for developers since the environmental conditions concerned (a) are intermittent; and (b) may be unrelated to anything the test is actually checking, and/or far-removed from the code being tested. Existing research techniques are insufficient for addressing these problems, and despite developer incentives for removing flakiness, Google, for instance, reports an astonishing one in seven tests as flaky.

What the Test FLARE Project Will Do: The Test FLARE project will develop and empirically evaluate techniques capable of (1) automatically reproducing flaky behaviour that is due to the execution environment. It will also provide developers with (2) automated, human-readable explanations that help developers further understand the reasons for the flaky behaviour.
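As a rough illustration of the time-and-date dependency mentioned in the abstract, the Python sketch below shows a test whose outcome depends on the wall clock, and a variant that controls for that part of the execution environment by passing in a fixed time. It is not taken from the Test FLARE project; the function `is_office_open`, its opening hours, and both test functions are hypothetical names invented for this example.

```python
from datetime import datetime, time


def is_office_open(now=None):
    """Hypothetical code under test: 'open' from 09:00 up to, not including, 17:00."""
    if now is None:
        # Reading the wall clock makes the result depend on when the code runs.
        now = datetime.now()
    return time(9, 0) <= now.time() < time(17, 0)


def flaky_check():
    """Passes during the day and fails at night, with no change to any code."""
    return is_office_open()


def deterministic_check():
    """Controls the environment by injecting fixed, known timestamps."""
    during_hours = datetime(2023, 1, 2, 10, 30)
    after_hours = datetime(2023, 1, 2, 20, 0)
    return is_office_open(during_hours) and not is_office_open(after_hours)
```

Here `flaky_check` gives different results depending on the time of day at which it runs, whereas `deterministic_check` always evaluates the same two timestamps and so always gives the same result.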
