Mining Software Test Histories to Identify Flaky Tests and Model Commit Failure Risk

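Basic Information

Grant number:
RGPIN-2020-06807
Principal investigator:
Rigby, Peter
Amount:
$21.1 thousand
Host institution:
Concordia University
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2021
Funding country:
Canada
Start and end dates:
2021-01-01 to 2022-12-31
Project status:
Completed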

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=741443
Keywords:
Mining Software Test Histories Identify

Project Abstract

My research begins by listening to the daily struggles of software developers and ultimately provides theoretically grounded, general solutions to common software testing problems.

Objective 1, Flaky tests: Flaky software tests fail without identifying a fault, which wastes developer time and reduces confidence in test outcomes. Flaky tests affect all projects and are particularly damaging on large projects or in complex test environments. The conventional practice is to re-run flaky tests or to quarantine them and require developers to fix them. However, I have observed that flaky tests have varying degrees of flakiness. For example, a test that has 1 flaky failure in 100 runs is more reliable than a test that has 10 flaky failures in 100 runs. By quantifying the historical FlakeRate, e.g. 10/100, we can use statistical tests to determine whether a test is failing at its normal FlakeRate, 10%, or whether its FlakeRate has changed, indicating instability and a potential product fault. My objective is to quantify the degree of test flakiness from prior runs and statistically identify changes in the FlakeRate as indications that a test has become unstable. Tool support will allow developers to rank tests by FlakeRate instability. This should reduce investigation effort because flaky failures that are not accompanied by a statistically significant change in FlakeRate are stable and less likely to need investigation.

Objective 2, Commit failure risk: A common development practice is to test each commit individually so that the commit that caused a test failure is isolated immediately. Since most tests do not fail and most commits pass all tests, we propose to batch commits based on how likely a commit is to cause a test failure. A batch that contains 10 commits will require 1 run of the test suite when all tests pass, saving 9 executions over testing each commit individually. However, on failure, a bisection process will require additional runs.
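
The batching arithmetic above can be illustrated with a minimal sketch. It is not the project's actual tooling: the function name `count_runs`, the simple recursive bisection, and the assumption that any subset of a batch can be tested on its own are all hypothetical simplifications introduced here.

```python
from typing import Sequence


def count_runs(breaks_tests: Sequence[bool]) -> int:
    """Count test-suite runs needed to locate every failing commit in a batch.

    breaks_tests[i] is True if commit i would make the suite fail.
    Assumption (not from the abstract): a batch fails exactly when it
    contains at least one such commit, and any sub-batch can be run alone.
    """
    runs = 1  # one run of the suite on the whole (sub-)batch
    if not any(breaks_tests) or len(breaks_tests) == 1:
        # Either the batch passed, or a single commit has been isolated.
        return runs
    mid = len(breaks_tests) // 2
    # The batch failed: bisect and test each half separately.
    return runs + count_runs(breaks_tests[:mid]) + count_runs(breaks_tests[mid:])


all_pass = [False] * 10
one_culprit = [False, False, False, True] + [False] * 6

print(count_runs(all_pass))      # 1 run instead of 10
print(count_runs(one_culprit))   # 9 runs under this naive bisection
```

With ten passing commits the batch needs 1 run, matching the saving of 9 executions described above. With a single failing commit, this naive bisection needs 9 runs, so most of the saving disappears, which is why the choice of which commits to batch matters.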
We will adapt commit risk models that identify bug-introducing changes to the context of selecting low-risk commits to be included in large batches and high-risk commits that should be tested individually. The adaptation and development of these statistical approaches is complex, and evaluating them in production CI environments requires systematic rigor.

Objective 3, Generalize and Tools: In the long term, I will increase the use of software analytics by software testers and developers by simulating and demonstrating improved efficiency in a wide range of software environments. In my experience, developers have a single focus: reducing faults and releasing software on time. This focus has limited the use of mining and analytics in practice. By focusing on developer problems, I have been able to, and will continue to, provide general solutions to time-consuming, unpleasant, and expensive development problems.
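
The FlakeRate comparison described under Objective 1 can be sketched as follows. The abstract does not name a specific statistical test, so the one-sided exact binomial test used here is an assumption, as are the function names, the 0.05 significance threshold, and the example counts. The sketch only checks for an increase in FlakeRate, not for a decrease.

```python
from math import comb
from typing import Tuple


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def flake_rate_increased(
    hist_failures: int,
    hist_runs: int,
    recent_failures: int,
    recent_runs: int,
    alpha: float = 0.05,  # hypothetical significance threshold
) -> Tuple[bool, float]:
    """Test whether recent flaky failures exceed the historical FlakeRate.

    Assumption: runs are independent and the historical FlakeRate is
    treated as the known failure probability under the null hypothesis.
    """
    flake_rate = hist_failures / hist_runs  # e.g. 10/100 = 0.10
    p_value = binomial_upper_tail(recent_failures, recent_runs, flake_rate)
    return p_value < alpha, p_value


# Historical FlakeRate of 10/100, as in the abstract's example.
stable, _ = flake_rate_increased(10, 100, recent_failures=2, recent_runs=20)
unstable, _ = flake_rate_increased(10, 100, recent_failures=8, recent_runs=20)
print(stable)    # False: 2 failures in 20 runs is consistent with 10%
print(unstable)  # True: 8 failures in 20 runs is unlikely at 10%
```

Sorting tests by the returned p-value would give one possible ranking by FlakeRate instability, with the smallest p-values investigated first.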
