A framework for evaluating and explaining the robustness of NLP models

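Basic information

Grant number:
EP/X04162X/1
Principal investigator:
Oana Cocarascu
Amount:
$405,500
Host institution:
King's College London
Host institution country:
United Kingdom
Project category:
Research Grant
Fiscal year:
2024
Funding country:
United Kingdom
Start and end dates:
2024 to (no data)
Project status:
Not concluded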

Source:
https://gtr.ukri.org/projects?ref=EP%2FX04162X%2F1
Keywords:
framework evaluating explaining robustness NLP

Project abstract

The standard practice for evaluating the generalisation of supervised machine learning models in NLP tasks is to use previously unseen (i.e. held-out) data and report performance on it using metrics such as accuracy. Whilst metrics reported on held-out data summarise a model's performance, these results are ultimately aggregate statistics on benchmarks and do not reflect the nuances of model behaviour and robustness when models are applied in real-world systems.

We propose a robustness evaluation framework for NLP models concerned with arguments and facts, which encompasses explanations for robustness failures to support systematic and efficient evaluation. We will develop novel methods for simulating real-world texts from existing datasets, to help evaluate the stability and consistency of models when deployed in the wild. The simulation methods will challenge NLP models through text-based transformations and distribution shifts on datasets, as well as on data subsets that capture linguistic patterns, to provide systematic coverage of real-world linguistic phenomena. Furthermore, our framework will provide insight into a model's robustness by generating explanations for robustness failures along the lexical, morphological, and syntactic dimensions, extracted from the various dataset simulations and data subsets, thus departing from current approaches that solely provide a metric to quantify robustness. We will focus on two NLP research areas, argument mining and fact verification; however, several simulation methods and the robustness explanations can also be extended to other NLP tasks.
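The abstract does not specify how the text-based transformations or the robustness measurements will be implemented. As a purely illustrative sketch of the general idea, the Python below perturbs each input with a simple typo-style transformation and checks whether a model's prediction stays the same. Every name here (`swap_adjacent_chars`, `robustness_report`, `toy_model`) is hypothetical and not taken from the project, and the keyword-based classifier is only a stand-in for a real fact-verification model.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical type aliases for this sketch only.
Model = Callable[[str], str]
Transform = Callable[[str, random.Random], str]
Failure = Tuple[str, str, str, str]  # (original, perturbed, label_before, label_after)


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Simulate a typo by swapping one randomly chosen pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def robustness_report(
    model: Model, texts: List[str], transform: Transform, seed: int = 0
) -> Tuple[float, List[Failure]]:
    """Return the share of inputs whose label survives the transformation,
    together with the individual cases where the label changed."""
    rng = random.Random(seed)
    failures: List[Failure] = []
    for text in texts:
        perturbed = transform(text, rng)
        before, after = model(text), model(perturbed)
        if before != after:
            failures.append((text, perturbed, before, after))
    if not texts:
        return 1.0, failures
    consistency = 1.0 - len(failures) / len(texts)
    return consistency, failures


def toy_model(text: str) -> str:
    """Stand-in classifier (an assumption, not a real model): keyword lookup."""
    return "supported" if "confirmed" in text.lower() else "not_supported"


if __name__ == "__main__":
    claims = ["The study confirmed the claim.", "No evidence was found."]
    score, failed = robustness_report(toy_model, claims, swap_adjacent_chars)
    print(f"prediction consistency: {score:.2f}")
    for original, perturbed, before, after in failed:
        print(f"{original!r} -> {perturbed!r}: {before} -> {after}")
```

Returning the failing cases alongside the aggregate score loosely mirrors the abstract's point that a single robustness metric is not enough. The sketch does not attempt the lexical, morphological, or syntactic explanations the project proposes.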
