
Leveraging Unlabeled and Pseudo Data for Clinical Information Extraction

Basic Information

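Grant number:
9813134
Principal investigator:
Ozlem Uzuner
Amount:
approximately $414,800
Host institution:
GEORGE MASON UNIVERSITY
Host institution country:
United States
Project category:
Fiscal year:
2019
Funding country:
United States
Project period:
2019-08-01 to 2022-07-31
Project status:
Completed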

Source:
https://reporter.nih.gov/project-details/9813134
Keywords:
Accident and Emergency department; Address; Adverse drug event; Affect; Clinic; Clinical; Clinical Data; Communities; Computer software; Data; Data Set; Development; Discipline of Nursing; Electronic Health Record; Engineering; Evaluation; Frequencies; Gold; Growth; Healthcare; Hospitals; Institution; Israel; Knowledge; Label; Learning; Linguistics; Location; Machine Learning; Measures; Medical; Medical center; Methods; Modeling; Names; Natural Language Processing; Nature; Outcome; Pattern; Performance; Persons; Pharmaceutical Preparations; Plant Roots; Procedures; Psychiatry; Publications; Records; Reporting; Research; Resources; Route; Sampling; Semantics; Signs and Symptoms; Social Work; Structure; Supervision; System; Systems Development; Task Performances; Telephone; Test Result; Testing; Text; Thinness; Time; Training; Universities; Variant; Virginia; Washington; computerized; deep learning; dosage; field study; improved; learning strategy; medication administration; novel; open source; response; supervised learning; tool

Project Abstract

Electronic Health Records (EHRs) contain significant information that can benefit many downstream uses. However, most of this information is in unstructured narrative form and is inaccessible to computerized methods that rely on structured representations for exploring, retrieving, and presenting the information. Natural language processing (NLP) and information extraction (IE) open this trove of information to studies that would otherwise be without it. Over the past decades, many IE systems have been developed. These systems have typically focused on one task at a time. In addition, most have studied only specific types of records, e.g., discharge summaries, and addressed their task on data from a single institution. The performance of state-of-the-art IE systems developed under these conditions has ranged from 44% F-measure to 99% F-measure. This observed variation can be attributed to the nature of the tasks: some target entities, like dates, tend to be better represented in the data and stick more rigidly to known patterns of expression, whereas others, such as reasons for medication administration, are relatively sparse in the data and can show wider linguistic diversity. However, this may not be the only reason: the data used can also explain the performance variation. EHR narratives vary in style, format, and content from one department to another and from one hospital to another. Even the same record type in two different hospitals can be very different in narrative style and pose different challenges for IE. Understanding IE performance therefore requires studies of multiple tasks on multiple record types that come from multiple institutions. One major bottleneck for evaluation of IE systems on such a large scale is annotation. The same bottleneck also limits system development. This proposal aims to address this bottleneck for both evaluation and development.
It first generates a multi-institution corpus consisting of multiple record types from five institutions. It studies four different IE tasks that broadly represent IE in clinical records and can inform the field of IE as a whole: de-identification, clinical concept extraction, medication extraction, and adverse drug event extraction. Within the context of these IE tasks, the proposal then puts forward methods that learn from unlabeled or pseudo data that can help alleviate reliance on annotated data for development. It evaluates these methods both for performance and generalizability on multiple types of records from multiple institutions. As a result of these activities, this proposal generates de-identified data, annotations, methods, software, and machine learning models which it then makes available to the research community.
