Semi-structured Information Retrieval in Clinical Text for Cohort Identification

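Basic Information

Grant number: 8811565
Principal investigator: HONGFANG LIU
Amount: $460,700
Institution: MAYO CLINIC ROCHESTER
Institution country: United States
Project category:
Fiscal year: 2014
Funding country: United States
Project period: 2014-09-20 to 2019-07-31
Project status: Completed

Source: https://reporter.nih.gov/project-details/8811565
Keywords: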
Accounting Address Adopted Adoption Asthma Clinic Clinical Collection Communities Computerized Medical Record Computers Data Dictionary Disease Electronic Health Record Epidemiologist Epidemiology Evaluation Event Evidence Based Medicine Evolution Goals Health Information Retrieval Information Retrieval Systems Institution Interest Group Investigation Judgment Language Learning Machine Learning Measures Medical Medical Records Methodology Methods Metric Modeling Modification Morphologic artifacts Names Natural Language Processing Outcome Patient Recruitments Patients Performance Pharmaceutical Preparations Phase Physicians Process Publishing Qualifying Records Research Research Personnel Resources Rest Retrieval Sampling Semantics Site Smoke Source Specific qualifier value Structure System Techniques Testing Text Validation Weight Work Writing asthmatic patient base cohort improved indexing novel open source public health relevance syntax text searching tool

Project Abstract

DESCRIPTION (provided by applicant): Natural Language Processing (NLP) techniques have shown promise for extracting data from the free text of electronic health records (EHRs), but studies have consistently found that techniques do not readily generalize across application settings. Unfortunately, most of the focus in applying NLP to real use cases has remained on a paradigm of single, well-defined application settings, so that generalizability to unseen use cases remains implicitly unaddressed. We propose to explicitly account for unseen application settings by adopting an information retrieval (IR) perspective with the objective of patient-level cohort identification. To do so, we introduce layered language models, an IR framework that enables the reuse of NLP-produced artifacts. Our long-term goal is to accelerate investigations of patient health and disease by providing robust, user-centric tools that are necessary to process, retrieve, and utilize the free text of EHRs. The main goal of this proposal is to accurately retrieve ad hoc, realistic cohorts from clinical text at Mayo Clinic and OHSU, establishing methods, resources, and evaluation for patient-level IR. We hypothesize that cohort identification can be addressed in a generalizable fashion by a new IR framework: layered language models.

We will test this hypothesis through four specific aims. In Aim 1, we will make medical NLP artifacts searchable in our layered language IR framework. This involves storing and indexing the NLP artifacts, as well as using statistical language models to retrieve documents based on text and its associated NLP artifacts. In Aim 2, we deal with the practical setting of ad hoc cohort identification, moving to patient-level (rather than document-level) IR. To accurately handle patient cohorts in which qualifying evidence may be spread over multiple documents, we will develop and implement patient-level retrieval models that account for cross-document relational and temporal combinations of events. In Aim 3, we will construct parallel IR test collections using EHR data from two sites; a diverse set of cohort queries written by multiple people toward various clinical or epidemiological ends; and assessments of which patients are relevant to which queries at both sites. Finally, in Aim 4, we refine and evaluate patient-level layered language IR on the ad hoc cohort identification task, making comparisons across the users, queries, optimization metrics, and institutions. We will draw additional extrinsic comparisons with pre-existing techniques, e.g., for cohorts from the Electronic Medical Records and Genomics network.

The expected outcomes of the proposed work are: (i) an open-source cohort identification tool, usable by clinicians and epidemiologists, that makes principled use of NLP artifacts for unseen queries; (ii) a parallel test collection for cohort identification, including two intra-institutional document collections, diverse test topics and user-produced text queries, and patient-level judgments of relevance to each query; and (iii) validation of the reusability of medical NLP via the task of retrieving patient cohorts.
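The abstract does not specify how layered language models are defined, so the following is only a rough illustration of the general idea behind Aims 1 and 2, not the project's method. It assumes each document carries a plain-text layer plus a hypothetical "concept" layer of NLP-produced artifacts, scores each layer with a standard Dirichlet-smoothed query-likelihood model, combines layers by a weighted sum of log-likelihoods, and lifts document scores to the patient level by taking each patient's best-scoring document. The layer names, weights, toy records, and the max-aggregation rule are all assumptions made for this sketch; in particular, it does not model the cross-document relational or temporal combinations mentioned in Aim 2.

```python
import math
from collections import Counter


def collection_stats(docs, layer):
    """Term counts and total length of one layer across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc["layers"].get(layer, []))
    return counts, sum(counts.values())


def dirichlet_log_likelihood(query_terms, doc_terms, coll_counts, coll_len, mu):
    """Log query likelihood under a Dirichlet-smoothed unigram language model."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        # Add-one smoothing of the collection model so unseen terms avoid log(0).
        p_coll = (coll_counts.get(term, 0) + 1) / (coll_len + len(coll_counts) + 1)
        score += math.log((doc_counts[term] + mu * p_coll) / (doc_len + mu))
    return score


def score_document(query, doc, stats, weights, mu):
    """Weighted sum of per-layer log-likelihoods (an assumed combination rule)."""
    total = 0.0
    for layer, weight in weights.items():
        coll_counts, coll_len = stats[layer]
        total += weight * dirichlet_log_likelihood(
            query.get(layer, []), doc["layers"].get(layer, []),
            coll_counts, coll_len, mu,
        )
    return total


def rank_patients(query, docs, weights, mu=10.0):
    """Rank patients by their best-scoring document (assumed aggregation)."""
    stats = {layer: collection_stats(docs, layer) for layer in weights}
    best = {}
    for doc in docs:
        s = score_document(query, doc, stats, weights, mu)
        pid = doc["patient"]
        if pid not in best or s > best[pid]:
            best[pid] = s
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy records: a text layer and an NLP "concept" layer per note.
DOCS = [
    {"patient": "p1", "layers": {"text": ["patient", "has", "asthma", "and", "smokes"],
                                 "concept": ["asthma", "smoker"]}},
    {"patient": "p1", "layers": {"text": ["follow", "up", "visit"], "concept": []}},
    {"patient": "p2", "layers": {"text": ["no", "history", "of", "asthma"],
                                 "concept": ["asthma_negated"]}},
    {"patient": "p3", "layers": {"text": ["knee", "pain"], "concept": ["knee_pain"]}},
]
QUERY = {"text": ["asthma"], "concept": ["asthma"]}
WEIGHTS = {"text": 0.5, "concept": 0.5}

ranking = rank_patients(QUERY, DOCS, WEIGHTS)
for patient_id, score in ranking:
    print(patient_id, round(score, 3))
```

In this toy setup the concept layer is what separates the patient with an affirmed asthma mention from the one whose note negates it, even though both notes contain the word "asthma"; that is the kind of reuse of NLP artifacts the abstract argues for, shown here in a deliberately simplified form.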
