POET: Consolidated, Comprehensive Clinical Text Preprocessing

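Basic Information

Grant number:
7570254
Principal investigator:
JOHN F. HURDLE
Amount:
$169,300
Host institution:
UNIVERSITY OF UTAH
Host institution country:
United States
Project category:
Fiscal year:
2008
Funding country:
United States
Project period:
2008-09-30 to 2010-08-31
Project status:
Completed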

Source:
https://reporter.nih.gov/project-details/7570254
Keywords:
Abbreviations; Adverse event; Algorithms; Architecture; Body of uterus; Clinical; Clinical Pharmacists; Clinical Research; Computer Systems; Consult; Data; Development; Discipline; Discipline of Nursing; Electronic Health Record; Ensure; Excision; Hand; Internet; Java; Laboratories; Licensing; Linguistics; Literature; Medical; Mind; Mining; Natural Language Processing; Nature; Nurses; Output; Paste substance; Pathology; Pathology Report; Pharmacy facility; Physical assessment; Process; PubMed; Publishing; Radiology Specialty; Report (document); Reporting; Research; Research Ethics Committees; Research Personnel; Resolution; Services; Source; Specific qualifier value; Standards of Weights and Measures; Structure; Study Section; System; Testing; Text; Thinking; Unified Medical Language System; Vocabulary; Work; Writing; abstracting; base; data mining; design; discrete data; interest; novel; open source; programs; spelling; tool; trend

Project Abstract

DESCRIPTION (provided by applicant): As electronic health records (EHRs) continue their expansion into clinical settings, there has been a corresponding increase in interest in mining the data they contain, both for research and for clinical decision support. Informaticists are increasingly studying ways to mine EHR textual content. This is an important trend, because there is a wealth of information contained in clinical text that is not represented anywhere else in the EHR. A low-level text-as-data issue presents a significant obstacle to the widespread use of available medical NLP systems: hand-typed clinical narratives in EHRs are usually ungrammatical; short or telegraphic in style; full of abbreviations, acronyms, and misspellings; formatted in a templated or pseudo-tabular form; and mixed with embedded non-text, such as a list of laboratory values cut and pasted from elsewhere in the EHR. As we show in the Preliminary Studies Section, this makes high-level processing by popular tools like MedLEE and MetaMap effectively useless for all but a few "clean" document types like discharge summaries or consult reports (e.g., pathology or radiology reports). This in turn explains why so little has been published about what is certainly the preponderance of clinical texts: those that are not as well-behaved lexically and syntactically as a discharge summary. In this application we distinguish clinical narratives (e.g., a progress note) from biomedical narratives (e.g., a PubMed abstract). We are interested in texts that arise in the clinical or research setting and that are composed by clinicians and researchers directly into a computer system. We propose to build and publish a tool called POET (Parsable Output Extracted from Text). POET will be designed to accept unstructured textual documents and return structured, linguistic equivalents that are, to the extent possible, parsable by higher-level NLP engines.
POET will have a modular, extensible architecture based on open-source platforms and sources (e.g., Java, Perl, UMLS, NegEx, the Stanford Parser, HL7 Clinical Document Architecture, and caGRID). To implement POET, we will collect, program, and evaluate both published and novel algorithms for: acronym/abbreviation resolution; spelling correction; template and pseudo-table re-writing; and removal of embedded non-text. To test POET we will use a large corpus of cross-discipline (e.g., medical, nursing, and pharmacy) clinical note types, as well as two kinds of clinical research text: MedWatch reports and IRB adverse event reports. Development will combine best practices found in the literature with new research undertaken as part of the project. To validate the fidelity of POET processing, we plan a formal analysis of information loss and information gain before and after processing. To ensure broad access to the tools, POET will be released under an open-source license. Finally, we plan to assess the feasibility of offering POET as a Web service for remote processing.
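
The four preprocessing stages named above can be pictured as a chain of independent text-to-text steps. The following is a minimal sketch only, not POET's implementation (the abstract describes a plan and publishes no code). Every name here (`ABBREVIATIONS`, `MISSPELLINGS`, `LAB_LINE`, `TEMPLATE_LINE`, `preprocess`, and the step functions) is hypothetical. Simple dictionary lookups and regular expressions stand in for the published and novel algorithms the project proposes to evaluate.

```python
import re
from typing import Callable, Dict, List

# Hypothetical lookup tables, for illustration only. Real abbreviation
# resolution is context-dependent and would need far richer resources.
ABBREVIATIONS: Dict[str, str] = {"pt": "patient", "hx": "history"}
MISSPELLINGS: Dict[str, str] = {"diabetis": "diabetes"}

# Heuristic (assumption): a line made up solely of NAME NUMBER pairs,
# such as pasted laboratory values, counts as embedded non-text.
LAB_LINE = re.compile(r"^\s*(?:[A-Za-z][A-Za-z0-9]*\s+\d+(?:\.\d+)?\s*)+$")
# Heuristic (assumption): a templated line has the form "Label: value".
TEMPLATE_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z ]*):\s*(\S.*)$")

Step = Callable[[str], str]


def remove_non_text(text: str) -> str:
    """Drop lines that look like pasted lab values."""
    return "\n".join(l for l in text.splitlines() if not LAB_LINE.match(l))


def rewrite_templates(text: str) -> str:
    """Turn 'Label: value' lines into short sentences."""
    out = []
    for line in text.splitlines():
        m = TEMPLATE_LINE.match(line)
        if m:
            label, value = m.group(1).strip(), m.group(2).rstrip(" .")
            line = f"{label} is {value}."
        out.append(line)
    return "\n".join(out)


def _replace_words(text: str, table: Dict[str, str]) -> str:
    """Replace each alphabetic word that has an entry in the table."""
    return re.sub(r"[A-Za-z]+",
                  lambda m: table.get(m.group(0).lower(), m.group(0)),
                  text)


def expand_abbreviations(text: str) -> str:
    return _replace_words(text, ABBREVIATIONS)


def correct_spelling(text: str) -> str:
    return _replace_words(text, MISSPELLINGS)


# Steps are plain functions, so they can be reordered, removed, or extended.
PIPELINE: List[Step] = [remove_non_text, rewrite_templates,
                        expand_abbreviations, correct_spelling]


def preprocess(text: str, steps: List[Step] = PIPELINE) -> str:
    for step in steps:
        text = step(text)
    return text


note = "pt has hx of diabetis\nNa 140 K 4.1 Cl 102\nDiet: regular"
print(preprocess(note))
```

Keeping each stage as a separate function mirrors the modular, extensible design the abstract calls for. The lab-value heuristic is deliberately crude: it would also discard a legitimate line such as "age 45", which is exactly the kind of information loss the proposed fidelity analysis is meant to measure.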
