
Natural Language Processing for Cancer Research Network Surveillance Studies


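Basic information

Grant number:
7944035
Principal investigator:
DAVID S. CARRELL
Amount:
$494,500
Host institution:
KAISER FOUNDATION HEALTH PLAN OF WASHINGTON
Host institution country:
United States
Project category:
Fiscal year:
2009
Funding country:
United States
Project period:
2009-09-30 to 2012-08-31
Project status:
Completed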

Source:
https://reporter.nih.gov/project-details/7944035
Keywords:
Address Adopted Adoption Adverse effects Algorithms Applied Research Area Arts Bioinformatics Breeding Cancer Research Network Charge Clinic Clinical Clinical Data Clinical Research Collaborations Complex Comprehensive Health Care Computer software Computerized Medical Record Consultations Data Data Element Data Quality Development Diagnosis Disease Disease Progression Doctor of Philosophy Environment Epidemiology Exercise Future Generic Drugs Hand Health Health Planning Health system plans Healthcare Human Resources Individual Informatics Information Systems Information Technology Institution Knowledge Knowledge Extraction Learning Licensing Life Malignant Neoplasms Manuals Measurement Medical center Methods Mining Modeling NCI Center for Cancer Research Natural Language Processing Operating System Outcome Participant Patients Pharmaceutical Preparations Population Positioning Attribute Process Public Health Recording of previous events Recurrence Research Research Infrastructure Research Personnel Research Project Grants Residual state Resources Risk Site Solutions Strategic Planning System Technology Testing Text Therapeutic Training Treatment Effectiveness Universities Vision Woman base biomedical informatics breast cancer diagnosis cost design experience feeding firewall flexibility functional status human capital improved innovation innovative technologies malignant breast neoplasm novel open source patient privacy programs repository skills software systems surveillance study text searching tool

Project abstract

DESCRIPTION (provided by applicant): This application addresses Broad Challenge Area: (10) Information Technology for Processing Health Care Data and specific Challenge Topic: 10-CA-107 Expand Spectrum of Cancer Surveillance through Informatics Approaches. The proposed project launches a collaborative effort to advance adoption within the HMO Cancer Research Network (CRN) of "industrial-strength" natural language processing (NLP) systems useful for mining valuable, research-grade information from unstructured clinical text. Such text is now available for processing in the electronic medical record (EMR) systems of affiliated CRN health plans. The proposed NLP methods will create ongoing capacity to tap what has recently been described as "a treasure trove of historical unstructured data that provides essential information for the study of disease progression, treatment effectiveness and long-term outcomes" (5). The vision of advancing widespread NLP capacity across the CRN, as well as the approach we present here for implementing it, grew out of an in-depth strategic planning effort we completed in December 2008. That effort involved participants from six CRN sites guided by a blue-ribbon panel of NLP experts from three of the nation's leading centers of clinical NLP research: University of Pittsburgh Medical Center, Vanderbilt University, and Mayo Clinic. The vision is to deploy a powerful NLP system locally, manage it with newly hired and trained local NLP technical staff, and conduct NLP-based research projects initiated by local investigators, in consultation with higher-level external NLP experts. Our planning efforts suggest this collaborative model is feasible; we will test the model in the context of the proposed project.
An important development in April 2009 yielded what we believe is a potentially transformative opportunity to accelerate adoption of NLP capacity in applied research settings: release of the open-source Clinical Text Analysis and Knowledge Extraction System (cTAKES) software. This software was the result of a collaborative effort between IBM and Mayo Clinic. Built on the same framework Mayo Clinic currently uses to process its repository of over 40 million clinical documents, cTAKES dramatically lowers the cost of adopting a comprehensive and flexible NLP system. Deployment and use of such systems was previously only feasible in institutions with large, academically-oriented biomedical informatics research programs. Still, other deployment challenges and the need to acquire NLP training for local staff present residual barriers to adopting comprehensive NLP systems such as cTAKES. In collaboration with five other CRN sites the proposed project mitigates these challenges in two ways: 1) it develops configurable open-source software modules needed to streamline and therefore reduce the cost of deploying cTAKES, and 2) it presents and tests a model for training local staff through hands-on NLP projects overseen by outside NLP expert consultants. The potential impact of this project is evident most clearly in the vast untapped opportunities for text mining represented in CRN-affiliated health plans, where EMR systems have been in place since at least 2005, and whose patients represent 4% of the U.S. population. Clinical text mining offers the potential to provide new or improved data elements for cancer surveillance and other types of research requiring information about patient functional status, medication side-effects, details of therapeutic approaches, and differential information about clinical findings. 
Another significant impact of this project is its plan to integrate into the cTAKES system an open-source de-identification tool based on state-of-the-art, best-of-breed NLP approaches developed by the MITRE Corporation. De-identification of clinical text will make it easier for researchers to get access to clinical text, and will also facilitate multi-site collaborations while protecting patient privacy. Finally, if successful, the NLP algorithm we propose as a proof-of-principle project at Group Health, which will classify sets of patient charts as either containing or not containing a diagnosis of recurrent breast cancer, could dramatically reduce the cost of research in this area; currently all recurrent breast cancer endpoints must be established through costly manual chart abstraction. Novel aspects of the proposed project include its talented and transdisciplinary research team, including national experts in NLP, and its resourceful strategy for building the technical resources and "human capital" needed to support an ongoing program of applied NLP research. Natural language processing is itself a highly innovative technology; when successfully established in multiple CRN sites in the future it will represent a watershed moment in the CRN's already impressive history of exploiting data systems to support innovative research. Newly hired staff positions total approximately 2.0 FTE in each project year, most of which we anticipate will be supported by ongoing new research programs after the proposed project concludes.
Project narrative: The proposed project develops new measurement technologies, based on natural language processing approaches, for extracting information about disease processes and treatment that is currently documented only in clinical text. Because these methods are generic they will potentially contribute to public health by advancing research in a wide variety of areas. The "proof of principle" algorithm developed in the project to identify recurrent breast cancer diagnoses will advance epidemiologic and clinical research pertaining to the 2.5 million women currently living with breast cancer.
