Interactive machine learning methods for clinical natural language processing

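Basic information

Project number:
8818096
Principal investigator:
HUA XU
Award amount:
Approximately $558,400
Host institution:
UNIVERSITY OF TEXAS HLTH SCI CTR HOUSTON
Host institution country:
United States
Project category:
Fiscal year:
2010
Funding country:
United States
Project period:
2010-05-31 to 2018-09-28
Project status:
Completed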

Source:
https://reporter.nih.gov/project-details/8818096
Keywords:
Abbreviations; Active Learning; Address; Adoption; Algorithms; Attention; Biomedical Research; Classification; Clinical; Clinical Data; Clinical Informatics; Clinical Research; Cognitive; Communities; Data; Data Set; Development; Disease; Educational workshop; Electronic Health Record; Face; Goals; Grant; Human; Hybrids; Knowledge; Label; Learning; Linguistics; Machine Learning; Manuals; Medical; Methodology; Methods; Modeling; Names; Natural Language Processing; Patients; Pattern; Performance; Pharmaceutical Preparations; Physicians; Process; Research; Research Personnel; Research Priority; Resources; Sampling; Solutions; Source; Specific qualifier value; Statistical Methods; Statistical Models; System; Technology; Testing; Text; Time; United States National Library of Medicine; base; clinical application; clinical phenotype; cohort; computer human interaction; computerized; cost; experience; improved; model development; novel; open source; statistics; success; tool; usability

Project abstract

DESCRIPTION (provided by applicant): Growing deployment of electronic health record (EHR) systems has made massive amounts of clinical data available electronically. However, much of the detailed clinical information about patients is embedded in narrative text and is not directly accessible to computerized clinical applications. Therefore, natural language processing (NLP) technologies, which can unlock information in narrative documents, have received great attention in the medical domain. Current state-of-the-art NLP approaches often involve building probabilistic models. However, the wide adoption of statistical methods in clinical NLP faces two grand challenges: 1) the lack of large annotated clinical corpora; and 2) the lack of methodologies that can efficiently integrate linguistic and domain knowledge with statistical learning. High-performance statistical NLP methods rely on large-scale, high-quality annotations of clinical text, but creating large annotated clinical corpora is time-consuming and costly, as it often requires manual review by physicians. Moreover, the medical domain is knowledge-intensive. To achieve optimal performance, probabilistic models need to leverage medical domain knowledge. Therefore, methods that can efficiently integrate domain and expert knowledge with machine learning processes to quickly build high-quality probabilistic models with minimal annotation cost would be highly desirable for clinical text processing. In this study, we propose to investigate interactive machine learning (IML) methods to address the above challenges in clinical NLP. An IML system builds a classification model iteratively: it actively selects informative samples for annotation based on models built from previously annotated samples, thus reducing the annotation cost of model development.
More importantly, an IML system also incorporates human input into the learning process (e.g., an expert can specify important features for a classification task based on domain knowledge). Thus, IML is an ideal framework for efficiently integrating rule-based (via domain experts specifying features) and statistics-based (via different learning algorithms) approaches to clinical NLP. To achieve our goal, we propose three specific aims. In Aim 1, we plan to investigate different aspects of IML for word sense disambiguation, including developing new active learning algorithms and conducting cognitive usability analysis for efficient feature annotation by users. To demonstrate the broad uses of IML, we further extend IML approaches to two other important clinical NLP classification tasks, named entity recognition and clinical phenotyping, in Aim 2. Finally, we propose to disseminate the IML methods and tools to the biomedical research community in Aim 3.
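
The iterative sample-selection loop described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the project's algorithms, so everything here is an assumption made for illustration: the selection strategy is plain pool-based uncertainty sampling, the classifier is a toy multinomial Naive Bayes over bag-of-words tokens, and the names `train_nb`, `predict_proba`, `most_uncertain`, `active_learning`, and `oracle` are hypothetical. The `oracle` callable stands in for the human annotator. The expert-specified feature input mentioned in the abstract is not modeled.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled):
    """Fit a toy multinomial Naive Bayes model on (tokens, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> token frequencies
    label_counts = Counter()            # label -> number of samples
    vocab = set()
    for tokens, label in labeled:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict_proba(model, tokens):
    """Return {label: posterior probability}, using add-one smoothing."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        # The extra +1 reserves probability mass for unseen tokens.
        denom = sum(word_counts[label].values()) + len(vocab) + 1
        score = math.log(n / total)
        for tok in tokens:
            score += math.log((word_counts[label][tok] + 1) / denom)
        log_scores[label] = score
    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}


def most_uncertain(model, pool):
    """Uncertainty sampling: index of the pool item whose top-class
    probability is lowest under the current model."""
    return min(
        range(len(pool)),
        key=lambda i: max(predict_proba(model, pool[i]).values()),
    )


def active_learning(seed, pool, oracle, budget):
    """Repeatedly retrain, query the least certain sample, and add its
    label (from the hypothetical annotator `oracle`) to the training set."""
    labeled = list(seed)
    pool = list(pool)
    for _ in range(min(budget, len(pool))):
        model = train_nb(labeled)
        tokens = pool.pop(most_uncertain(model, pool))
        labeled.append((tokens, oracle(tokens)))
    return train_nb(labeled), labeled
```

In a real annotation tool the `oracle` call would be an interactive prompt to a domain expert, and `budget` would cap how many samples that expert is asked to label. The toy classifier could be swapped for any model that exposes class probabilities.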
