DeconDTN: Deconfounding Deep Transformer Networks for Clinical NLP

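Basic Information

Grant number:
10467107
Principal investigator:
Trevor Cohen
Amount:
approx. $345,300
Host institution:
UNIVERSITY OF WASHINGTON
Host institution country:
United States
Project category:
Fiscal year:
2022
Funding country:
United States
Project period:
2022-06-01 to 2026-02-28
Project status:
Ongoing

Source:
https://reporter.nih.gov/project-details/10467107
Keywords: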
Address Architecture Area Artificial Intelligence Automobile Driving Behavior Bridge to Artificial Intelligence COVID-19 Caring Characteristics Classification Clinical Clinical Services Cognitive Computer software Confounding Factors (Epidemiology) Coupled Data Data Aggregation Data Set Data Sources Dementia Development Diagnosis Diagnostic Ensure Equilibrium Evaluation Goals High Prevalence Individual Institution Investments Label Language Learning Linguistics Location Medical Methods Modeling Modification Natural Language Processing Nature Neural Network Simulation Outcome Output Participant Patients Performance Physicians Predictive text Prevalence Research SARS-CoV-2 positive Sampling Services Site Source Speech Systematic Bias Testing Text Time Training Transcript United States National Institutes of Health United States National Library of Medicine Update Vision Weight Work base coronavirus disease deep learning deep learning model design heterogenous data interest large datasets learning strategy loss of function machine learning model network models novel open source open source tool portability predictive modeling programs relating to nervous system statistical and machine learning

Project Abstract

Natural Language Processing (NLP) methods have been broadly applied to clinical problems, from recognition of clinical findings in physician notes to identification of transcribed speech samples indicating changes in cognitive status. Deep transformer networks (DTNs) have dramatically advanced NLP accuracy. These deep learning models have multiple hidden layers that may correspond to billions of trainable parameters, allowing them to apply information learned from training on large unlabeled corpora to a specific task of interest. However, their size leaves them especially vulnerable to confounding bias, induced by variables that can influence both the predictor (text) and the outcome (e.g. an associated diagnosis) of a predictive model. Such systematic biases are a recognized danger in the application of artificial intelligence methods to clinical problems, and are the focus of NLM NOT-LM-19-003 which invites applications proposing methods to identify and address them. Deep learning models in general require large amounts of training data, spurring initiatives to aggregate medical data from across institutional siloes. This can increase data set size and enhance model portability, but leaves the resulting models vulnerable to confounding by provenance, where models learn to recognize the origin of dataset components and make biased predictions based on site-specific class distributions (e.g. COVID prevalence). Such models will assign classes based on indicators of dataset provenance, rather than diagnostically meaningful linguistic differences, and make erroneous predictions when the provenance-specific distributions at the point of deployment differ from those in the training set. Confounding of this nature is a pervasive problem that presents a fundamental barrier to the portability of trained models, and threatens the utility of datasets assembled from across institutions and services. 
Unlike traditional statistical and machine learning models, deep transformer networks distribute feature representations across parameters spread throughout the entire network. New methods are needed to identify and mitigate the influence of confounding variables in such models. In the proposed research we will develop a systematic approach to Deconfounding Deep Transformer Networks (DeconDTN), embodied in an eponymous and publicly available set of open source tools for (1) identification of provenance-related biases, (2) mitigation of these biases using a novel set of validated methods, and (3) systematic evaluation of the resulting effects on model performance. While DeconDTN will be generally applicable, development and evaluation will occur in the context of three use cases involving data sets drawn from different sources: classification of speech transcripts from participants with dementia drawn from two locations, identification of goals-of-care discussions in clinical notes drawn from multiple studies involving a range of clinical services, and prediction of COVID-19 status in notes drawn from different clinical units. Our driving hypothesis is that the resulting models will make more accurate predictions in these heterogeneous datasets than corresponding models without correction for confounding by provenance.
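
The failure mode the abstract calls confounding by provenance can be illustrated with a small calculation. The sketch below is not part of DeconDTN and does not reproduce any of the project's methods; the site names, prevalence figures, and the function `provenance_only_accuracy` are hypothetical, chosen only to show how a model that has learned to recognize a note's site of origin, rather than diagnostically meaningful language, degrades when site-specific class distributions change at deployment.

```python
def provenance_only_accuracy(train_prev, deploy_prev, site_share):
    """Expected accuracy of a shortcut classifier that looks only at the site.

    train_prev / deploy_prev: fraction of positive cases per site (hypothetical).
    site_share: fraction of deployment examples coming from each site.
    """
    accuracy = 0.0
    for site, share in site_share.items():
        # Shortcut learned in training: predict the majority class of the site.
        predict_positive = train_prev[site] >= 0.5
        # Probability that this fixed prediction is correct at deployment.
        if predict_positive:
            p_correct = deploy_prev[site]
        else:
            p_correct = 1.0 - deploy_prev[site]
        accuracy += share * p_correct
    return accuracy


# Hypothetical numbers: site_A is mostly positive, site_B mostly negative.
train_prev = {"site_A": 0.9, "site_B": 0.1}
site_share = {"site_A": 0.5, "site_B": 0.5}

# Deployment matches training: the shortcut looks excellent.
same = provenance_only_accuracy(train_prev, train_prev, site_share)

# Deployment prevalences are swapped: the same shortcut is mostly wrong.
shifted_prev = {"site_A": 0.1, "site_B": 0.9}
shifted = provenance_only_accuracy(train_prev, shifted_prev, site_share)

print(f"accuracy, same distribution:    {same:.2f}")
print(f"accuracy, shifted distribution: {shifted:.2f}")
```

Under these assumed numbers the shortcut rule scores about 0.90 while deployment matches training and about 0.10 once the per-site prevalences are swapped, even though nothing about the language of the notes has changed.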

Natural language processing (NLP) methods have been widely applied to clinical problems, from recognizing