Natural language processing in healthcare data

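Basic information

Grant number:
RGPIN-2019-04701
Principal investigator:
Rudzicz, Frank
Amount:
$16,800
Host institution:
University of Toronto
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2019
Funding country:
Canada
Start and end dates:
2019-01-01 to 2020-12-31
Project status:
Completed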

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=679354
Keywords:
Natural language processing healthcare data

Project abstract

Word embeddings (i.e., 'word vectors' or 'distributed representations') are dense numeric representations of words, which serve as input to various statistical machine learning methods. Typically, by optimizing contextual statistics, these embeddings induce latent dimensions that encode aspects of morphology, syntax, and even semantics. The results can therefore capture meaningful relationships among concepts in the data not afforded by traditional methods.

The Vector Institute is partnering with the Institute for Clinical Evaluative Sciences (ICES) around the collaborative use of the EMRALD corpus, which consists of text from a variety of primary care sources (e.g., consult notes, referrals, risk factors, past medical history) sourced from hundreds of doctors in Ontario. EMRALD is an order of magnitude larger, in vocabulary and overall size, than Google's news corpus, which is one of the de facto corpora used for training embeddings. Currently, the extremely large vocabulary size appears to produce two main consequences: i) a preponderance of technical terms and their many variants, and ii) spelling mistakes. These consequences lead to very sparse contextual matrices.

We have three primary goals in this program of research:

1) To enrich word embeddings with ontological information. Our team has developed a method of 'enriching' embeddings using a multi-task learning approach and normative lexical data from crowd-sourced statistics. For example, enriching the embedding process with norms of sentiment increases the accuracy not only of sentiment analysis but of out-of-domain tasks as well, e.g., machine translation. Here, we intend to apply a similar approach but with structured ontological information from medical texts and resources.

2) To produce explainable and private models. It is increasingly important to audit decisions made by classifiers, and to ensure the privacy of personal information in their respective models. For instance, it was recently shown that it is possible to re-identify patients in an anonymized data set using another set of minimally linked data. In order to increase the explainability of our models, we will apply methods such as LIME, text-based 'anchoring', and differential privacy. We will explore whether generative adversarial networks can also synthesize distributions with similar properties.

3) To perform longitudinal classification. An initial goal will be to use the structured data in EMRALD to perform supervised classification of diagnostic codes given clinical notes. Given the longitudinal nature of the data, this will include recurrent neural networks and convolutional neural networks with attention. We will similarly explore semi-supervised learning either by removing some structured data or adding noise to the labels. The long-term aim is to combine these approaches in order to predict various long-term trends and human trajectories.
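
The abstract's remark that term variants and spelling mistakes lead to very sparse contextual matrices can be illustrated with a small sketch. This is not the project's method: the toy notes, the window size, and the helper names `cooccurrence_counts` and `density` are all assumptions made for illustration. The sketch counts which word pairs co-occur within a short window and reports what fraction of all possible pairs is ever observed.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count unordered word pairs that appear within `window` tokens of each other."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for other in tokens[i + 1 : i + 1 + window]:
                if word != other:
                    counts[frozenset((word, other))] += 1
    return counts


def density(sentences, window=2):
    """Return (vocabulary size, fraction of possible word pairs observed at least once)."""
    vocab = {tok for tokens in sentences for tok in tokens}
    possible = len(vocab) * (len(vocab) - 1) // 2
    observed = len(cooccurrence_counts(sentences, window))
    return len(vocab), (observed / possible if possible else 0.0)


# Hypothetical toy notes, invented for this example (not drawn from EMRALD).
clean_notes = [
    ["patient", "denies", "chest", "pain"],
    ["patient", "reports", "chest", "pain"],
    ["patient", "denies", "abdominal", "pain"],
]
# The same notes with an abbreviation ("pt") and two misspellings.
noisy_notes = [
    ["patient", "denies", "chest", "pain"],
    ["pt", "reports", "chest", "pian"],
    ["patiant", "denies", "abdominal", "pain"],
]

clean_vocab, clean_density = density(clean_notes)
noisy_vocab, noisy_density = density(noisy_notes)
print(f"clean: vocab={clean_vocab}, density={clean_density:.2f}")
print(f"noisy: vocab={noisy_vocab}, density={noisy_density:.2f}")
```

In this toy setting, the variants enlarge the vocabulary while each variant is seen in only a few contexts, so a smaller share of the possible word pairs is observed. Normalizing variants before counting would be one way to counter this, though the abstract does not say which approach the project takes.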

Project outputs

Journal articles (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Other publications by Rudzicz, Frank

Validating pertussis data measures using electronic medical record data in Ontario, Canada 1986-2016.

DOI:
10.1016/j.jvacx.2023.100408
Published:
2023-12
Journal:
VACCINE: X
Impact factor:
3.8
Authors:
Mcburney, Shilo H.;Kwong, Jeffrey C.;Brown, Kevin A.;Rudzicz, Frank;Chen, Branson;Candido, Elisa;Crowcroft, Natasha S.
Corresponding author:
Crowcroft, Natasha S.