Scalable and Robust Clinical Text De-Identification Tools

Basic information

Project abstract

DESCRIPTION (provided by applicant): Exploiting the full potential of information-rich and rapidly growing repositories of patient clinical text is hampered by the absence of scalable and robust de-identification tools. Clinical text contains protected health information (PHI), and the Health Insurance Portability and Accountability Act (HIPAA) restricts research use of patient information containing PHI to specific, limited, IRB-approved projects. As a result, vast repositories of clinical text remain under-used by internal researchers, and are even less available for transmission to outside collaborators or for centralized processing by state-of-the-art natural language processing (NLP) technologies.

De-identification, the removal of PHI from clinical text, is challenging. Although they have been available for over a decade, commercial automated systems are expensive, require local tailoring, and have not gained widespread market penetration. Manual methods are costly and do not scale, yet continue to be used despite the small amount of residual PHI they leave behind. Open source de-identification tools based on state-of-the-art machine learning technologies can perform at or above the level of manual approaches but also suffer from the residual PHI problem. Current de-identification approaches therefore severely limit the use and mobility of clinical text while still exposing patients to privacy risks. These approaches redact PHI, blacking it out or replacing it with symbols (e.g., "Here for cardiac eval is Mr. **PT_NAME<AA>, a **AGE<60s> yo male with his son Doug ..."). They leave residual PHI ("Doug" in this example) easy for readers to notice, because it remains plainly visible among the prominent redactions.

We developed and pilot tested an alternative approach that we believe addresses the residual PHI problem. Our approach conceals residual PHI rather than trying to eliminate it. We call it the "Hiding In Plain Sight" (HIPS) approach. HIPS replaces all known PHI with "surrogate" PHI (fictional names, ages, etc.) that looks real but does not refer to any actual patient. A HIPS version of the above text is: "Here for cardiac eval is Mr. Jones, a 64 yo male with his son Doug ..." where the name "Jones" and age "64" are fictional surrogates, but the name "Doug" is residual PHI. To a reader, the surrogates and the residual PHI are indistinguishable. This prevents the reader from detecting the latter, avoiding disclosure. Our preliminary studies suggest that HIPS can reduce the risk of disclosure of residual PHI by a factor of 10. This yields overall performance that far surpasses the performance attainable by manual methods, and that we believe is unlikely to be matched by additional incremental improvements in PHI tagging models (i.e., efforts to reduce residual PHI). Our pilot studies indicate IRBs would welcome the HIPS approach if it were shown to be effective through rigorous evaluation.

To expand usage of clinical text and enhance patient privacy, we propose to formalize rules of effective surrogate generation (Aim 1), extend related de-identification confidence scoring methods (Aim 2), and conduct rigorous efficacy testing of HIPS in diverse institutional settings (Aim 3).
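The contrast the abstract draws between redaction and HIPS can be illustrated with a minimal Python sketch. It is not the applicants' implementation. The `PhiSpan` type, the `SURROGATES` table, the helper names, and the "original" name and age in the sample note ("Smith", "67") are all hypothetical, chosen only so the output resembles the abstract's example. The sketch assumes a PHI tagger has already found the patient name and age but missed "Doug".

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class PhiSpan:
    """A tagged stretch of PHI: character offsets plus a PHI type (hypothetical type)."""
    start: int
    end: int
    kind: str  # e.g. "NAME" or "AGE"


# Hypothetical fixed surrogates; a real system would generate varied, realistic values.
SURROGATES = {"NAME": "Jones", "AGE": "64"}


def _replace(text: str, spans: List[PhiSpan], make: Callable[[PhiSpan], str]) -> str:
    """Rebuild the text, swapping each tagged span for make(span)."""
    parts = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        parts.append(text[cursor:span.start])
        parts.append(make(span))
        cursor = span.end
    parts.append(text[cursor:])
    return "".join(parts)


def redact(text: str, spans: List[PhiSpan]) -> str:
    """Traditional approach: replace tagged PHI with a conspicuous symbol."""
    return _replace(text, spans, lambda s: f"**{s.kind}")


def hide_in_plain_sight(text: str, spans: List[PhiSpan]) -> str:
    """HIPS-style approach: replace tagged PHI with realistic surrogates."""
    return _replace(text, spans, lambda s: SURROGATES[s.kind])


def tag(text: str, value: str, kind: str) -> PhiSpan:
    """Stand-in for a PHI tagger: locate a known value in the text."""
    start = text.index(value)
    return PhiSpan(start, start + len(value), kind)


# Made-up note; "Smith" and "67" are invented originals, "Doug" is the PHI the tagger misses.
note = "Here for cardiac eval is Mr. Smith, a 67 yo male with his son Doug"
spans = [tag(note, "Smith", "NAME"), tag(note, "67", "AGE")]

print(redact(note, spans))
# Here for cardiac eval is Mr. **NAME, a **AGE yo male with his son Doug
print(hide_in_plain_sight(note, spans))
# Here for cardiac eval is Mr. Jones, a 64 yo male with his son Doug
```

In the redacted output, "Doug" stands out as the only unmasked name. In the HIPS-style output, a reader cannot tell whether "Jones" or "Doug" is the real value, which is the concealment effect the abstract describes.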

Project outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants by DAVID S. CARRELL

DAT- Implementing routine screening for cannabis and other drug use disorders in primary care: impact on diagnosis and treatment in a randomized pragmatic trial in 22 clinics
  • Grant number: 10237870
  • Fiscal year: 2020
  • Funding amount: $318,800
  • Project category:
DAT- Implementing routine screening for cannabis and other drug use disorders in primary care: impact on diagnosis and treatment in a randomized pragmatic trial in 22 clinics
  • Grant number: 9884229
  • Fiscal year: 2020
  • Funding amount: $318,800
  • Project category:
Scalable and Robust Clinical Text De-Identification Tools
  • Grant number: 8722030
  • Fiscal year: 2012
  • Funding amount: $318,800
  • Project category:
Natural Language Processing for Cancer Research Network Surveillance Studies
  • Grant number: 7944035
  • Fiscal year: 2009
  • Funding amount: $318,800
  • Project category:
Natural Language Processing for Cancer Research Network Surveillance Studies
  • Grant number: 7839706
  • Fiscal year: 2009
  • Funding amount: $318,800
  • Project category:

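Similar grants from the National Natural Science Foundation of China (NSFC)

Targeted delivery of carbon monoxide to regulate the AGE-RAGE cascade and promote diabetic wound healing
  • Grant number: JCZRQN202500010
  • Approval year: 2025
  • Funding amount: CNY 0
  • Project category: Provincial/municipal project
p-Coumaric acid inhibits the AGE-RAGE-Ang-1 pathway to improve impaired hippocampal angiogenesis and exert anti-Alzheimer's disease effects
  • Grant number: 2025JJ70209
  • Approval year: 2025
  • Funding amount: CNY 0
  • Project category: Provincial/municipal project
Role and molecular mechanism of the AGE-RAGE pathway in regulating fibrosis progression in chronic pancreatitis
  • Grant number:
  • Approval year: 2024
  • Funding amount: CNY 0
  • Project category: General Program
Sweet tea inhibits the AGE-RAGE pathway, enhancing synaptic plasticity and improving depression-like behavior in mice
  • Grant number: 2023JJ50274
  • Approval year: 2023
  • Funding amount: CNY 0
  • Project category: Provincial/municipal project
The Mongolian medicine Eerdun-Wurile basic formula regulates the AGE-RAGE signaling pathway to improve postoperative cognitive dysfunction
  • Grant number:
  • Approval year: 2022
  • Funding amount: CNY 330,000
  • Project category: Regional Science Fund Project
Regulatory role and mechanism of LncRNA GAS5 on genes of the AGE-RAGE signaling pathway in type 2 diabetic atherosclerosis
  • Grant number: n/a
  • Approval year: 2022
  • Funding amount: CNY 100,000
  • Project category: Provincial/municipal project
Building a probe-omics method around the GLP1-Arginine-AGE/RAGE axis to explore the mechanism by which Da Chai Hu Tang treats different diseases with the same therapy
  • Grant number: 81973577
  • Approval year: 2019
  • Funding amount: CNY 550,000
  • Project category: General Program
Association between polymorphisms in microRNA-encoding genes of the AGE/RAGE pathway and coronary heart disease complicating type 2 diabetes
  • Grant number: 81602908
  • Approval year: 2016
  • Funding amount: CNY 180,000
  • Project category: Young Scientists Fund
Mechanism by which hyperglycemia activates the synovial AGE-RAGE-PKC axis and causes susceptibility to osteoarthritis
  • Grant number: 81501928
  • Approval year: 2015
  • Funding amount: CNY 180,000
  • Project category: Young Scientists Fund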

Similar overseas grants

Collaborative Research: Resolving the LGM ventilation age conundrum: New radiocarbon records from high sedimentation rate sites in the deep western Pacific
  • Grant number: 2341426
  • Fiscal year: 2024
  • Funding amount: $318,800
  • Project category: Continuing Grant
Collaborative Research: Resolving the LGM ventilation age conundrum: New radiocarbon records from high sedimentation rate sites in the deep western Pacific
  • Grant number: 2341424
  • Fiscal year: 2024
  • Funding amount: $318,800
  • Project category: Continuing Grant
PROTEMO: Emotional Dynamics Of Protective Policies In An Age Of Insecurity
  • Grant number: 10108433
  • Fiscal year: 2024
  • Funding amount: $318,800
  • Project category: EU-Funded
The role of dietary and blood proteins in the prevention and development of major age-related diseases
  • Grant number: MR/X032809/1
  • Fiscal year: 2024
  • Funding amount: $318,800
  • Project category: Fellowship
Atomic Anxiety in the New Nuclear Age: How Can Arms Control and Disarmament Reduce the Risk of Nuclear War?
  • Grant number: MR/X034690/1
  • Fiscal year: 2024
  • Funding amount: $318,800
  • Project category: Fellowship
Walkability and health-related quality of life in Age-Friendly Cities (AFCs) across Japan and the Asia-Pacific
  • Grant number: 24K13490
  • Fiscal year: 2024
  • Funding amount: $318,800
  • Project category: Grant-in-Aid for Scientific Research (C)
Discovering the (R)Evolution of EurAsian Steppe Metallurgy: Social and environmental impact of the Bronze Age steppes metal-driven economy
  • Grant number: EP/Z00022X/1
  • Fiscal year: 2024
  • Funding amount: $318,800
  • Project category: Research Grant
ICF: Neutrophils and cellular senescence: A vicious circle promoting age-related disease.
  • Grant number: MR/Y003365/1
  • Fiscal year: 2024
  • Funding amount: $318,800
  • Project category: Research Grant
Doctoral Dissertation Research: Effects of age of acquisition in emerging sign languages
  • Grant number: 2335955
  • Fiscal year: 2024
  • Funding amount: $318,800
  • Project category: Standard Grant
Shaping Competition in the Digital Age (SCiDA) - Principles, tools and institutions of digital regulation in the UK, Germany and the EU
  • Grant number: AH/Y007549/1
  • Fiscal year: 2024
  • Funding amount: $318,800
  • Project category: Research Grant