Applying Large Language Models to Accelerate Abstraction of Cancer Pathology Reports for Cancer Registry (LLMs for Unstructured Data Extraction)
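Basic information
- Grant number: 10890243
- Principal investigator:
- Amount: $300,000
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year: 1998
- Funding country: United States
- Project period: 1998-02-18 to 2027-01-31
- Project status: Ongoing
- Source: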
- Keywords: Acceleration; Address; Advanced Malignant Neoplasm; Architecture; Attention; Breast; Cancer Center; Certification; Characteristics; Classification; Clinical; Clinical Data; Clinical Trials; Code; Complex; Computer software; Data; Data Element; Data Set; Development; Diagnostic; Ensure; Family; Fostering; Foundations; Goals; Handedness; Histology; Institution; Instruction; International Statistical Classification of Diseases and Related Health Problems, Tenth Revision (ICD-10); Knowledge; Label; Language; Language Pathology; Length; Lesion; Link; Location; Malignant Neoplasms; Manuals; Methodology; Methods; Microscopic; Modeling; Natural Language Processing; Nomenclature; Oncology; Pathologist; Pathology; Pathology Report; Performance; Physicians; Play; Population; Process; Prognosis; PubMed; Rare Diseases; Reporting; Research; Resources; Role; SNOMED Clinical Terms; Selection for Treatments; Site; Solid Neoplasm; Source; Stains; Standardization; Stomach; Structure; Supervision; Techniques; Terminology; Testing; Tissue Sample; Training; Variant; Work; anticancer research; cancer prevention; cancer therapy; cancer type; convolutional neural network; cost effective; deep learning; ethnic minority; experience; gender minority; impression; improved; innovation; malignant breast neoplasm; malignant stomach neoplasm; multitask; neoplasm registry; open source; operation; phrases; prognostic; racial minority; rare cancer; research and development; response; risk stratification; screening; statistical and machine learning; success; text searching; transfer learning; tumor; tumor diagnosis; unstructured data; vector
Project abstract
Pathology reports, containing critical information on tissue samples and lesions, play a significant role in
determining cancer treatment selection, prognosis, risk stratification, and clinical trial screening. Yet, manually
extracting tumor characteristics from these unstructured or semi-structured reports is a complex, laborious
process. Recent advances in Natural Language Processing (NLP) based on deep learning show
promise. Although Bidirectional Encoder Representations from Transformers (BERT) has achieved
notable results in various NLP tasks, its application in pathology is constrained by its limited
input length. Our recent study addressed this by applying transfer learning to a BERT-based model on increasingly
complex knowledge sources, including Wikipedia, PubMed, MIMIC-III, and Moffitt institutional pathology
reports. This language model was further fine-tuned to identify site, histology, and associated ICD-O-3 codes
from pathology reports. Despite the promising preliminary results, our pilot work focused on extractive
question answering for single primary solid tumor diagnoses, overlooking the rich terminology and variation of
pathology language.
Our long-term goal is to employ Large Language Models (LLMs) to extract information from all types of clinical
notes, assisting institutional certified tumor registrars in data abstraction for the Cancer Registry. In this work,
we focus specifically on pathology reports and propose to train LLMs on 349,544 institutional pathology
reports to identify five key cancer data elements: primary site, histology, stage, grade, and laterality. The study
will focus on a common cancer (breast) and a rare cancer (gastric).
We will leverage existing LLMs pretrained on large public corpora, retrain them on institutional pathology
reports, and finally fine-tune them to predict specific cancer data elements.
We pursue two specific aims. Aim 1: predict breast cancer data elements by abstractive question answering
using the existing caBERTnet architecture (Aim 1a) and by a prompt-based fine-tuning technique (Aim 1b). Aim 2:
apply zero-shot inference (Aim 2a) and soft-prompt tuning (Aim 2b) to these fine-tuned models to predict
gastric cancer data elements. This proposal is innovative in using LLMs to identify key cancer data elements in
real-world settings, and it has broad impact by accelerating research, streamlining cancer registry operations,
and fostering the development of effective cancer prevention strategies and treatments.
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by John L. Cleveland
Myc rescue of a mutant CSF-1 receptor impaired in mitogenic signalling
- DOI: 10.1038/353361a0
- Published: 1991-09-26
- Journal:
- Impact factor: 48.500
- Authors: Marline F. Roussel; John L. Cleveland; Sheila A. Shurtleff; Charles J. Sherr
- Corresponding author: Charles J. Sherr
Oncogenes: clinical relevance.
- DOI: 10.1007/978-3-642-72624-8_97
- Published: 1987
- Journal:
- Impact factor: 0
- Authors: Ulf R. Rapp; Stephen M. Storm; John L. Cleveland
- Corresponding author: John L. Cleveland
A radical approach to treatment
- DOI: 10.1038/35030277
- Published: 2000-09-21
- Journal:
- Impact factor: 48.500
- Authors: John L. Cleveland; Michael B. Kastan
- Corresponding author: Michael B. Kastan
raf family serine/threonine protein kinases in mitogen signal transduction.
- DOI:
- Published: 1988
- Journal:
- Impact factor: 0
- Authors: Ulf R. Rapp; Gisela Heidecker; Mahmoud Huleihel; John L. Cleveland; W. C. Choi; T. Pawson; James N. Ihle; W. Anderson
- Corresponding author: W. Anderson
Activation of Apoptosis Associated With Enforced Myc Expression in Myeloid Progenitor Cells Is Dominant to the Suppression of Apoptosis by Interleukin-3 or Erythropoietin
- DOI: 10.1182/blood.v82.7.2079.2079
- Published: 1993-10-01
- Journal:
- Impact factor:
- Authors: David S. Askew; James N. Ihle; John L. Cleveland
- Corresponding author: John L. Cleveland
Other grants by John L. Cleveland
New Therapeutic Vulnerabilities for Aggressive B-Cell Lymphoma
- Grant number: 10153731
- Fiscal year: 2020
- Funding amount: $300,000
- Project category:
New Therapeutic Vulnerabilities for Aggressive B-Cell Lymphoma
- Grant number: 10405450
- Fiscal year: 2020
- Funding amount: $300,000
- Project category:
New Therapeutic Vulnerabilities for Aggressive B-Cell Lymphoma
- Grant number: 10653834
- Fiscal year: 2020
- Funding amount: $300,000
- Project category:
Epigenetic Regulation of Drug Resistance to ABT-199 in B-cell Malignancies
- Grant number: 9904591
- Fiscal year: 2019
- Funding amount: $300,000
- Project category:
Therapeutic Targeting of Casein Kinase-1-delta in Primary and Metastatic Breast Cancer
- Grant number: 10524031
- Fiscal year: 2018
- Funding amount: $300,000
- Project category:
Therapeutic Targeting of Casein Kinase-1-delta in Primary and Metastatic Breast Cancer
- Grant number: 9710619
- Fiscal year: 2018
- Funding amount: $300,000
- Project category:
Therapeutic Targeting of Casein Kinase-1-delta in Primary and Metastatic Breast Cancer
- Grant number: 10064576
- Fiscal year: 2018
- Funding amount: $300,000
- Project category:
Similar overseas grants
Rational design of rapidly translatable, highly antigenic and novel recombinant immunogens to address deficiencies of current snakebite treatments
- Grant number: MR/S03398X/2
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: Fellowship
CAREER: FEAST (Food Ecosystems And circularity for Sustainable Transformation) framework to address Hidden Hunger
- Grant number: 2338423
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: Continuing Grant
Re-thinking drug nanocrystals as highly loaded vectors to address key unmet therapeutic challenges
- Grant number: EP/Y001486/1
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: Research Grant
Metrology to address ion suppression in multimodal mass spectrometry imaging with application in oncology
- Grant number: MR/X03657X/1
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: Fellowship
CRII: SHF: A Novel Address Translation Architecture for Virtualized Clouds
- Grant number: 2348066
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: Standard Grant
The Abundance Project: Enhancing Cultural & Green Inclusion in Social Prescribing in Southwest London to Address Ethnic Inequalities in Mental Health
- Grant number: AH/Z505481/1
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: Research Grant
ERAMET - Ecosystem for rapid adoption of modelling and simulation METhods to address regulatory needs in the development of orphan and paediatric medicines
- Grant number: 10107647
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: EU-Funded
BIORETS: Convergence Research Experiences for Teachers in Synthetic and Systems Biology to Address Challenges in Food, Health, Energy, and Environment
- Grant number: 2341402
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: Standard Grant
Ecosystem for rapid adoption of modelling and simulation METhods to address regulatory needs in the development of orphan and paediatric medicines
- Grant number: 10106221
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: EU-Funded
Recite: Building Research by Communities to Address Inequities through Expression
- Grant number: AH/Z505341/1
- Fiscal year: 2024
- Funding amount: $300,000
- Project category: Research Grant