DOCKET: accelerating knowledge extraction from biomedical data sets
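Basic information
- Grant number: 10548024
- Principal investigator:
- Amount: $609,100
- Host institution:
- Host institution country: United States
- Project type:
- Fiscal year: 2020
- Funding country: United States
- Period: 2020-01-24 to 2022-11-30
- Project status: Completed
- Source:
- Keywords:
Project abstract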
Component type: This Knowledge Provider project will continue and significantly extend work
done by the Translator Consortium Blue Team, focusing on deriving knowledge from real-world
data through complex analytic workflows, integrated into the Translator Knowledge Graph, and
served via tools like Big GIM and the Translator Standard API.
The problem: We aim to solve the “first mile” problem of translational research: how to
integrate the multitude of dynamic small-to-large data sets that have been produced by the
research and clinical communities, but that are in different locations, processed in different
ways, and in a variety of formats that may not be mutually interoperable. Integrating these data
sets requires significant manual work downloading, reformatting, parsing, indexing and
analyzing each data set in turn. The technical and ethical challenges of accessing diverse
collections of big data, efficiently selecting information relevant to different users’ interests, and
extracting the underlying knowledge are problems that remain unsolved. Here, we propose to
leverage lessons distilled from our previous and ongoing big data analysis projects to develop a
highly automated tool for removing these bottlenecks, enabling researchers to analyze and
integrate many valuable data sets with ease and efficiency, and making the data FAIR [1].
Plan: (AIM 1) We will analyze and extract knowledge from rich real-world biomedical data sets
(listed in the Resources page) in the domains of wellness, cancer, and large-scale clinical
records. (AIM 2) We will formalize methods from Aim 1 to develop DOCKET, a novel tool for
onboarding and integrating data from multiple domains. (AIM 3) We will work with other teams
to adapt DOCKET to additional knowledge domains. ■ The DOCKET tool will offer 3 modules:
(1) DOCKET Overview: Analysis of, and knowledge extraction from, an individual data set. (2)
DOCKET Compare: Comparing versions of the same data set to compute confidence values,
and comparing different data sets to find commonalities. (3) DOCKET Integrate: Deriving
knowledge through integrating different data sets. ■ Researchers will be able to parameterize
these functions, resolve inconsistencies, and derive knowledge through the command line,
Jupyter notebooks, or other interfaces as specified by Translator Standards. ■ The outcome will
be a collection of nodes and edges, richly annotated with context, provenance and confidence
levels, ready for incorporation into the Translator Knowledge Graph (TKG). ■ All analyses and
derived knowledge will be stored in standardized formats, enabling querying through the
Reasoner Std API and ingestion into downstream AI-assisted machine learning. ■ Example
questions this will allow us to address include: (Wellness) Which clinical analytes, metabolites,
proteins, microbiome taxa, etc. are significantly correlated, and which changing analytes predict
transition to which disease? [2,3] (Cancer) Which gene mutations in any of X pathways are
associated with sensitivity or resistance to any of Y drugs, in cell lines from Z tumor types? (All
data sets) Which data set entities are similar to this one? Are there significant clusters? What
distinguishes between the clusters? What significant correlations of attributes can be observed?
How can this set of entities be expanded by adding similar ones? How do these N versions of
this data set differ, and how stable is each knowledge edge as the data set changes over time?
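As a rough illustration of the Wellness question above, the sketch below turns strongly correlated columns of a table into annotated edges. It is not DOCKET's implementation: the function name `correlation_edges`, the `correlated_with` predicate, the edge fields, the toy data, and the use of an absolute Pearson r threshold as a stand-in for a significance test and a confidence level are all assumptions made for this example.

```python
import itertools

import pandas as pd


def correlation_edges(df, source, min_abs_r=0.8):
    """Return one edge dict per column pair whose |Pearson r| >= min_abs_r.

    Hypothetical helper: |r| is used as a placeholder confidence value,
    and `source` is recorded as a minimal provenance annotation.
    """
    corr = df.corr(method="pearson")
    edges = []
    for a, b in itertools.combinations(corr.columns, 2):
        r = corr.loc[a, b]
        if abs(r) >= min_abs_r:
            edges.append({
                "subject": a,
                "predicate": "correlated_with",  # assumed label, not a standard term
                "object": b,
                "confidence": round(abs(float(r)), 3),
                "provenance": source,
            })
    return edges


# Toy measurements, invented for illustration only.
toy = pd.DataFrame({
    "glucose": [1.0, 2.0, 3.0, 4.0, 5.0],
    "hba1c": [2.1, 3.9, 6.2, 8.1, 9.8],
    "vitamin_d": [5.0, 1.0, 4.0, 2.0, 3.0],
})
edges = correlation_edges(toy, source="toy_wellness_v1")
print(edges)
```

A real analysis would need a proper significance test with multiple-testing correction, and would map column names onto standard ontology terms before the edges could be incorporated into a knowledge graph.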
Collaboration strengths: Our team has extensive experience with biomedical and domain-agnostic
data analytics, integrating multiple relevant data types: omics, clinical measurements
and electronic health records (EHRs). We have participated in large collaborative consortia and
have subject matter experts willing to advise on proper data interpretation. Our application
synergizes with those of other Translator teams (see Letters of Collaboration).
Challenges: Data can come in a bewildering diversity of formats. Our solution will be modular,
will address the most common formats first, and will leverage established technologies like
DataFrames and importers (like pandas.io) where possible. Mapping nodes and edge types
onto standard ontologies is crucial for knowledge integration; we will collaborate with the
Standards component to maximize success.
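One way to picture the modular, common-formats-first approach is a small registry that maps a file suffix to an existing pandas reader. This is a minimal sketch under stated assumptions, not the project's design: the `READERS` table and the `load_table` function are hypothetical names, and only `pd.read_csv` and `pd.read_json` are real pandas calls.

```python
import io

import pandas as pd

# Hypothetical registry: one entry per supported format, most common first.
READERS = {
    ".csv": lambda src: pd.read_csv(src),
    ".tsv": lambda src: pd.read_csv(src, sep="\t"),
    ".json": lambda src: pd.read_json(src),
}


def load_table(source, suffix):
    """Load `source` (a path or file-like object) with the reader for `suffix`."""
    try:
        reader = READERS[suffix.lower()]
    except KeyError:
        raise ValueError(f"no importer registered for {suffix!r}") from None
    return reader(source)


# The same toy table in two formats loads into equal DataFrames.
csv_df = load_table(io.StringIO("id,value\na,1\nb,2\n"), ".csv")
tsv_df = load_table(io.StringIO("id\tvalue\na\t1\nb\t2\n"), ".tsv")
print(csv_df.equals(tsv_df))
```

Supporting a new format then means adding one entry to the registry, leaving the downstream analysis code unchanged.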
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by Gwênlyn Glusman
DOCKET: accelerating knowledge extraction from biomedical data sets
- Grant number: 10057127
- Fiscal year: 2020
- Amount: $609,100
- Project type:
DOCKET: accelerating knowledge extraction from biomedical data sets
- Grant number: 10330627
- Fiscal year: 2020
- Amount: $609,100
- Project type:
DOCKET: accelerating knowledge extraction from biomedical data sets
- Grant number: 10706750
- Fiscal year: 2020
- Amount: $609,100
- Project type:
Biomedical Data Translator Technical Feasibility Assessment and Architecture Design
- Grant number: 9338977
- Fiscal year: 2016
- Amount: $609,100
- Project type:
Biomedical Data Translator Technical Feasibility Assessment and Architecture Design
- Grant number: 9486059
- Fiscal year: 2016
- Amount: $609,100
- Project type:
Similar overseas grants
CAREER: Towards Open World Event Knowledge Extraction with Weak Supervision
- Grant number: 2238940
- Fiscal year: 2023
- Amount: $609,100
- Project type: Continuing Grant
Research on clinical knowledge extraction infrastructure using real-world data derived from electronic medical records
- Grant number: 23K17001
- Fiscal year: 2023
- Amount: $609,100
- Project type: Grant-in-Aid for Early-Career Scientists
Collaborative Research: CISE-MSI: DP: IIS: Event Detection and Knowledge Extraction via Learning and Causality Analysis for Resilience Emergency Response
- Grant number: 2219615
- Fiscal year: 2023
- Amount: $609,100
- Project type: Standard Grant
Collaborative Research: CISE-MSI: DP: IIS: Event Detection and Knowledge Extraction via Learning and Causality Analysis for Resilience Emergency Response
- Grant number: 2219614
- Fiscal year: 2023
- Amount: $609,100
- Project type: Standard Grant
CAREER: Knowledge Extraction and Discovery from Massive Text Corpora via Extremely Weak Supervision
- Grant number: 2239440
- Fiscal year: 2023
- Amount: $609,100
- Project type: Continuing Grant
A new approach for traffic data management and modeling that combines storage efficiency and immediate knowledge extraction
- Grant number: 23K17800
- Fiscal year: 2023
- Amount: $609,100
- Project type: Grant-in-Aid for Challenging Research (Exploratory)
A real-time system for data streaming and knowledge extraction on mobile devices
- Grant number: DDG-2019-05756
- Fiscal year: 2021
- Amount: $609,100
- Project type: Discovery Development Grant
Accelerating medicine development timelines through new approaches in knowledge extraction from diverse biological data sets
- Grant number: MR/W003996/1
- Fiscal year: 2021
- Amount: $609,100
- Project type: Research Grant
Revamping Real Estate investments with a Neural Network pipeline for image recognition and knowledge extraction from floor plans and planning applications
- Grant number: 10004707
- Fiscal year: 2021
- Amount: $609,100
- Project type: Collaborative R&D
Knowledge Extraction via Learning Processes and Data Models with Imprecision
- Grant number: RGPIN-2017-06245
- Fiscal year: 2021
- Amount: $609,100
- Project type: Discovery Grants Program - Individual