Collaborative Research: ABI Innovation: Interpretable Machine Learning to Identify Molecular Markers for Complex Phenotypes

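Basic information

Award number:
1759487
Principal investigator:
Su-In Lee
Amount:
$1.4993 million
Host institution:
University of Washington
Host institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2018
Funding country:
United States
Project period:
2018-06-01 to 2024-05-31
Project status:
Completed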

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1759487&HistoricalAwards=false
Keywords:
Collaborative Research ABI Innovation Interpretable

Project abstract

Biologists are now able to gather complete sets of gene expression data and protein concentrations for particular targets from specific tissues. The presence and concentrations of these molecules serve as features when determining a diagnostic pattern for specific states of development or disease. The approach to biomarker identification taken in this research attempts to find a set of features (here, gene expression levels) that best predict an outcome (here, the protein levels occurring in the condition). The identified features, or biomarkers, can help determine the molecular basis for the condition. Unfortunately, false positive biomarkers are very common, as evidenced by low rates of replication in independent data sets and, consequently, by how rarely such markers become useful in applications such as clinical diagnostics. We seek to radically shift the current paradigm in biomarker discovery by resolving fundamental problems with the current approach: we use novel, theoretically well-founded machine learning (ML) methods to learn interpretable models from data, and follow this up with systematic experimental validation in model organisms. Our disease model is Alzheimer's disease (AD), an urgent national and international research priority. Amyloid plaques and neurofibrillary tangles are the hallmarks of AD, and their building blocks are amyloid-beta and tau proteins, respectively. These proteins can be measured accurately in human brain tissue, as can global gene expression values. At present, we lack an understanding of the set of genes that affect the formation of plaques and tangles, or of any protective or pathological responses to these toxic peptides.

Biomarker discovery using high-throughput molecular data (e.g., gene expression data) has significantly advanced our knowledge of molecular biology and genetics.
The current approach attempts to find a set of features (e.g., gene expression levels) that best predict a phenotype, and then uses the selected features, called molecular markers, to determine the molecular basis for the phenotype. However, the low rates of replication in independent data point to three fundamental problems with this approach. First, high dimensionality, hidden variables, and feature correlations create a discrepancy between predictability (i.e., statistical association) and true biological interactions; we need new feature selection criteria so that the model explains phenotypes rather than simply predicting them. Second, complex models (e.g., deep learning or ensemble models) can describe intricate relationships between genes and phenotypes more accurately than simpler, linear models, but they lack interpretability. Third, analyzing observational data without conducting interventional experiments does not prove causal relations. To address these problems, we propose an integrated machine learning methodology for learning interpretable models from data by 1) selecting interpretable features, 2) making interpretable predictions, and 3) validating and refining predictions through interventional experiments. This approach has the following aims:
1. Develop the NEBULA (network-based unsupervised feature learning) framework to learn, from publicly available multi-omic data sets, interpretable features that are likely to provide meaningful phenotype explanations.
2. Develop a unified framework, called SHAP (SHapley Additive exPlanations), to interpret the predictions of complex models by estimating the importance of each feature to a particular prediction.
3. Validate and refine predictions through interventional experiments using high-throughput gene knockdown assays on powerful nematode models of proteotoxicity.
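The idea behind Aim 2, attributing a single prediction to individual features, can be illustrated with exact Shapley values on a toy model. The sketch below is not the SHAP library and not the project's implementation; the function names, the three-"gene" toy model, and the all-zero baseline are assumptions made purely for illustration. It also assumes that a feature is "removed" from a coalition by replacing it with its baseline value, which is only one of several possible conventions.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for one prediction of `model` at input `x`.

    Assumption: features outside a coalition are set to their baseline value.
    The cost is exponential in the number of features, so this is only
    practical for very small inputs.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = [x[j] if j in subset else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                # Marginal contribution of feature i to this coalition
                phi[i] += weight * (model(with_i) - model(without_i))
    return phi


def toy_model(features):
    # Hypothetical predictor: two "genes" interact, a third acts additively.
    g1, g2, g3 = features
    return 2.0 * g1 * g2 + 0.5 * g3


x = [1.0, 1.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

In this toy case the interaction term is split equally between the first two features and the additive term goes entirely to the third, so each attribution is 1.0 up to floating-point rounding, and the attributions sum to the difference between the prediction at `x` and the prediction at the baseline.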
For further information, see the project website at http://suinlee.cs.washington.edu/projects/im3.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project outcomes

Journal articles (3)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

From local explanations to global understanding with explainable AI for trees

DOI:
10.1038/s42256-019-0138-9
Published:
2020-01-01
Journal:
Nature Machine Intelligence
Impact factor:
23.8
Authors:
Lundberg, Scott M.; Erion, Gabriel; Lee, Su-In
Corresponding author:
Lee, Su-In

Unified AI framework to uncover deep interrelationships between gene expression and Alzheimer's disease neuropathologies

DOI:
10.1038/s41467-021-25680-7
Published:
2021-09-10
Journal:
Nature Communications
Impact factor:
16.6
Authors:
Beebe-Wang N; Celik S; Weinberger E; Sturmfels P; De Jager PL; Mostafavi S; Lee SI
Corresponding author:
Lee SI

Improving performance of deep learning models with axiomatic attribution priors and expected gradients

DOI:
10.1038/s42256-021-00343-w
Published:
2021-05-31
Journal:
Nature Machine Intelligence
Impact factor:
23.8
Authors:
Erion, Gabriel; Janizek, Joseph D.; Lee, Su-In
Corresponding author:
Lee, Su-In

Other publications by Su-In Lee

Titanizing on the surface of iron metal foam

DOI:
10.1016/j.tca.2014.02.008
Published:
2014-04-10
Journal:
Research article
Impact factor:
Authors:
Su-In Lee; Jung-Yeul Yun; Tae-Soo Lim; Byoung-Kee Kim; Young-Min Kong; Jei-Pil Wang; Dong-Won Lee
Corresponding author:
Dong-Won Lee

Deep profiling of gene expression across 18 human cancers

DOI:
10.1038/s41551-024-01290-8
Published:
2024-12-17
Journal:
Nature Biomedical Engineering
Impact factor:
26.600
Authors:
Wei Qiu; Ayse B. Dincer; Joseph D. Janizek; Safiye Celik; Mikael J. Pittet; Kamila Naxerova; Su-In Lee
Corresponding author:
Su-In Lee