
Opening the Black Box of Machine Learning Models


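Basic Information

Award number:
10020414
Principal investigator:
Su-In Lee
Amount:
$388,800
Host institution:
UNIVERSITY OF WASHINGTON
Host institution country:
United States
Project category:
Fiscal year:
2018
Funding country:
United States
Project period:
2018-07-01 to 2023-06-30
Project status:
Completed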

Project Summary

Biomedical data are growing rapidly in quantity, scope, and generality, expanding opportunities to discover novel biological processes and clinically translatable outcomes. Machine learning (ML), a key technology in modern biology for handling this growth, aims to infer meaningful interactions among variables by learning their statistical relationships from data consisting of measurements of variables across samples. Accurate inference of such interactions from big biological data can lead to novel biological discoveries, therapeutic targets, and predictive models for patient outcomes. However, a greatly enlarged hypothesis space, complex dependencies among variables, and “black-box” ML models pose difficult open challenges. To meet these challenges, we have been developing innovative, rigorous, and principled ML techniques to infer reliable, accurate, and interpretable statistical relationships in various kinds of biological network inference problems, pushing the boundaries of both ML and biology. Fundamental limitations of current ML techniques leave many opportunities to translate inferred statistical relationships into biological knowledge, as exemplified by the standard biomarker discovery problem, which is extremely important for precision medicine. Biomarker discovery using high-throughput molecular data (e.g., gene expression data) has significantly advanced our knowledge of molecular biology and genetics. The current approach attempts to find a set of features (e.g., gene expression levels) that best predicts a phenotype and uses the selected features, or molecular markers, to determine the molecular basis for the phenotype. However, the low rates of successful replication in independent data and of translation into clinical practice point to three challenges posed by the current ML approach.
First, high dimensionality, hidden variables, and feature correlations create a discrepancy between predictability (i.e., statistical associations) and true biological interactions; we need new feature selection criteria so that the model explains phenotypes rather than simply predicting them. Second, complex models (e.g., deep learning or ensemble models) can describe intricate relationships between genes and phenotypes more accurately than simpler, linear models, but they lack interpretability. Third, analyzing observational data without conducting interventional experiments does not prove causal relations. To address these problems, we propose an integrated machine learning methodology for learning interpretable models from data that will: 1) select interpretable features likely to provide meaningful phenotype explanations, 2) make interpretable predictions by estimating the importance of each feature to a prediction, and 3) iteratively validate and refine predictions through interventional experiments. For each challenge, we will develop a generalizable ML framework that focuses on a different aspect of model interpretability and will therefore be applicable to formerly intractable, high-impact healthcare problems. We will also demonstrate the effectiveness of each ML framework on a wide range of topics, from basic science to disease biology to bedside applications.
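
To make aim 2 concrete, "estimating the importance of each feature to a prediction" can be illustrated with a generic attribution scheme. The sketch below computes exact Shapley values against a baseline input by enumerating feature subsets. It is an illustrative assumption, not the method proposed in this project: the function names, the toy model, the gene names, and the all-zero baseline are all hypothetical, and brute-force enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Attribute predict(x) - predict(baseline) to individual features.

    Features outside a coalition are set to their baseline value
    (an assumption of this sketch). Cost grows as 2**len(x).
    """
    n = len(x)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                without_i = [x[j] if j in present else baseline[j] for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                # Marginal contribution of feature i to this coalition
                values[i] += weight * (predict(with_i) - predict(without_i))
    return values


def toy_model(features):
    """Hypothetical phenotype score with a gene_a x gene_b interaction."""
    gene_a, gene_b, gene_c = features
    return 2.0 * gene_a + gene_a * gene_b + 0.0 * gene_c


if __name__ == "__main__":
    sample = [1.0, 2.0, 5.0]    # hypothetical expression levels
    baseline = [0.0, 0.0, 0.0]  # hypothetical reference sample
    for name, value in zip(["gene_a", "gene_b", "gene_c"],
                           shapley_values(toy_model, sample, baseline)):
        print(f"{name}: {value:.3f}")
```

For this toy model the attributions sum to the difference between the prediction for the sample and the prediction for the baseline. The interaction term is split equally between gene_a and gene_b, and gene_c, which the model ignores, receives zero importance.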
