
Causal graphical methods for high-dimensional heterogeneous biomedical data

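Basic information

Grant number:
10388447
Principal investigator:
Tyler Lovelace
Award amount:
Approx. $46,800
Host institution:
UNIVERSITY OF PITTSBURGH AT PITTSBURGH
Host institution country:
United States
Project category:
Fiscal year:
2022
Funding country:
United States
Project period:
2022-03-21 to 2025-03-20
Project status:
Not yet concluded

Source:
https://reporter.nih.gov/project-details/10388447
Keywords: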
Address; Algorithms; Automobile Driving; Biological; Categories; Cells; Clinical; Complex; Data; Data Analytics; Data Set; Development; Dimensions; Event; Explosion; Generations; Genes; Graph; Heterogeneity; Immunologics; Individual; Intensive Care Units; Intervention; Investigation; Learning; Life; Malignant Neoplasms; Measures; Medical; Medicine; Methodology; Methods; Mining; Modeling; Mortality Determinants; Motivation; Outcome; Patients; Performance; Periodicity; Phenotype; Process; Property; RNA; Research; Research Personnel; Resolution; Skeleton; Structure; System; Testing; The Cancer Genome Atlas; Time; Validation; Ventilator; Work; analytical method; cancer care; causal model; cell type; clinically relevant; cohort; complex data; design; experimental study; flexibility; gene regulatory network; graph learning; high dimensionality; learning algorithm; learning strategy; machine learning method; malignant breast neoplasm; method development; mortality; multiple omics; novel; predictive modeling; prognostic model; single-cell RNA sequencing; tool; vector

Project abstract

In the past decade, there has been an explosion of data collected from biological and biomedical systems, both in terms of type and volume. Mining these high-dimensional, heterogeneous, and often dynamic datasets to make biologically or medically important inferences or develop predictive models requires new sophisticated data analytics methods. New machine learning methods have begun filling this gap, but most of these methods generate “black box” models that lack clear interpretability. Additionally, these methods are associative, and are thus incapable of teasing out the complex cause-effect relationships among features in the dataset. Directed causal graphical models (DCGMs) are a powerful tool for filling this gap. DCGMs, learned from observational datasets, can represent causal relationships between variables. This allows DCGMs to generate hypotheses of mechanisms and construct parsimonious, causally informed predictive models. However, biomedical datasets often have features that make it difficult to construct causal graphical models over the full dataset. Examples include: data type heterogeneity, high dimensionality, multicollinearity, cyclicity, and nonstationarity. To address these problems, I propose to develop methods for learning causal graphs in datasets containing (1) a heterogeneous mixture of continuous, categorical, and censored variables, (2) high dimensionality and multicollinearity, and (3) cyclicity and nonstationarity. In Aim 1, I will develop a new causal discovery algorithm that accommodates continuous, categorical and censored variables (e.g., survival). In Aim 2, I will test and compare various methods for matrix decomposition and dimensionality reduction in their ability to learn a meaningful low-dimensional latent feature space to be used in graph learning methods. In Aim 3, I will develop a new method for causal discovery in dynamic, possibly cyclic, gene regulatory networks at single cell resolution. 
In all cases, testing and validation will be performed on synthetic and real-life publicly available datasets. These methodological improvements constitute important steps forward in the field of causal discovery and they can be utilized together or independently to provide a flexible and powerful platform for analysis of a wide range of biomedical datasets. Once made available, they will enable researchers to make inferences about causal mechanisms, generate hypotheses, and build robust, parsimonious predictive models.
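
The distinction the abstract draws between associative methods and causal graph learning can be illustrated with a minimal sketch. This is not the project's method. It assumes a hypothetical linear-Gaussian chain X → Y → Z with made-up coefficients, and it uses a partial-correlation check, a conditional-independence test of the kind commonly used in constraint-based causal discovery. The function `partial_corr`, the variables `x`, `y`, `z`, and all numeric parameters are illustrative assumptions.

```python
import numpy as np


def partial_corr(a, b, given):
    """Correlation of a and b after linearly regressing out a 1-D variable `given`."""
    # Design matrix with an intercept column and the conditioning variable.
    design = np.column_stack([np.ones_like(given), given])
    # Residuals of a and b after least-squares regression on the design matrix.
    res_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    res_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return float(np.corrcoef(res_a, res_b)[0, 1])


# Hypothetical data-generating chain: X -> Y -> Z (coefficients are assumptions).
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)
z = 0.8 * y + rng.normal(size=n)

# Purely associative view: X and Z are correlated.
marginal_xz = float(np.corrcoef(x, z)[0, 1])
# Graph-oriented view: conditioning on Y removes the X-Z dependence.
partial_xz_given_y = partial_corr(x, z, y)

print(f"corr(X, Z)     = {marginal_xz:.3f}")
print(f"corr(X, Z | Y) = {partial_xz_given_y:.3f}")
```

In this simulated chain, X and Z are clearly correlated, yet their partial correlation given Y is close to zero. A graph-learning method can use that pattern to omit a direct X-Z edge, which a purely associative model would not distinguish.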
