
Developing Machine Learning Models for the Analysis of Splicing Data in Large Heterogeneous Cohorts


Basic Information

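Grant number:
10315802
Principal investigator:
David Wang
Amount:
approx. $46,000
Host institution:
UNIVERSITY OF PENNSYLVANIA
Host institution country:
United States
Project category:
Fiscal year:
2021
Funding country:
United States
Project period:
2021-08-01 to 2024-07-31
Project status:
Completed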

Source:
https://reporter.nih.gov/project-details/10315802
Keywords:
Acute Myelocytic Leukemia; Address; Affect; Aftercare; Algorithms; Alternative Splicing; B-Cell Acute Lymphoblastic Leukemia; Bayesian Modeling; Biological; Blast Cell; Cancer Patient; Caring; Catalogs; Cells; Characteristics; Clinic; Code; Complex; Computer software; Computing Methodologies; Data; Data Set; Detection; Disease; Event; Excision; Follow-Up Studies; Gene Expression; Genes; Genetic; Goals; Hematologic Neoplasms; Heterogeneity; Institution; Letters; Machine Learning; Malignant Neoplasms; Masks; Measures; Methods; Minority; Missense Mutation; Modeling; Modification; Multiomic Data; Mutation; Patients; Pharmaceutical Preparations; Process; Prognostic Marker; Protocols documentation; Quality Control; RNA; RNA Degradation; RNA Splicing; RNA analysis; Relapse; Reproducibility; Resources; Reverse Transcriptase Polymerase Chain Reaction; Sampling; Signal Transduction; Source; Statistical Models; Structure; Techniques; Therapeutic; Time; Tissue Procurements; Training; Validation; Variant; Xenograft procedure; acute care; base; biobank; bioinformatics tool; cell type; clinically relevant; cohort; computerized tools; data integration; disease phenotype; disorder subtype; drug sensitivity; experience; heterogenous data; improved; leukemia; multiple data sources; multiple omics; new therapeutic target; non-Gaussian model; novel; patient subsets; personalized medicine; precision medicine; prognostic tool; response; tool; transcriptome sequencing; transcriptomics; translational impact; unsupervised learning

Project Abstract

Analysis of RNA sequencing (RNASeq) data obtained from large patient cohorts can reveal transcriptomic perturbations that are associated with complex disease and facilitate the identification of disease subtypes. This is typically framed as an unsupervised learning task: discovering latent structure in a matrix of RNASeq-based quantifications of gene expression or local splicing variations (LSVs). However, several factors make analysis of such heterogeneous data challenging. First, such datasets are composed of samples processed at multiple institutions, which might employ different sequencing protocols and quality control steps. This introduces confounding factors into the data, such as inconsistent sample quality or variable cell type proportions, which can hinder detection of true biological signal. Second, in acute myeloid leukemia (AML), mutations in splice factor genes that occur in a subset of patients may alter only a subset of coregulated splicing events. Thus, instead of measuring global similarity between samples based on all transcriptomic features, there is a need to efficiently identify “tiles”, each defined by a subset of samples and splicing events with abnormal signals. Although several algorithms have been proposed for this task, they fail to overcome many of the computational challenges associated with modeling splicing data and are not well suited to handling missing values. To facilitate analysis of heterogeneous splicing datasets by reducing false positive discoveries and boosting true biological signal, we will first develop a model to correct for the effects of RNA degradation and cell type mixtures.
Then, to efficiently identify AML subtypes characterized by splicing events and to account for splicing-specific modeling challenges, we propose CHESSBOARD (Characterizing Heterogeneity of Expression and Splicing by Search for Blocks of Abnormalities and Outliers in RNA Datasets), a nonparametric Bayesian model for unsupervised discovery of tiles. We will apply our models to synthetic datasets and show that CHESSBOARD outperforms several baseline approaches. Next, we will show that it recovers tiles characterized by known and novel splicing aberrations that are reproducible in multiple AML patient cohorts. Finally, we will show that the discovered tiles are correlated with response to therapeutics, pointing to the translational impact of our findings.
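
To make the notion of a "tile" concrete, the following is a minimal Python sketch of the problem setup, not of CHESSBOARD itself. It simulates a samples-by-events matrix of splicing inclusion values with one planted tile and some missing entries, then recovers the tile with a simple robust z-score heuristic rather than the nonparametric Bayesian model the abstract proposes. The function names (make_psi_matrix, robust_outlier_flags, find_tile), the beta distributions, and all thresholds are assumptions chosen for illustration only; none of them come from the project.

```python
import numpy as np


def make_psi_matrix(n_samples=60, n_events=40, tile_samples=12, tile_events=8,
                    missing_rate=0.1, seed=0):
    """Simulate a samples x events inclusion matrix with one planted tile.

    All sizes and distributions are hypothetical. The tile occupies the
    first `tile_samples` rows and the first `tile_events` columns.
    """
    rng = np.random.default_rng(seed)
    # Background: low inclusion (mean about 0.1) for every sample and event.
    psi = rng.beta(2.0, 18.0, size=(n_samples, n_events))
    # Planted tile: a subset of samples shows high inclusion in a subset of events.
    psi[:tile_samples, :tile_events] = rng.beta(
        18.0, 2.0, size=(tile_samples, tile_events))
    # Missing values, e.g. events with too few reads to quantify in a sample.
    psi[rng.random(psi.shape) < missing_rate] = np.nan
    return psi


def robust_outlier_flags(psi, z_threshold=3.5):
    """Flag entries far from the per-event median, ignoring missing values."""
    med = np.nanmedian(psi, axis=0)
    mad = np.nanmedian(np.abs(psi - med), axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        # 0.6745 rescales the MAD so the score is comparable to a z-score.
        z = 0.6745 * (psi - med) / mad
        flags = np.abs(z) > z_threshold
    # Missing entries are never counted as outliers.
    return flags & ~np.isnan(psi)


def find_tile(psi, z_threshold=3.5, event_frac=0.1, sample_frac=0.5):
    """Naive tile search: pick enriched events, then samples abnormal in them."""
    observed = ~np.isnan(psi)
    flags = robust_outlier_flags(psi, z_threshold)
    # Events where an unusually large share of observed samples is abnormal.
    event_score = flags.sum(axis=0) / np.maximum(observed.sum(axis=0), 1)
    events = np.flatnonzero(event_score > event_frac)
    if events.size == 0:
        return np.array([], dtype=int), events
    # Samples that are abnormal in most of the selected events they have data for.
    n_obs = observed[:, events].sum(axis=1)
    sample_score = flags[:, events].sum(axis=1) / np.maximum(n_obs, 1)
    samples = np.flatnonzero(sample_score > sample_frac)
    return samples, events


if __name__ == "__main__":
    psi = make_psi_matrix()
    samples, events = find_tile(psi)
    print("tile samples:", samples.tolist())
    print("tile events:", events.tolist())
```

A heuristic like this only works when the tile is strongly separated from the background and covers a minority of samples. It does not model the confounders named in the abstract (RNA degradation, cell type mixtures), which is part of the motivation for a dedicated model.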
