Improving detection in high-throughput sequencing data with gene/locus-specific models

Basic information

Grant number:
RGPIN-2019-06604
Principal investigator:
Perkins, Theodore
Amount:
$29,900
Host institution:
University of Ottawa
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2021
Funding country:
Canada
Start and end dates:
2021-01-01 to 2022-12-31
Project status:
Completed

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=738996
Keywords:
Improving detection throughput sequencing data

Project abstract

The field of bioinformatics has borrowed from other fields of computer science and mathematics (such as statistics, machine learning, probabilistic modelling, and optimization) to develop sound, general algorithms for analyzing high-throughput genetic and molecular data. However, almost without exception, those algorithms treat every "entity" under consideration the same way. For example, to identify which genes are differentially expressed between two conditions, the same statistical model is applied individually to every gene. When we want to identify regions of the genomic DNA bound by a certain protein, the same statistical model is applied to every genomic locus. Of course, the observed data for each gene or each locus, provided by the high-throughput assay, are different. But the test applied is the same, and for a simple reason: traditionally, bioinformatics has dealt with situations where the number of observations per entity (e.g. two conditions, a handful of time points, or a few tens of patients) is vastly smaller than the number of entities (e.g. tens of thousands of genes or millions of genomic loci). Anything but simple statistical models would be in danger of overfitting the sparse data available. However, the accumulation of massive public databases of personal genomes, epigenomes, and cell-, tissue-, and disease-specific expression profiles means that we now have at our disposal high-throughput data from tens or hundreds of thousands of "conditions". Moreover, statistical analyses of such data reveal a startling fact: not all genes and genomic loci are alike. For example, the expression of some genes is inherently more variable than that of others. Furthermore, our measurements of some genes are noisier and/or more systematically biased than those of other genes. The same holds for genomic loci, where we have varying signal-to-noise ratios in different assays, and different sources and amounts of measurement bias.
The central idea behind this proposal is to use that mass of already-collected data to build and test more sophisticated, machine learning-based models of every single gene or locus in the genome. Further, we can use those models not just for the sake of analyzing that same data, but rather for creating tools to analyze new datasets, whatever their size. By modelling the particular biases and variability of each gene or locus, we can get a more accurate measure of the novelty of new measurements, and more successfully identify truly significant alterations in gene and genome behaviour.