
Development of reference-free algorithms for low coverage RNA-Seq characterization of cell states


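Basic information

Grant number:
RGPIN-2022-04260
Principal investigator:
Lemieux, Sébastien
Amount:
approx. $24,800
Host institution:
Université de Montréal
Host institution country:
Canada
Program type:
Discovery Grants Program - Individual
Fiscal year:
2022
Funding country:
Canada
Start and end dates:
2022-01-01 to 2023-12-31
Project status:
Completed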

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=757501
Keywords:
Development reference free algorithms low

Project abstract

Since the early days of microarrays, transcriptomics has played a central role in modern molecular biology by providing a rich and dynamic view of active molecular processes within cells. RNA-Seq is now a routine methodology accessible to most academic laboratories and provides a far richer view of the transcriptome than microarrays. Unfortunately, analysis pipelines for RNA-Seq discard valuable observations present in the original data, such as unannotated transcripts, gene splicing events, or gene rearrangements. Using the resulting biased expression profiles to train artificial intelligence algorithms such as deep neural networks prevents them from reaching their full potential.

We propose a drastic reorganization of RNA-Seq data analysis around the use of k-mer count tables (KCTs) as purely data-driven summaries. Because k-mers are short, fixed-length sequences, each associated with an expression value, they are well suited as input to deep neural networks. This representation retains a much wider range of features present in the transcriptome and could readily be applied to organisms whose genome is unannotated. The main challenge brought by this representation is its size, which easily reaches tens of millions of k-mers for a single experiment.

Work on this program will proceed on three fronts. First, we will master the intricacies of quantitatively representing transcriptomes using k-mer count tables. Of particular interest will be determining the optimal k-mer length and identifying an appropriate normalization to account for varying sequencing depth. This first step will be designed to accommodate datasets of several thousand samples. Second, we will extend a neural network-based algorithm developed in our laboratory, the factorized embedding, to allow k-mer-based representations to be used as input. This algorithm returns a short numerical vector that summarizes, as well as possible, all quantitative observations presented at the training stage. Third, we will use these numerical summaries as input to a second layer of deep neural networks that we will design and train to make useful predictions on the represented samples. Typically, these predictions will be phenotypes of interest of varying levels of complexity. Both the factorized embeddings and these neural networks will be trained on large, well-established public RNA-Seq datasets.

This program will open a new perspective on the analysis of RNA-Seq data, yielding high-performance, open-source software tools for the transcriptomics user community. These advances should be particularly transformative in fields studying less genetically characterized organisms. As we expect to confirm the sufficiency of very low depth RNA-Seq, this program will open applications where RNA-Seq has been deemed unaffordable.
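
To make the k-mer count table idea concrete, here is a minimal Python sketch of how such a table might be built from raw reads and then scaled for sequencing depth. It is illustrative only and is not the project's software: the function names `count_kmers` and `normalize_per_million`, the toy reads, the choice of k = 4, and the counts-per-million scaling are all assumptions made for this example. The abstract itself leaves both the optimal k-mer length and the appropriate normalization as open questions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across a collection of reads.

    Hypothetical helper for illustration; k-mers containing the
    ambiguous base 'N' are skipped.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def normalize_per_million(counts):
    """Scale raw counts to counts per million k-mers.

    This is one assumed way to correct for sequencing depth,
    not the normalization chosen by the project.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: c * 1_000_000 / total for kmer, c in counts.items()}


# Toy input: three short reads, one containing an ambiguous base.
reads = ["ACGTACGT", "CGTACGTA", "ACGTNACG"]
kct = count_kmers(reads, k=4)
normalized = normalize_per_million(kct)

for kmer, value in sorted(normalized.items()):
    print(kmer, kct[kmer], round(value, 1))
```

Because the table is keyed by sequence rather than by annotated gene, nothing in this sketch depends on a reference genome. It also shows why size is the main challenge: the number of distinct keys grows with both k and the diversity of the sample.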
