
An automated pipeline for construction of Reference Transcript Datasets (RTD) to enable rapid and accurate gene expression analysis in plant species


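Basic information

Grant number:
BB/S020160/1
Principal investigator:
Runxuan Zhang
Amount:
approx. $403,800
Host institution:
James Hutton Institute
Host institution country:
United Kingdom
Project category:
Research Grant
Fiscal year:
2019
Funding country:
United Kingdom
Project period:
2019 to (no data)
Project status:
Completed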

Source:
https://gtr.ukri.org/projects?ref=BB%2FS020160%2F1
Keywords:
automated pipeline construction Reference Transcript

Project abstract

A gene is the basic physical and functional unit of the genome. Genes are switched on and off at different stages of development and in response to external and internal signals. Protein-coding genes are copied (transcribed) into precursor messenger RNAs (pre-mRNAs), which are then processed in different ways into mRNAs that can be translated into proteins. A goal of biological research is to understand how genes work by measuring changes in gene expression. This is achieved by estimating the abundances of all of the transcripts produced at a particular time or under a particular condition.
The current technology for measuring gene and transcript expression is RNA sequencing (RNA-seq), which sequences millions of transcripts so that RNA levels can be measured on a genome-wide scale. The two main platforms are Illumina, which generates short reads (currently 75 to 250 bp), and PacBio/Nanopore single-molecule sequencing, which produces full-length transcript reads. To measure gene expression, Illumina short reads are often mapped to the genome and assembled into transcripts, which is an inaccurate process. PacBio/Nanopore reads have high sequencing error rates and do not provide sufficient depth of coverage of genes. These technologies continue to advance rapidly in both chemistry and computational analysis, but a combination of the platforms is currently the best approach to generating RNA-seq data.
In addition, the fastest and most accurate programs for computational quantification of transcript and gene expression require a comprehensive catalogue of transcripts, which we call a Reference Transcript Dataset (RTD). Over the last four years, we developed an RTD for Arabidopsis (AtRTD2) based on extensive Illumina short-read sequences. Through a series of iterations, we developed computational methods to identify and retain high-confidence transcripts while removing false transcripts. AtRTD2 greatly increased the accuracy of quantification, allowing, for example, the identification of novel transcription and splicing factors in response to cold.
The challenge now is to translate this knowledge and experience to other plant and crop (and animal) species. Currently, transcript sequence catalogues for most plant species are incomplete and missing large numbers of transcripts, and for those species with RNA-seq data, out-of-date analysis procedures have produced large numbers of false transcripts. From developing AtRTD2, we have a prototype pipeline for constructing an RTD. Its key features are multiple quality control filters that remove mis-assembled transcripts, redundant transcripts, chimaeric transcripts and transcript fragments. These multiple, iterative steps are currently individually coded, and while the pipeline can be used, generating an RTD takes up to 12 months and requires the full-time expertise of a bioinformatician.
We will develop a fully automated pipeline (RTDBox) that can be used by scientists with basic bioinformatics skills or by bioinformaticians with little experience in transcriptomics. The pipeline will also be designed to allow incremental improvement of the RTD through automatic incorporation of any new RNA-seq data (Illumina, PacBio, Nanopore). Within the pipeline, we will develop a transcript evaluation suite (TES) that will provide evaluation metrics to help biologists identify and remove mis-constructed transcripts produced by assembly programs, and to understand the quality and completeness of the RTD generated. All our experience and expertise will be brought together to produce user-friendly software that lets plant scientists measure gene expression more accurately, thereby improving the exploration of biological processes across the globe.
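
The abstract names quality control filters for redundant transcripts and transcript fragments but does not say how they work. The Python sketch below is only an illustration of what two such filters might look like, not the RTDBox or AtRTD2 method. It assumes each transcript is summarised by its genomic span and its ordered intron chain. The `Transcript` class, the helper `is_contiguous_subchain`, the function `filter_transcripts`, and the rules themselves (identical intron chains count as redundant; a shorter chain contained in a longer one counts as a fragment) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Intron = Tuple[int, int]  # (start, end) genomic coordinates of one intron


@dataclass(frozen=True)
class Transcript:
    """Hypothetical minimal transcript record used only for this sketch."""
    name: str
    start: int
    end: int
    introns: Tuple[Intron, ...]  # ordered intron (splice-junction) chain

    @property
    def length(self) -> int:
        return self.end - self.start


def is_contiguous_subchain(short: Tuple[Intron, ...], long: Tuple[Intron, ...]) -> bool:
    """True if `short` is a strictly shorter, contiguous run of introns in `long`."""
    n, m = len(short), len(long)
    if n == 0 or n >= m:
        return False
    return any(long[i:i + n] == short for i in range(m - n + 1))


def filter_transcripts(transcripts: List[Transcript]) -> List[Transcript]:
    """Assumed rules: collapse identical intron chains, then drop fragments."""
    best: Dict[Tuple[Intron, ...], Transcript] = {}
    single_exon: List[Transcript] = []
    for t in transcripts:
        if not t.introns:
            # Single-exon transcripts are passed through untouched in this sketch.
            single_exon.append(t)
            continue
        kept = best.get(t.introns)
        # Redundancy filter: for identical intron chains keep the longest span.
        if kept is None or t.length > kept.length:
            best[t.introns] = t
    unique = list(best.values())

    # Fragment filter: drop a transcript whose intron chain is contained in a
    # longer chain and whose span lies inside that longer transcript.
    result = [
        t for t in unique
        if not any(
            is_contiguous_subchain(t.introns, o.introns)
            and o.start <= t.start and t.end <= o.end
            for o in unique
        )
    ]
    return sorted(result + single_exon, key=lambda t: t.name)


# Toy example with made-up coordinates.
chain = ((200, 300), (400, 500), (600, 700))
example = [
    Transcript("full", 100, 1000, chain),
    Transcript("dup", 150, 950, chain),               # same chain, shorter span
    Transcript("frag", 320, 580, ((400, 500),)),      # fragment of "full"
    Transcript("alt", 100, 1000, ((200, 300), (400, 520), (600, 700))),
    Transcript("mono", 2000, 2500, ()),
]
kept = filter_transcripts(example)
print([t.name for t in kept])  # ['alt', 'full', 'mono']
```

In this toy run, `dup` is collapsed into `full`, `frag` is removed as a fragment, and the alternatively spliced `alt` survives because its intron chain differs. Detecting mis-assembled or chimaeric transcripts would need additional evidence, such as read support at splice junctions, which this sketch does not model.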
