A Modular Framework for Accurate, Efficient, and Reproducible Analysis of RNA-Seq Data

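Basic Information

Grant number:
10238765
Principal investigator:
Michael Isaiah Love
Award amount:
$295 thousand
Host institution:
UNIV OF MARYLAND, COLLEGE PARK
Host institution country:
United States
Project category:
Fiscal year:
2020
Funding country:
United States
Project period:
2020-03-12 to 2023-06-30
Project status:
Completed

Source:
https://reporter.nih.gov/project-details/10238765
Keywords: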
Address Adopted Adoption Algorithms Alleles Archives Area Attention Biological Biological Assay Biomedical Research Characteristics Communities Data Data Set Databases Development Disease Event Follow-Up Studies Gene Expression Profiling Generations Genes Genetic Genome Genomics Goals Health Human Hybrids Infrastructure Knowledge Lead Location Measurement Metadata Methods Modeling Nucleotides Organism Phenotype Process Protein Isoforms RNA RNA Editing RNA analysis Reporting Reproducibility Reproducibility of Results Research Personnel Resources Salmon Sampling Science Sequence Alignment Source Speed Statistical Data Interpretation Testing Time Transcript Uncertainty Variant Vision Visualization Visualization software analysis pipeline computational pipelines cryptography design differential expression experimental study human error improved light weight task analysis tool transcriptome transcriptome sequencing transcriptomics wasting

Project Abstract

PROJECT SUMMARY / ABSTRACT We propose to develop improved, modular pipelines for more accurate and reproducible RNA-seq analyses. RNA-seq experiments are widely used in biological and biomedical sciences to determine the expression level of all genes and isoforms across multiple samples. Raw RNA-seq data must be pre-processed to determine abundances of RNA molecules. State-of-the-art tools for quantifying RNA abundances are fast and efficient, model and correct for common technical biases, and provide estimates of the uncertainty of the abundances. Downstream tools for visualization and statistical testing of abundance ideally should incorporate uncertainty of abundance estimates from the quantification step, take into account the sampling variability inherent in observations in all sequencing experiments, and estimate, for each transcript, the underlying biological variation in abundances across samples. While isolated tools fulfill a subset of the above characteristics, we propose to develop a pipeline which addresses all of these, while at the same time leveraging the powerful existing infrastructure for gene expression analysis. Our modular approach to improving the current RNA-seq analysis pipelines will also seek to make use of the best downstream tools for gene set analysis and dynamic report generation. Current RNA-seq computational pipelines do not keep track of critical pieces of metadata throughout the analysis, including genome and transcriptome version, such that final results cannot reliably be reproduced or put in the correct genomic context, as the information about annotation provenance may be lost. While fast and lightweight tools have been quickly adopted for gene- and transcript-level quantification, they are not yet optimized for certain RNA-seq analysis tasks such as quantification of allele-specific expression. We have developed a set of top-performing tools for abundance quantification and downstream inference.
We propose to formalize our existing tools into a pipeline, and build additional tools and infrastructure, which optimally estimates and propagates uncertainty from abundance estimation (described in Aim 1), and which stores critical provenance metadata automatically on the user's behalf; this metadata tagging and propagation will be integrated with community resources (described in Aim 2). Furthermore, we propose building out the capabilities of our existing quantification infrastructure to allow for improved mapping accuracy and more robust and accurate allelic expression estimation (described in Aim 3).
