
Data Discovery: Computational Methods for Searching Short-Read Sequencing Experiments

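Basic Information

Grant number:
9287168
Principal investigator:
Carleton Lee Kingsford
Amount:
approx. $284,300
Host institution:
CARNEGIE-MELLON UNIVERSITY
Host institution country:
United States
Project category:
Fiscal year:
2017
Funding country:
United States
Project period:
2017-05-01 to 2021-04-30
Project status:
Completed

Source:
https://reporter.nih.gov/project-details/9287168
Keywords: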
Algorithms Archives Area Basic Science Biological Cells Code Collection Complex Computing Methodologies DNA Sequencing Facility Darkness Data Data Discovery Data Set Databases Disease Progression Distributed Systems Elements Environment Exhibits Exons Family Foundations Generations Genes Genetic Variation Genomics Goals Healthcare Hospitals Human Microbiome Individual Investigation Malignant Neoplasms Metadata Metagenomics Methods Microbe Mutation Organism Pathway interactions Pharmacologic Substance Privatization Protein Isoforms Reproducibility Research Research Personnel Resources Sampling Scheme Silicon Dioxide Somatic Mutation Source Speed System Techniques Technology Testing The Cancer Genome Atlas Time Trees United States National Institutes of Health Variant Vision Work base cell type data sharing experimental study fusion gene gene function genetic variant genome sequencing improved indexing insertion/deletion mutation microbial community novel novel strategies open source petabyte repository transcriptome sequencing transcriptomics tumor whole genome

Project Summary

This proposal aims to solve the sequencing experiment discovery problem. The data from hundreds of thousands of short-read sequencing experiments are now publicly available, and private collections of sequencing experiments are also growing rapidly. These experiments include hundreds of thousands of whole genome sequencing experiments, and tens of thousands of RNA-seq, metagenomic, and tumor sequencing samples. However, these experiments are vastly underused, with few analyses making use of more than a handful of experiments at a time and most analyses ignoring this collection of raw data entirely. One crucial reason for this is that merely finding the appropriate experiments is a significant barrier to their use in downstream analyses. This is due to the lack of a computational platform that can search for relevant short-read sequencing data sets by the sequences they contain. It is not currently possible to find all the metagenomic experiments in which the genes that form a particular pathway are present or to find all experiments in which a novel lncRNA is observed. The experiment discovery problem is that of finding, on a global scale, those experiments that are relevant to an isoform, variant, or species under study. By building on our existing work in large-scale sequence search, we propose to develop a new distributed platform to index and search hundreds of thousands of raw short-read sequencing data sets to enable researchers to quickly find experiments that contain their query sequences. We will apply this system to searching RNA-seq, metagenomic, and cancer tumor samples. The research questions we will solve include how to improve the computational scaling, increase the types of biologically meaningful queries that can be answered, and increase our ability to find relevant experiments in situations where mutations are common. We will produce a high-quality open-source implementation of the developed computational methods.
The project will significantly expand the usefulness of large repositories of raw sequencing reads and enable new approaches for large-scale reanalysis and reuse of short-read experiments. The system will unlock a rich source of biological information for gene function prediction, for understanding microbial communities, and for connecting genetic variation with disease progression.
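
To make the experiment discovery problem concrete, the following is a minimal sketch of searching experiments by the sequences they contain. It is not the project's method, which the abstract does not specify. It assumes a naive in-memory index that stores, for each experiment, the set of length-k substrings (k-mers) found in its reads, and it reports an experiment as a hit when at least a fraction `theta` of the query's k-mers appear in it. The names `kmers`, `build_index`, `search`, `theta`, and the toy experiment IDs are all hypothetical. A threshold below 1.0 is one simple way to tolerate a few mutations in the query.

```python
from typing import Dict, Iterable, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(experiments: Dict[str, Iterable[str]], k: int) -> Dict[str, Set[str]]:
    """Map each experiment ID to the set of k-mers seen in any of its reads."""
    index: Dict[str, Set[str]] = {}
    for exp_id, reads in experiments.items():
        seen: Set[str] = set()
        for read in reads:
            seen |= kmers(read, k)
        index[exp_id] = seen
    return index


def search(index: Dict[str, Set[str]], query: str, k: int, theta: float = 0.75) -> List[str]:
    """Return sorted IDs of experiments holding at least theta of the query's k-mers."""
    query_kmers = kmers(query, k)
    if not query_kmers:  # query shorter than k: nothing to match
        return []
    return sorted(
        exp_id
        for exp_id, seen in index.items()
        if len(query_kmers & seen) / len(query_kmers) >= theta
    )


# Toy data (hypothetical): two "experiments", each a small list of reads.
toy_experiments = {
    "exp_A": ["ACGTACGTGG", "GTGGTTCA"],
    "exp_B": ["TTTTCCCCAAAA"],
}
toy_index = build_index(toy_experiments, k=4)

exact_hits = search(toy_index, "ACGTACGT", k=4, theta=1.0)
# The last base differs from exp_A, so 4 of the 5 query k-mers still match.
tolerant_hits = search(toy_index, "ACGTACGA", k=4, theta=0.75)
strict_hits = search(toy_index, "ACGTACGA", k=4, theta=1.0)
```

Storing an exact k-mer set per experiment would not scale to hundreds of thousands of data sets. The sketch only illustrates the query semantics, while the computational scaling is one of the research questions the abstract names.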

Project Summary / Abstract