Meaningful Data Compression and Reduction of High-Throughput Sequencing Data

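Basic Information

Grant number:
9336154
Principal investigator:
Martin Farach-Colton
Amount:
$243,300
Host institution:
RUTGERS, THE STATE UNIV OF N.J.
Host institution country:
United States
Project category:
Fiscal year:
2015
Funding country:
United States
Start and end dates:
2015-09-18 to 2018-08-31
Project status:
Completed

Source:
https://reporter.nih.gov/project-details/9336154
Keywords: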
Address; Algorithms; Area; Big Data; Biological; Biology; Biomedical Research; Clinical; Cohort Studies; Computer software; Consensus; Consensus Sequence; DNA Sequence; Data; Data Compression; Data Set; Development; Disease; Evolution; Funding; Future; Genes; Genetic Screening; Genetic Variation; Genetic screening method; Genome; Genomic approach; Genomics; Goals; Graph; Healthcare; High-Throughput Nucleotide Sequencing; Hybrids; Individual; Joints; Lead; Libraries; Life; Location; Malignant Neoplasms; Maps; Memory; Methods; Noise; Outcome; Positioning Attribute; Prevalence; Privacy; Research; Research Personnel; Savings; Scheme; Scientific Inquiry; Secure; Seeds; Sequence Alignment; Source; Standardization; Technology; Time; United States National Institutes of Health; Variant; anticancer research; application programming interface; base; big biomedical data; clinical care; cohort; computer infrastructure; computing resources; cost; data reduction; experimental study; flexibility; genetic risk factor; genetic variant; genomic data; genomic tools; high throughput analysis; improved; indexing; insight; middleware; open source; operation; personalized medicine; public health relevance; reference genome; tool; transmission process; virtual

Project Abstract

DESCRIPTION (provided by applicant): High-throughput sequencing (HTS), a technology for determining DNA sequences on a large scale, is pervasive in clinical and biological applications such as studying the spectrum of genetic variations and their relation to disease. As costs fall further, sequencing is expected to gain significant momentum: it will replace commonly used genetic tests in clinical care for life-threatening diseases such as cancer, and will consequently produce enormous amounts of data. The rise of personalized medicine will eventually lead to the point where every individual can be routinely screened for genetic risk factors using HTS.

The goal of the proposed research is to boost the analysis of HTS data with a compressive genomics middleware that provides compressed, reduced representations of HTS data. The representations are meaningful in that sequence information likely to cover the same location in the sequenced genome is brought together. Because existing and future methods and algorithms can operate directly on this representation, the proposal saves not only storage space and transmission time but also the CPU time needed for analysis.

The project has three aims:
1) Develop a clustering algorithm for single and paired HTS read libraries that rapidly recognizes overlapping reads. Establish a lossless compression scheme based on clusters, which facilitates downstream computations directly on the compressed data without decompression. Extend the approach to joint compression of multiple HTS libraries.
2) Introduce meaningful reduced representations that further decrease memory demands by prioritizing sequence information likely to be correct and discarding information likely to be erroneous.
3) Adapt important HTS analysis tools to our compressive genomics approach, in particular read mapping; de novo genome assembly, using cluster consensus sequences as virtual, elongated reads for a hybrid assembly scheme; and discovery of structural variants based on cluster mapping positions and ambiguities in the assignment of sequences to clusters.

Our results will aid in improving health care outcomes by increasing analysis quality, lowering costs, and making the analysis of HTS data more widely accessible. This will impact areas of scientific inquiry from understanding the genetic variations underlying disease to personal genomics.
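To make the idea behind aim 1 concrete, here is a minimal Python sketch of cluster-based lossless read compression. It is an illustration only, not the project's algorithm: the abstract does not say how overlaps are detected or how clusters are encoded. The sketch assumes reads are grouped by their lexicographically smallest k-mer (a minimizer) as a crude stand-in for overlap detection, and that each read is stored as a start column, a length, and a list of mismatches against a per-cluster majority consensus. All function names and the choice of k are hypothetical.

```python
from collections import Counter, defaultdict


def anchor(read, k):
    """Return (minimizer, offset): the smallest k-mer and its first position.

    Assumes len(read) >= k.
    """
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    smallest = min(kmers)
    return smallest, kmers.index(smallest)


def cluster_reads(reads, k=5):
    """Group reads that share a minimizer (assumed proxy for overlap)."""
    clusters = defaultdict(list)
    for read in reads:
        clusters[anchor(read, k)[0]].append(read)
    return dict(clusters)


def compress_cluster(reads, k=5):
    """Encode one cluster as a consensus plus per-read mismatch lists."""
    offsets = [anchor(read, k)[1] for read in reads]
    shift = max(offsets)
    # Align reads so that their minimizers occupy the same columns.
    starts = [shift - offset for offset in offsets]
    width = max(start + len(read) for start, read in zip(starts, reads))
    columns = [Counter() for _ in range(width)]
    for start, read in zip(starts, reads):
        for i, base in enumerate(read):
            columns[start + i][base] += 1
    # Majority vote per column; every column is covered by at least one read.
    consensus = "".join(col.most_common(1)[0][0] for col in columns)
    records = []
    for start, read in zip(starts, reads):
        diffs = [(i, base) for i, base in enumerate(read)
                 if consensus[start + i] != base]
        records.append((start, len(read), diffs))
    return consensus, records


def decompress_cluster(consensus, records):
    """Rebuild the original reads exactly from consensus and mismatches."""
    reads = []
    for start, length, diffs in records:
        bases = list(consensus[start:start + length])
        for i, base in diffs:
            bases[i] = base
        reads.append("".join(bases))
    return reads


if __name__ == "__main__":
    # Made-up example reads; the third carries one substituted base.
    example = ["ACGTTGCAAGGC", "GTTGCAAGGCTT", "TTGCATGGCTTA", "CGATCGGATCCA"]
    for key, members in cluster_reads(example).items():
        consensus, records = compress_cluster(members)
        assert decompress_cluster(consensus, records) == members
        print(key, consensus, records)
```

In this sketch the consensus string plays the role of the "virtual, elongated read" mentioned in aim 3. Dropping the mismatch lists would give a lossy, reduced form loosely in the spirit of aim 2, though the abstract does not specify how information "likely to be erroneous" is identified.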
