Compressive genomics for large omics data sets: Algorithms, applications & tools

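Basic information

Grant number:
8599836
Principal investigator:
BONNIE BERGER
Amount:
$217,900
Host institution:
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
Host institution country:
United States
Project category:
Fiscal year:
2013
Funding country:
United States
Project period:
2013-09-05 to 2016-05-31
Project status:
Completed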

Project abstract

DESCRIPTION (provided by applicant): High-throughput experimental technologies are generating increasingly massive and complex genomic sequence data sets. While these data hold the promise of uncovering entirely new biology, their sheer enormity threatens to make their interpretation computationally infeasible. The goal of this project is to design and develop innovative compression-based algorithmic techniques and publicly-available software for large-scale genomic sequence data sets. The key underlying observation is that most genomes currently being sequenced share much similarity with genomes that have already been collected. Thus, the amount of new sequence information is growing much more slowly than the total size of genomic sequence data sets. In very recent work, we have provided a proof-of-concept that this redundancy can be exploited by compressing sequence data in such a way as to allow direct computation on the compressed data, a methodological paradigm we term "compressive genomics." In this proposal we broaden the framework of compressive genomics to several additional application areas in which algorithmic advances are urgently needed in order to keep pace with the growth in both genomic and protein sequencing data. In particular, we will build a novel comprehensive framework for compressive representation and highly efficient downstream analysis of large-scale next-generation sequencing (NGS) data sets; this will significantly advance the state of the art and scale over existing algorithms as the volume of genomic data grows, thus meeting the challenge of the expected future acceleration of sequencing technologies. Additionally, we will develop advanced, compressively-accelerated algorithms and software for specific applications of current interest in bioinformatics and apply them to real large-scale 'omics' data sets to accelerate data analytics and lead to novel biological discoveries. 
Namely, we will collaborate with the Kohane lab on analysis of high-throughput gene expression and NGS data sets from patients with neurodevelopmental disorders, including Autism Spectrum Disorder and Parkinson's; the broad, long-term goal is to apply our compressive approach to such massive data sets to elucidate the still obscure molecular landscape of these diseases. Understanding massive 'omics' data from patients will empower both rational, targeted drug design and more intelligent disease management, yet their sheer enormity threatens to make the arising problems computationally infeasible. Here, we develop computational methods and tools that will fundamentally advance the state-of-the-art in storage, retrieval and analysis of these rapidly expanding data sets.
