
Novel algorithms and hardware designs for ultra-fast next-gen sequence analysis

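Basic information

Grant number:
8680279
Principal investigator:
Onur Mutlu
Amount:
$333,000
Host institution:
CARNEGIE-MELLON UNIVERSITY
Host institution country:
United States
Project category:
Fiscal year:
2011
Funding country:
United States
Project period:
2011-06-20 to 2016-04-30
Project status:
Completed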

Source:
https://reporter.nih.gov/project-details/8680279
Keywords:
Address Algorithm Design Algorithmic Software Algorithms Architecture Area Atlases Computer Architectures Computer software Consumption DNA Sequence Data Data Analyses Dependence Development Disease Pathway Engineering Gene Expression Profile Genes Genetic Genome Genomics Goals Healthcare Systems High-Throughput Nucleotide Sequencing Human Genome Individual Institution Knowledge Large-Scale Sequencing Length Location Maintenance Maps Medical Medical Genetics Medicine Memory Metagenomics Methods Methylation Modeling Pattern Performance Pharmacogenomics Polymorphism Analysis Preventive Process Property RNA analysis Reading Reporting Research Research Infrastructure Running Sequence Analysis Small RNA Software Tools Source Speed Stream System Techniques Technology Time Translations Variant Work abstracting base cluster computing comparative computer infrastructure computing resources cost design epigenetic variation genome sequencing histone modification improved instrument mutant new technology next generation sequencing novel parallel computer processing speed prototype software development success tool tumor

Project abstract

DESCRIPTION (provided by applicant): Abstract: With the introduction of next generation sequencing (NGS) technologies, we are facing an exponential increase in the amount of genomic sequence data. These new methods have already started to fundamentally revolutionize the area of genome research through low-cost and high-throughput genome sequencing. NGS technologies promise to impact a broad range of genetic applications. These include, but are not limited to, large-scale sequencing studies, polymorphism detection, small RNA analysis, metagenomics, comparative genomics, discovery of epigenetic variation (histone modification and methylation patterns), characterization of tumor DNA sequences, identification of mutant genes in disease pathways and transcriptome profiling. Low-cost sequencing will impact the whole health care system because sequencing of personal genomes will be a part of preventive and personalized medicine as a result of potential advancements in pharmacogenomics. The overall data throughput generated by these new technologies is enormous: for example, in the Illumina Genome Analyzer, each run produces up to 1 billion reads and >100 Gb of basepairs of sequence data. Due to the lower cost of these methods, large genome centers have started to upgrade their sequencing capabilities, and are now able to generate 500 gigabases of data per day when 40 instruments are used. Such large amounts of data overwhelm existing computational resources, and urgent action is needed to enable the translation of this rich new source of genomic information into medical benefit. The success of all medical and genetic applications of next-generation sequencing critically depends on the existence of computational technologies that can process and analyze the enormous amounts of sequence data fast and in an energy-efficient manner.
The goal of this proposal is to develop such technologies by combining the benefits of enhanced software algorithms and specialized hardware accelerators. Our proposed research aims to accelerate next generation sequence analysis 1000-fold or more by combining our knowledge in genomic sequence analysis, algorithms development, and computer architecture/engineering. Our plan to address the problems of processing unprecedented amounts of sequence data has three major components. First, we will develop and improve sophisticated software algorithms and tools to handle large amounts of sequence reads generated by all major NGS platforms without sacrificing sensitivity while correcting for the sequencing biases associated with each of the NGS platforms. Our algorithms will also be able to map reads in the duplicated regions of the genome and report the underlying sequence variation, an important feature especially to characterize segmental duplications and structural variation that no other read mapping tool can currently achieve. Second, we will boost the performance and efficiency of our algorithms (100 to 1000-fold) by accelerating the required inherently-parallel computations of the sequence search problem on massively-parallel hardware engines available today, graphics processing units (GPUs). Finally, we will design specialized hardware architectures to enhance the speed of sequence analysis beyond orders of magnitude while reducing energy consumed by it by 100-fold or more. Our research will broadly impact large-scale genome studies such as the 1000 Genomes Project, the Cancer Genome Atlas Project, and the ENCODE Project, by not only increasing their ability to reach conclusions very fast but also reducing their energy consumption and maintenance costs related to maintaining computation clusters for data analysis.
Our research, if successful, can eliminate the dependence of sequence analysis on large and power-hungry computing clusters/data-centers, thereby making sequence analysis significantly cheaper and more energy-efficient, and hence enabling sequence analysis to be performed by the mainstream without the need to build large computational infrastructures. Together with further advances in sequencing technologies, research resulting from this proposal can help personal genomics become a reality: advancement and application of pharmacogenomics will start the era of personalized medicine. Through ultra-fast, energy-efficient and cost-efficient sequence analysis, this study can pave the way to an unlimited number of new discoveries by making it feasible to analyze terabases of sequence data that cannot currently be handled with existing computational processing power.

Project outcomes

Journal articles (7)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications.

DOI:
10.1093/nar/gku370
Publication date:
2014-07
Journal:
Nucleic acids research
Impact factor:
14.9
Authors:
Hach F;Sarrafi I;Hormozdiari F;Alkan C;Eichler EE;Sahinalp SC
Corresponding author:
Sahinalp SC

GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies.

DOI:
10.1186/s12864-018-4460-0
Publication date:
2018-05-09
Journal:
BMC genomics
Impact factor:
4.4
Authors:
Kim JS;Senol Cali D;Xin H;Lee D;Ghose S;Alser M;Hassan H;Ergin O;Alkan C;Mutlu O
Corresponding author:
Mutlu O

Accelerating read mapping with FastHASH.