CD-HIT: A Fast Program to Cluster and Compare Large Sets of Biological Sequences
Basic information
- Grant number: 7495498
- Principal investigator:
- Amount: $347,600
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year: 2008
- Funding country: United States
- Project period: 2008-09-01 to 2011-06-30
- Project status: Completed
- Source:
- Keywords: Address; Algorithms; Base Sequence; Bioinformatics; Biological; Classification; Communities; Data Set; Databases; Development; Documentation; Environment; Family; Feedback; Funding; Future; Goals; Growth; Imagery; Individual; Internet; Link; Maintenance; Manuals; Metagenomics; Methods; Numbers; Performance; Play; Pliability; Policies; Proteins; Public Health; Publications; Research Personnel; Resources; Role; Sequence Analysis; Set protein; Software Engineering; Speed; Techniques; Universities; Update; Work; abstracting; base; cluster computing; computer program; data structure; genome sequencing; improved; open source; portability; programs; tool
Project abstract
DESCRIPTION (provided by applicant):
Project Summary/Abstract
CD-HIT is a computer program for clustering and comparing large sets of protein or nucleotide sequences. It helps to significantly reduce the computational and manual effort in various sequence analysis tasks, and it aids in understanding the data structure and correcting the bias within a dataset. CD-HIT is 2 to 3 orders of magnitude faster than other methods. It can handle extremely large databases and has been used extensively in various fields. Judging by user feedback and the growing number of publications that cite it, CD-HIT is becoming increasingly popular. It now has thousands of users and is routinely used in many popular databases, such as UniProt and PDB.
Researchers now face serious challenges from the explosive growth of public sequence databases, which results from high-throughput genome sequencing projects and the very recent environmental metagenomic projects. Routine analysis, from searching a database to building a multiple alignment, is becoming more computationally expensive and complicated. An efficient clustering method is crucial to addressing many of these challenges and helping researchers overcome them. Currently, no other available program can replace CD-HIT in terms of speed and the ability to handle very large datasets. CD-HIT will therefore play a more important role in the future.
The goal of this proposal is the further improvement and development of the CD-HIT program and related applications, to better serve the growing user community and to address the issues raised by CD-HIT users. The algorithm will be improved to achieve better performance and overcome its existing limitations. Effort will be directed towards more accurate clustering results while maintaining the ultrahigh speed. New functions will be implemented to meet various clustering and comparison needs.
Enhanced maintenance and better software engineering techniques will be adopted to provide regular program releases and updates, better portability, shorter troubleshooting cycles, and richer documentation. Subject to University policies, CD-HIT will remain an open source package. In addition, a web server will be set up for easier public access to CD-HIT's applications. The server will provide further analysis and visualization tools, as well as interfaces and links to other bioinformatics resources. Pre-calculated popular datasets will be made available to the public so that individual labs do not need to repeat the same work.
Project Narrative
CD-HIT is a fast computer program for clustering and comparing biological sequences, used by thousands of researchers in public health related studies. It directly helps researchers to significantly reduce the effort involved in sequence analysis and to correct the bias within public databases. Continued development of CD-HIT will better serve researchers who face growing challenges in sequence analysis caused by the explosive growth of public sequence databases.
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by Weizhong Li
A study of antibiotics usage on early gut microbiome colonization and establishment in young children
- Grant number: 10113538
- Fiscal year: 2020
- Amount: $347,600
- Project category:
Novel Methods for Effective Analysis Assembly and Comparison of HMP Sequences
- Grant number: 8294893
- Fiscal year: 2010
- Amount: $347,600
- Project category:
Novel Methods for Effective Analysis Assembly and Comparison of HMP Sequences
- Grant number: 8020878
- Fiscal year: 2010
- Amount: $347,600
- Project category:
Novel Methods for Effective Analysis Assembly and Comparison of HMP Sequences
- Grant number: 8150493
- Fiscal year: 2010
- Amount: $347,600
- Project category:
CD-HIT: A Fast Program to Cluster and Compare Large Sets of Biological Sequences
- Grant number: 7892867
- Fiscal year: 2009
- Amount: $347,600
- Project category:
CD-HIT: A Fast Program to Cluster and Compare Large Sets of Biological Sequences
- Grant number: 7682840
- Fiscal year: 2008
- Amount: $347,600
- Project category:
Similar overseas grants
DMS-EPSRC: Asymptotic Analysis of Online Training Algorithms in Machine Learning: Recurrent, Graphical, and Deep Neural Networks
- Grant number: EP/Y029089/1
- Fiscal year: 2024
- Amount: $347,600
- Project category: Research Grant
CAREER: Blessing of Nonconvexity in Machine Learning - Landscape Analysis and Efficient Algorithms
- Grant number: 2337776
- Fiscal year: 2024
- Amount: $347,600
- Project category: Continuing Grant
CAREER: From Dynamic Algorithms to Fast Optimization and Back
- Grant number: 2338816
- Fiscal year: 2024
- Amount: $347,600
- Project category: Continuing Grant
CAREER: Structured Minimax Optimization: Theory, Algorithms, and Applications in Robust Learning
- Grant number: 2338846
- Fiscal year: 2024
- Amount: $347,600
- Project category: Continuing Grant
CRII: SaTC: Reliable Hardware Architectures Against Side-Channel Attacks for Post-Quantum Cryptographic Algorithms
- Grant number: 2348261
- Fiscal year: 2024
- Amount: $347,600
- Project category: Standard Grant
CRII: AF: The Impact of Knowledge on the Performance of Distributed Algorithms
- Grant number: 2348346
- Fiscal year: 2024
- Amount: $347,600
- Project category: Standard Grant
CRII: CSR: From Bloom Filters to Noise Reduction Streaming Algorithms
- Grant number: 2348457
- Fiscal year: 2024
- Amount: $347,600
- Project category: Standard Grant
EAGER: Search-Accelerated Markov Chain Monte Carlo Algorithms for Bayesian Neural Networks and Trillion-Dimensional Problems
- Grant number: 2404989
- Fiscal year: 2024
- Amount: $347,600
- Project category: Standard Grant
CAREER: Efficient Algorithms for Modern Computer Architecture
- Grant number: 2339310
- Fiscal year: 2024
- Amount: $347,600
- Project category: Continuing Grant
CAREER: Improving Real-world Performance of AI Biosignal Algorithms
- Grant number: 2339669
- Fiscal year: 2024
- Amount: $347,600
- Project category: Continuing Grant