CAREER: Robust and scalable genome-wide phylogenetics

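Basic information

  • Award number:
    1845967
  • Principal investigator:
  • Amount:
    $549,200
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Continuing Grant
  • Fiscal year:
    2019
  • Funding country:
    United States
  • Start and end dates:
    2019-02-15 to 2024-01-31
  • Project status:
    Completed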

Project abstract

The present diversity of life has evolved from a single ancestor through billions of years of evolution. Understanding these evolutionary histories is fascinating, but more importantly, it is a crucial precursor to many biological analyses. Some evolutionary relationships are obvious (e.g., a cat is closer to a lion than to a chicken), but other consequential relationships are hard to discern. Luckily, evolution operates on the genomes of organisms, and the sequence of genetic changes leaves a trace of the evolutionary histories. Following these traces and reconstructing the evolutionary past, however, is a computational problem, and as it turns out, a difficult one. Sophisticated methods are needed to infer a phylogeny: a tree, called the tree-of-life, that shows the historical relationships between species. When sequencing whole genomes became possible in the mid-2000s, many believed the sheer amount of data would result in robust reconstructions of phylogenies. While genome sequencing has fulfilled some of its promises, other challenges remain. Large-scale data are hard to model adequately and hard to screen for errors. As a result, different analyses do not always agree, and inference algorithms are pushed to the limits of their scalability. Thus, an improved understanding of the tree-of-life requires not just more data but also better algorithms. Interestingly, as data science permeates many areas of science, the issues of robustness to error and scalability faced in phylogenetics will confront many disciplines. Thus, the next generation of data scientists needs to be trained to consider these concerns when developing algorithms for data analysis.
This project seeks to address current limitations in phylogenomics (phylogeny inference from whole genomes) and to integrate issues of robustness and scalability into teaching. The main challenge in phylogenomics is data heterogeneity, which has two sources: real biological processes driving genome evolution that lead to discordant histories across the genome, and artefactual heterogeneity that results from the complex pipelines used to prepare the data for inference. Models of real heterogeneity exist. However, current methods often require knowing the source of heterogeneity in advance, are often not scalable, and are not always robust to artefactual heterogeneity. The approach taken here is to combine unsupervised learning and discrete optimization to build methods for identifying errors. These techniques will strive to minimize assumptions and will use both parametric and non-parametric statistics. The project will draw on machine learning, multi-criteria optimization, and high-performance computing. If successful, it will dramatically improve the accuracy and scalability of genome-wide phylogeny reconstruction and will help researchers understand intricate patterns in genome evolution. To integrate research and education, this project will enable yearly hackathons that bring together students with computational and biological expertise with the goal of developing robust and scalable methods. The project will also seek to improve the understanding of data science among undergraduate and K-12 students, emphasizing for them both the excitement and the challenges of analyzing large error-prone datasets. The tools developed here will be publicly available and well-documented. Yearly workshops will be held to help biologists learn and use the tools.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project outcomes

Journal articles (22)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
TreeCluster: Clustering biological sequences using phylogenetic trees
  • DOI:
    10.1371/journal.pone.0221068
  • Publication date:
    2019-08-22
  • Journal:
  • Impact factor:
    3.7
  • Authors:
    Balaban, Metin;Moshiri, Niema;Mirarab, Siavash
  • Corresponding author:
    Mirarab, Siavash
SODA: multi-locus species delimitation using quartet frequencies
  • DOI:
    10.1093/bioinformatics/btaa1010
  • Publication date:
    2020
  • Journal:
  • Impact factor:
    5.8
  • Authors:
    Rabiee, Maryam;Mirarab, Siavash
  • Corresponding author:
    Mirarab, Siavash
Completing gene trees without species trees in sub-quadratic time
  • DOI:
    10.1093/bioinformatics/btab875
  • Publication date:
    2022-01-03
  • Journal:
  • Impact factor:
    5.8
  • Authors:
    Mai, Uyen;Mirarab, Siavash
  • Corresponding author:
    Mirarab, Siavash
Multispecies Coalescent: Theory and Applications in Phylogenetics
  • DOI:
    10.1146/annurev-ecolsys-012121-095340
  • Publication date:
    2021
  • Journal:
  • Impact factor:
    0
  • Authors:
    Mirarab, Siavash;Nakhleh, Luay;Warnow, Tandy
  • Corresponding author:
    Warnow, Tandy
TAPER: Pinpointing errors in multiple sequence alignments despite varying rates of evolution
  • DOI:
    10.1111/2041-210x.13696
  • Publication date:
    2021
  • Journal:
  • Impact factor:
    6.6
  • Authors:
    Zhang, Chao;Zhao, Yiming;Braun, Edward L.;Mirarab, Siavash
  • Corresponding author:
    Mirarab, Siavash

Other publications by Siavash Mir arabbaygi

A Bayesian Framework for Software Regression Testing
  • DOI:
  • Publication date:
    2008-08
  • Journal:
  • Impact factor:
    0
  • Authors:
    Siavash Mir arabbaygi
  • Corresponding author:
    Siavash Mir arabbaygi
Novel scalable approaches for multiple sequence alignment and phylogenomic reconstruction
  • DOI:
  • Publication date:
    2015-08
  • Journal:
  • Impact factor:
    0
  • Authors:
    Siavash Mir arabbaygi
  • Corresponding author:
    Siavash Mir arabbaygi

Other grants by Siavash Mir arabbaygi

III: Small: New algorithms for genome skimming and its applications
  • Award number:
    1815485
  • Fiscal year:
    2018
  • Award amount:
    $549,200
  • Award type:
    Standard Grant
CRII: III: Using Genomic Context to Understand Evolutionary Histories of Individual Genes
  • Award number:
    1565862
  • Fiscal year:
    2016
  • Award amount:
    $549,200
  • Award type:
    Standard Grant

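Similar grants from the National Natural Science Foundation of China

Research on robust strategy analysis and robust optimization methods in supply chain management
  • Award number:
    70601028
  • Approval year:
    2006
  • Award amount:
    CNY 70,000
  • Award type:
    Young Scientists Fund project
Research on robust speech recognition methods under psychological tension and stress
  • Award number:
    60085001
  • Approval year:
    2000
  • Award amount:
    CNY 140,000
  • Award type:
    Special Fund project
Research on robust speech recognition methods
  • Award number:
    69075008
  • Approval year:
    1990
  • Award amount:
    CNY 35,000
  • Award type:
    General Program project
Improved robust sequential detection techniques
  • Award number:
    68671030
  • Approval year:
    1986
  • Award amount:
    CNY 20,000
  • Award type:
    General Program project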

Similar overseas grants

ERI: Robust and Scalable Manufacturing of Ultra-Sensitive and Selective Molecule Sensor Arrays
  • Award number:
    2301668
  • Fiscal year:
    2024
  • Award amount:
    $549,200
  • Award type:
    Standard Grant
CAREER: Scalable and Robust Uncertainty Quantification using Subsampling Markov Chain Monte Carlo Algorithms
  • Award number:
    2340586
  • Fiscal year:
    2024
  • Award amount:
    $549,200
  • Award type:
    Continuing Grant
EAGER: Quantum Manufacturing: Supporting Future Quantum Applications by Developing a Robust, Scalable Process to Create Diamond Nitrogen-Vacancy Center Qubits
  • Award number:
    2242049
  • Fiscal year:
    2023
  • Award amount:
    $549,200
  • Award type:
    Standard Grant
Collaborative Research: SaTC: CORE: Small: Towards Robust, Scalable, and Resilient Radio Fingerprinting
  • Award number:
    2225161
  • Fiscal year:
    2023
  • Award amount:
    $549,200
  • Award type:
    Standard Grant
Developing robust and scalable genomics tools and databases to analyze immune receptor repertoires across diverse populations
  • Award number:
    10656981
  • Fiscal year:
    2023
  • Award amount:
    $549,200
  • Award type:
Collaborative Research: CISE-MSI: DP: RI: Towards Scalable, Resilient and Robust Foraging with Heterogeneous Robot Swarms
  • Award number:
    2318682
  • Fiscal year:
    2023
  • Award amount:
    $549,200
  • Award type:
    Standard Grant
Collaborative Research: U.S.-Ireland R&D Partnership: CIF: AF: Small: Enabling Beyond-5G Wireless Access Networks with Robust and Scalable Cell-Free Massive MIMO
  • Award number:
    2322191
  • Fiscal year:
    2023
  • Award amount:
    $549,200
  • Award type:
    Standard Grant
Collaborative Research: CISE-MSI: DP: RI: Towards Scalable, Resilient and Robust Foraging with Heterogeneous Robot Swarms
  • Award number:
    2318683
  • Fiscal year:
    2023
  • Award amount:
    $549,200
  • Award type:
    Standard Grant
Robust and scalable algorithms for learning hidden structures in sparse network data with the aid of side information
  • Award number:
    2311024
  • Fiscal year:
    2023
  • Award amount:
    $549,200
  • Award type:
    Standard Grant
Collaborative Research: U.S.-Ireland R&D Partnership: CIF: AF: Small: Enabling Beyond-5G Wireless Access Networks with Robust and Scalable Cell-Free Massive MIMO
  • Award number:
    2322190
  • Fiscal year:
    2023
  • Award amount:
    $549,200
  • Award type:
    Standard Grant