Collaborative Research: Principal Component Analysis over Tree Spaces and Its Applications to Phylogenomics

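Basic information

  • Award number:
    1916496
  • Principal investigator:
  • Amount:
    $118,800
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2019
  • Funding country:
    United States
  • Project period:
    2019-10-01 to 2023-09-30
  • Project status:
    Completed

Project abstract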

Phylogenomics is a relatively new field that seeks to understand evolutionary relationships between organisms at the scale of the whole genome. One of the central goals of evolutionary biology is a better understanding of the relationships between organisms, usually summarized in the form of a phylogenetic tree. The methods in common use for constructing these trees tend to work best for closely related organisms and relatively short sequences; for example, the DNA sequence of a single gene across a collection of mammals. When comparing more distantly related organisms, or data from large portions of the genome, current techniques can break down. Since modern technology can quickly and cheaply produce genome-scale sequence data, there is a pressing need for better analytical tools tailored to this large-scale, high-dimensional data. The most popular statistical methods for finding general patterns in large-scale data, such as Principal Component Analysis (PCA), assume that the space where the data lies is flat, like the plane geometry of Euclid. However, the space of possible phylogenetic trees has a decidedly non-Euclidean geometry, with a surface more akin to an origami figure made with a sheet of rubber. The goal of this project is to develop alternative types of principal components, and methods to calculate them, which take into account the unusual structural features of the mathematical space of phylogenetic trees.

PCA is a statistical method that projects data points in a high-dimensional Euclidean space onto a lower-dimensional plane, chosen to minimize the sum of squared distances between each point in the data set and its orthogonal projection onto the plane. It has been used for clustering high-dimensional data points for statistical analysis, and it is one of the simplest and most robust methods of dimensionality reduction in a Euclidean vector space. However, it assumes the properties of a Euclidean vector space.
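
As a point of reference for the Euclidean case described above, the following is a minimal sketch of finding the first principal component (the best-fit line through the data) and measuring the sum of squared distances to it. It is not taken from the project; the function names, the power-iteration approach, and the starting vector are illustrative assumptions, and a production analysis would normally rely on an established linear algebra library.

```python
import math

def first_principal_component(points, iters=200):
    """Return (mean, unit direction) of the best-fit line through the points.

    Uses power iteration on the scatter matrix (an illustrative choice).
    """
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    # Scatter matrix: sum over points of the outer product c * c^T.
    scatter = [[sum(c[a] * c[b] for c in centered) for b in range(d)]
               for a in range(d)]
    # Assumed starting vector; it must not be orthogonal to the leading direction.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(scatter[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("degenerate data or unlucky starting vector")
        v = [x / norm for x in w]
    return mean, v

def residual_sum_of_squares(points, mean, v):
    """Sum of squared distances from each point to its orthogonal projection
    onto the line through `mean` with unit direction `v`."""
    total = 0.0
    for p in points:
        c = [pj - mj for pj, mj in zip(p, mean)]
        t = sum(cj * vj for cj, vj in zip(c, v))  # coordinate along the line
        total += sum((cj - t * vj) ** 2 for cj, vj in zip(c, v))
    return total

if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # points on y = 2x
    mean, v = first_principal_component(pts)
    print(mean, v, residual_sum_of_squares(pts, mean, v))
```

Every step here (the mean, subtraction, inner products, orthogonal projection) relies on vector-space operations, which is exactly what a tree-space does not provide.
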
The space of all possible phylogenies on a fixed set of species does not form a Euclidean vector space, so PCA must be reformulated in the geometry of a tree-space. Motivated by T. Nye's 2011 work on the construction of the first principal component, or principal geodesic, the PIs propose two geometric objects under different metrics which represent a k-th order principal component: (1) the locus of the weighted Fréchet mean of k+1 points in a tree-space, where the weights vary over the associated probability simplex, under the Billera-Holmes-Vogtmann (BHV) metric, and (2) the tropical convex hull of k+1 points in a tree-space under the tropical metric from tropical geometry, which is built on the max-plus algebra. The first aim of this project is to prove properties of PCA under the BHV metric and PCA under the tropical metric over tree-spaces. The second aim is to develop efficient algorithms to compute or approximate them. Simulation studies will be conducted to show that these algorithms perform well. The PIs will then apply these algorithms to empirical data sets, such as Apicomplexa (a phylum of parasitic alveolates that includes the malaria parasites), African coelacanth genomes, and hemagglutinin sequences for influenza from New York. The broader impact will include advising undergraduate students on the implementation of the algorithms and user interfaces of the software products. These research experiences will complement a new Data Science program being developed as a component of the current Hawaii EPSCoR program. A portion of the summer effort will also be used to collaborate with nearby high school science and engineering programs in the development of data analysis lesson modules.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
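
To make object (2) in the abstract more concrete, here is a small sketch of the tropical metric and of projecting a point onto a tropical convex hull in the max-plus convention. The abstract does not give formulas; the ones below are the standard definitions from the tropical convexity literature, used here as an assumption. The function names are hypothetical, and the vectors are plain coordinate tuples rather than real tree encodings.

```python
def tropical_distance(u, v):
    """Tropical metric: max_i(u_i - v_i) - min_i(u_i - v_i).

    Adding the same constant to every coordinate of a point does not
    change the distance, so points are treated up to such shifts.
    """
    diffs = [a - b for a, b in zip(u, v)]
    return max(diffs) - min(diffs)

def project_to_tropical_hull(x, vertices):
    """Project x onto the max-plus tropical convex hull of `vertices`.

    Assumed standard formula: for each vertex v_i take
    lambda_i = min_j (x_j - v_ij), the largest shift that keeps
    lambda_i + v_i below x in every coordinate, then combine the shifted
    vertices with a coordinate-wise maximum.
    """
    lams = [min(xj - vj for xj, vj in zip(x, v)) for v in vertices]
    return [max(lam + v[j] for lam, v in zip(lams, vertices))
            for j in range(len(x))]

if __name__ == "__main__":
    # Illustrative vertices only; k + 1 = 2 points span a tropical segment.
    vertices = [(0, 0, 0), (0, 3, 1)]
    x = (0, 1, 2)
    p = project_to_tropical_hull(x, vertices)
    print(p, tropical_distance(x, p))
```

In a tropical PCA setting, one would search for the k+1 vertices that minimize the sum of tropical distances between the data points and their projections; that optimization is not shown here.
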

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Grady Weyenberg

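Similar NSFC grants

Research on Quantum Field Theory without a Lagrangian Description
  • Grant number:
    24ZR1403900
  • Approval year:
    2024
  • Funding amount:
    CNY 0
  • Project type:
    Provincial/municipal project
Cell Research
  • Grant number:
    31224802
  • Approval year:
    2012
  • Funding amount:
    CNY 240,000
  • Project type:
    Special fund project
Cell Research
  • Grant number:
    31024804
  • Approval year:
    2010
  • Funding amount:
    CNY 240,000
  • Project type:
    Special fund project
Cell Research
  • Grant number:
    30824808
  • Approval year:
    2008
  • Funding amount:
    CNY 240,000
  • Project type:
    Special fund project
Research on the Rapid Growth Mechanism of KDP Crystal
  • Grant number:
    10774081
  • Approval year:
    2007
  • Funding amount:
    CNY 450,000
  • Project type:
    General Program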

Similar overseas grants

Collaborative Research: Dynamics of Short Range Order in Multi-Principal Element Alloys
  • Award number:
    2348956
  • Fiscal year:
    2024
  • Funding amount:
    $118,800
  • Project type:
    Standard Grant
Collaborative Research: Dynamics of Short Range Order in Multi-Principal Element Alloys
  • Award number:
    2348955
  • Fiscal year:
    2024
  • Funding amount:
    $118,800
  • Project type:
    Standard Grant
Collaborative Research: Elucidating High Temperature Deformation Mechanisms in Refractory Multi-Principal-Element Alloys
  • Award number:
    2313860
  • Fiscal year:
    2023
  • Funding amount:
    $118,800
  • Project type:
    Standard Grant
Collaborative Research: Elucidating High Temperature Deformation Mechanisms in Refractory Multi-Principal-Element Alloys
  • Award number:
    2313861
  • Fiscal year:
    2023
  • Funding amount:
    $118,800
  • Project type:
    Standard Grant
Collaborative Research: Randomized Numerical Linear Algebra for Large Scale Inversion, Sparse Principal Component Analysis, and Applications
  • Award number:
    2152661
  • Fiscal year:
    2022
  • Funding amount:
    $118,800
  • Project type:
    Standard Grant
Collaborative Research: Randomized Numerical Linear Algebra for Large Scale Inversion, Sparse Principal Component Analysis, and Applications
  • Award number:
    2152704
  • Fiscal year:
    2022
  • Funding amount:
    $118,800
  • Project type:
    Standard Grant
Collaborative Research: Microscopic Mechanism of Surface Oxide Formation in Multi-Principal Element Alloys
  • Award number:
    2219489
  • Fiscal year:
    2022
  • Funding amount:
    $118,800
  • Project type:
    Standard Grant
2022 Collaborative Research in Computational Neuroscience (CRCNS) Principal Investigators Meeting
  • Award number:
    2236749
  • Fiscal year:
    2022
  • Funding amount:
    $118,800
  • Project type:
    Standard Grant
Collaborative Research: Randomized Numerical Linear Algebra for Large Scale Inversion, Sparse Principal Component Analysis, and Applications
  • Award number:
    2152687
  • Fiscal year:
    2022
  • Funding amount:
    $118,800
  • Project type:
    Standard Grant
Collaborative Research: Microscopic Mechanism of Surface Oxide Formation in Multi-Principal Element Alloys
  • Award number:
    2219416
  • Fiscal year:
    2022
  • Funding amount:
    $118,800
  • Project type:
    Continuing Grant