BIGDATA: F: Metric-space Positioning Systems for Symbolic Data Science

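Basic Information

  • Award number:
    1836914
  • Principal investigator:
  • Amount:
    $610,600
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Standard Grant
  • Fiscal year:
    2018
  • Funding country:
    United States
  • Project period:
    2018-10-01 to 2023-09-30
  • Project status:
    Completed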

Project Abstract

Next-generation DNA sequencing technologies produce datasets that are the epitome of "big data." Resulting files are typically quite large, consisting almost entirely of symbolic (i.e., non-numeric) short DNA sequences. In contrast, the most widely used machine learning algorithms require numerical datasets to learn. Unfortunately, both traditional and cutting-edge methods to numerically represent symbolic data often suffer from high-dimensionality or substantial running time requirements, which hinder the application of powerful machine learning algorithms to modern biological questions. To overcome these crucial issues, this project addresses the fundamental problem of determining the "right" dimension in which to embed symbolic data for a data-mining or classification task. It does so by representing symbolic datasets numerically via a method reminiscent of Global Positioning Systems (GPS) but in a far more general setting. Besides exploring modern biology applications, the project will also investigate how to predict the source of a spread (i.e., ground zero) over large networks. This may assist administrators in determining how best to respond to new epidemics and cyber-threats. Additionally, the project will closely mentor undergraduate and graduate students to become mature data scientists. Its findings will be communicated as notes, open-source software, and video-lectures available to the general public, including students in the Colorado Data Science Team, which encourages the participation of women and under-represented minorities in Engineering education.

Much like GPS uses trilateration to locate a receiver anywhere on the planet, finite metric spaces contain resolving sets, that is, sets of points that uniquely identify every point in the space via multilateration (i.e., the vector of distances to points in the set). Associated with any resolving set R, there is a one-to-one transformation from its ambient metric space to a Euclidean space of dimension |R|, the cardinality of R. The smallest resolving set thus induces the lowest-dimensional representation of its ambient space. Importantly, even when the ambient metric space is finite but exponentially large, its metric dimension is often much smaller than its cardinality. Determining the metric dimension is, however, an NP-hard problem in a variety of contexts. Building on this abstracted notion of multilateration, this project will: (1) assess the computational complexity of calculating the metric dimension of Hamming graphs, and characterize the metric dimension of various random graph models to guide the development of new and efficient algorithms to approximate this quantity; (2) explore relaxations and constraints of multilateration, including approximate and probabilistic algorithms, to expand the reach of applications of multilateration to other finite but large metric spaces; and finally (3) provide proofs-of-concept of multilateration to learn non-contiguous regions of dependencies in genomic sequences, develop classifiers for historically elusive virus targets, and identify the source of spread of information or disease in large networks.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
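The resolving-set idea described in the abstract can be illustrated with a small sketch. It is not the project's software: the function names (`hamming`, `embed`, `is_resolving`, `metric_dimension`) and the choice of DNA 2-mers under Hamming distance as the example space are assumptions made here for illustration. The search is brute force, so it only suits tiny spaces, consistent with the abstract's remark that the general problem is NP-hard.

```python
from itertools import combinations, product


def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def embed(point, landmarks, dist=hamming):
    """Multilateration: the vector of distances from `point` to each landmark."""
    return tuple(dist(point, r) for r in landmarks)


def is_resolving(landmarks, space, dist=hamming):
    """True if every point of `space` receives a distinct distance vector."""
    images = {embed(p, landmarks, dist) for p in space}
    return len(images) == len(space)


def metric_dimension(space, dist=hamming):
    """Brute-force search for a smallest resolving set (exponential time)."""
    for k in range(1, len(space) + 1):
        for candidate in combinations(space, k):
            if is_resolving(candidate, space, dist):
                return k, candidate
    raise ValueError("space must be non-empty")


# Illustrative space: all 16 DNA 2-mers, i.e. the Hamming graph H(2, 4).
kmers = ["".join(p) for p in product("ACGT", repeat=2)]
k, R = metric_dimension(kmers)

# Each symbolic 2-mer becomes a numeric vector of length k.
for kmer in kmers:
    print(kmer, embed(kmer, R))
```

For this space the search finds a resolving set of 4 landmarks, so each of the 16 symbolic sequences maps to a distinct 4-dimensional integer vector. Three landmarks are not enough: for example, `("AA", "CC", "GG")` assigns the same distance vector to `"AC"` and `"CA"`.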

Project Outcomes

Journal articles (6)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Levenshtein graphs: Resolvability, automorphisms & determining sets
  • DOI:
    10.1016/j.disc.2022.113310
  • Published:
    2023
  • Journal:
  • Impact factor:
    0.8
  • Authors:
    Ruth, Perrin E.; Lladser, Manuel E.
  • Corresponding author:
    Lladser, Manuel E.
Sparsification of large ultrametric matrices: insights into the microbial Tree of Life
Getting the Lay of the Land in Discrete Space: A Survey of Metric Dimension and Its Applications
  • DOI:
    10.1137/21m1409512
  • Published:
    2023-12-01
  • Journal:
  • Impact factor:
    10.2
  • Authors:
    Tillquist, Richard C.; Frongillo, Rafael M.; Lladser, Manuel E.
  • Corresponding author:
    Lladser, Manuel E.
Truncated metric dimension for finite graphs
  • DOI:
    10.1016/j.dam.2022.04.021
  • Published:
    2022
  • Journal:
  • Impact factor:
    1.1
  • Authors:
    Frongillo, Rafael M.; Geneson, Jesse; Lladser, Manuel E.; Tillquist, Richard C.; Yi, Eunjeong
  • Corresponding author:
    Yi, Eunjeong
Metric Dimension
  • DOI:
    10.4249/scholarpedia.53881
  • Published:
    2019-10
  • Journal:
  • Impact factor:
    0
  • Authors:
    Richard C. Tillquist; Rafael M. Frongillo; M. Lladser
  • Corresponding author:
    Richard C. Tillquist; Rafael M. Frongillo; M. Lladser

Other Grants by Manuel Lladser

AMC-SS: Markovian Embeddings for the Analysis and Computation of Patterns in non-Markovian Random Sequences
  • Award number:
    0805950
  • Fiscal year:
    2008
  • Award amount:
    $610,600
  • Award type:
    Continuing Grant

Similar Overseas Grants

Construction of metric space for datasets and learning algorithms
  • Award number:
    22H03620
  • Fiscal year:
    2022
  • Award amount:
    $610,600
  • Award type:
    Grant-in-Aid for Scientific Research (B)
Dimension Reduction and Data Visualization for Regression Analysis of Metric-Space-Valued Data
  • Award number:
    2210775
  • Fiscal year:
    2022
  • Award amount:
    $610,600
  • Award type:
    Standard Grant
CareBand: A Collaborative Pilot Study to Optimize a Life-Space Performance Metric for Monitoring and Early Detection of Alzheimer's Disease and Related Dementias in Rural and Indigenous Communities
  • Award number:
    10212745
  • Fiscal year:
    2021
  • Award amount:
    $610,600
  • Award type:
Geometric characterization of nonlinear metric spaces
  • Award number:
    20K14333
  • Fiscal year:
    2020
  • Award amount:
    $610,600
  • Award type:
    Grant-in-Aid for Early-Career Scientists
Hamilton-Jacobi equations on metric measure spaces
  • Award number:
    20K22315
  • Fiscal year:
    2020
  • Award amount:
    $610,600
  • Award type:
    Grant-in-Aid for Research Activity Start-up
Geometric structure of Weil-Petersson metric on infinite dimensional Teichmuller space
  • Award number:
    18K13410
  • Fiscal year:
    2018
  • Award amount:
    $610,600
  • Award type:
    Grant-in-Aid for Early-Career Scientists
From mimesis to metric: The changing awareness of space as a category of administration caused by cartography in Northern Germany during the late 16th and early 17th century
  • Award number:
    328856666
  • Fiscal year:
    2017
  • Award amount:
    $610,600
  • Award type:
    Research Grants
Banach Space and Metric Geometry
  • Award number:
    1565826
  • Fiscal year:
    2016
  • Award amount:
    $610,600
  • Award type:
    Continuing Grant
Consolidated Dynamic Saliency Metric for multiple dynamics in moving camera view
  • Award number:
    15K00236
  • Fiscal year:
    2015
  • Award amount:
    $610,600
  • Award type:
    Grant-in-Aid for Scientific Research (C)
Research on complete quasi-metric spaces with algebraic structure
  • Award number:
    15K15940
  • Fiscal year:
    2015
  • Award amount:
    $610,600
  • Award type:
    Grant-in-Aid for Young Scientists (B)