BIGDATA: F: DKA: Collaborative Research: Clustering Algorithms for Data Streams

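Basic Information

  • Award number:
    1447639
  • Principal investigator:
  • Amount:
    $1 million
  • Institution:
  • Institution country:
    United States
  • Award type:
    Standard Grant
  • Fiscal year:
    2014
  • Funding country:
    United States
  • Project period:
    2014-09-01 to 2018-08-31
  • Project status:
    Completed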

Project Abstract

This project will develop novel theoretical methods and algorithms for clustering massive datasets, with applications to astronomy, neuroscience, and natural language processing. Clustering is the process of creating groups of data based on similarities between individual data points. The developed theoretical methods will be used in applications where clustering algorithms are critical and the input data is extremely large. First, new clustering algorithms will be designed to scale and will allow for better cosmological simulations. The simulations involve billions of particles in each snapshot, and existing clustering algorithms based upon a simple friends-of-friends approach do not scale to these cardinalities. Second, this project will advance the computational capabilities in statistical neuroscience by employing clustering algorithms to discover both regular patterns and anomalies in normal and abnormal brain graphs. Finally, this research will explore the important topic of finding anomalies in massive text streams, such as Twitter. In this setting, one is concerned with detecting anomalous bursts in traffic content that share a similar pattern. These bursts might signal an important political event or a natural disaster. This project will support undergraduate and graduate research aimed at developing skills needed for algorithmic work on massive data sets.

There exist numerous heuristics and approximation algorithms for many variants of the clustering problem. However, these methods are often slow or infeasible for applications with massive datasets. This research will improve space and time upper bounds for clustering algorithms in the streaming model. This project will address the k-means and k-median problems in the dynamic streaming model, extend the results on separable data when the input comes from Euclidean space, improve the bounds in the sliding window model, and combine the coresets technique with novel sampling approaches and the method of smooth histograms. The PIs' previous work has already been applied to natural language processing, and this project will expand this direction further and explore the important topic of "First Story Detection." Furthermore, this research will explore the similarities and differences between various sampling and sketching techniques, and how they could be used in large multidimensional astronomical databases, like SDSS (Sloan Digital Sky Survey) SkyServer. These novel approaches will provide major speedups for the execution of large statistical aggregate queries. The new streaming algorithms will be used to find substructure in very large cosmological N-body simulations. For further information see the project web site at: http://www.cs.jhu.edu/~vova

Project Outputs

Journal articles (12)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Streaming symmetric norms via measure concentration
Matrix Norms in Data Streams: Faster, Multi-Pass and Row-Order
  • DOI:
  • Publication date:
    2016-09
  • Journal:
  • Impact factor:
    0
  • Authors:
    V. Braverman;Stephen R. Chestnut;Robert Krauthgamer;Yi Li;David P. Woodruff;Lin F. Yang
  • Corresponding authors:
    V. Braverman;Stephen R. Chestnut;Robert Krauthgamer;Yi Li;David P. Woodruff;Lin F. Yang
Approximate Convex Hull of Data Streams
  • DOI:
    10.4230/lipics.icalp.2018.21
  • Publication date:
    2017-12
  • Journal:
  • Impact factor:
    0
  • Authors:
    Avrim Blum;V. Braverman;Ananya Kumar;Harry Lang;Lin F. Yang
  • Corresponding authors:
    Avrim Blum;V. Braverman;Ananya Kumar;Harry Lang;Lin F. Yang
Scalable streaming tools for analyzing N-body simulations: Finding halos and investigating excursion sets in one pass
  • DOI:
    10.1016/j.ascom.2018.04.003
  • Publication date:
    2017-11
  • Journal:
  • Impact factor:
    0
  • Authors:
    Nikita Ivkin;Zaoxing Liu;Lin F. Yang;Srinivas Suresh Kumar;G. Lemson;M. Neyrinck;A. Szalay;V. Braverman
  • Corresponding authors:
    Nikita Ivkin;Zaoxing Liu;Lin F. Yang;Srinivas Suresh Kumar;G. Lemson;M. Neyrinck;A. Szalay;V. Braverman
Towards Fast and Scalable Graph Pattern Mining

Other publications by Vladimir Braverman

Metric k-median clustering in insertion-only streams
  • DOI:
    10.1016/j.dam.2021.07.025
  • Publication date:
    2021-12-15
  • Journal:
  • Impact factor:
  • Authors:
    Vladimir Braverman;Harry Lang;Keith Levin;Yevgeniy Rudoy
  • Corresponding author:
    Yevgeniy Rudoy
How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?
  • DOI:
    10.48550/arxiv.2310.08391
  • Publication date:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Jingfeng Wu;Difan Zou;Zixiang Chen;Vladimir Braverman;Quanquan Gu;Peter L. Bartlett
  • Corresponding author:
    Peter L. Bartlett
Private Data Stream Analysis for Universal Symmetric Norm Estimation

Other grants held by Vladimir Braverman

Collaborative Research: CNS: Medium: Scalable Learning from Distributed Data for Wireless Network Management
  • Award number:
    2333887
  • Fiscal year:
    2022
  • Award amount:
    $1 million
  • Award type:
    Continuing Grant
CSR: NeTS: Small: In-Network Resource Management for Rack-Scale Computers
  • Award number:
    2244870
  • Fiscal year:
    2022
  • Award amount:
    $1 million
  • Award type:
    Standard Grant
CAREER: New Methods for Central Streaming Problems
  • Award number:
    2244899
  • Fiscal year:
    2022
  • Award amount:
    $1 million
  • Award type:
    Continuing Grant
Collaborative Research: CNS: Medium: Scalable Learning from Distributed Data for Wireless Network Management
  • Award number:
    2107239
  • Fiscal year:
    2021
  • Award amount:
    $1 million
  • Award type:
    Continuing Grant
CSR: NeTS: Small: In-Network Resource Management for Rack-Scale Computers
  • Award number:
    1813487
  • Fiscal year:
    2018
  • Award amount:
    $1 million
  • Award type:
    Standard Grant
CAREER: New Methods for Central Streaming Problems
  • Award number:
    1652257
  • Fiscal year:
    2017
  • Award amount:
    $1 million
  • Award type:
    Continuing Grant
EAGER: Universal Sketches for Network Monitoring
  • Award number:
    1650041
  • Fiscal year:
    2016
  • Award amount:
    $1 million
  • Award type:
    Standard Grant

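Similar NSFC (National Natural Science Foundation of China) grants

Molecular design, synthesis, and anti-HIV activity of DKA-DAPYs as dual HIV-1 reverse transcriptase/integrase inhibitors
  • Award number:
    21402148
  • Year approved:
    2014
  • Award amount:
    CNY 250,000
  • Award type:
    Young Scientists Fund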

Similar overseas grants

BIGDATA: F: DKA: Collaborative Research: Randomized Numerical Linear Algebra (RandNLA) for multi-linear and non-linear data
  • Award number:
    1661760
  • Fiscal year:
    2016
  • Award amount:
    $1 million
  • Award type:
    Standard Grant
BIGDATA: F: DKA: Collaborative Research: High-Dimensional Statistical Machine Learning for Spatio-Temporal Climate Data
  • Award number:
    1664720
  • Fiscal year:
    2016
  • Award amount:
    $1 million
  • Award type:
    Standard Grant
BIGDATA: F: DKA: Collaborative Research: Structured Nearest Neighbor Search in High Dimensions
  • Award number:
    1447473
  • Fiscal year:
    2015
  • Award amount:
    $1 million
  • Award type:
    Standard Grant
BIGDATA: F: DKA: Collaborative Research: Structured Nearest Neighbor Search in High Dimensions
  • Award number:
    1447413
  • Fiscal year:
    2015
  • Award amount:
    $1 million
  • Award type:
    Standard Grant
BIGDATA: F: DKA: Collaborative Research: Structured Nearest Neighbor Search in High Dimensions
  • Award number:
    1447476
  • Fiscal year:
    2015
  • Award amount:
    $1 million
  • Award type:
    Standard Grant