Distributed Algorithms for Topic Models with Applications to Streaming Document Data and Cancer Genomics

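Basic information

  • Award number:
    1854476
  • Principal investigator:
  • Amount:
    $350,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2019
  • Funding country:
    United States
  • Project period:
    2019-08-15 to 2023-07-31
  • Project status:
    Completed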

Project abstract

There is a growing need for methods that analyze and organize large collections of electronic information, for example a collection of papers in some scientific field or on some medical question, or a collection of news articles from the New York Times. Traditional keyword-based searches are very fast but have important deficiencies. Suppose we are interested in articles that deal with heart attacks. A search using the keywords "heart attack" will not return articles that use "myocardial infarction", the medical term for "heart attack". A newer and more powerful approach to analyzing and organizing large collections of electronic documents, and to document retrieval, is the use of so-called topic models. These models work by identifying the hidden topics in the collection (in the New York Times example, these might be Sports, World Events, Politics, etc.) and by also identifying the topics that each document deals with. By far the most commonly used topic model is the Latent Dirichlet Allocation (LDA) model. Unfortunately, all accurate implementations are slow: the algorithms list all the words in all the documents in some order and then carry out a calculation for each word. This is done sequentially: the calculation for a given word cannot be carried out before the calculations for all previous words have been completed. This project will develop a class of algorithms that work in parallel, taking advantage of the massive distributed computation that is now available on multi-core platforms. Thus, for example, if 1000 processors are available, these algorithms work 1000 times faster than existing algorithms. These new algorithms will enable the use of the LDA model on very large collections of documents. Topics can be used to cluster documents into groups. However, at its core, LDA is a "multi-membership model": a New York Times article about NFL football players kneeling during the national anthem belongs in the Sports section and also in the Politics section. Multi-membership models arise in areas other than document retrieval and classification. For example, in cancer genomics, a tumor can potentially be classified as a member of several cancer subtypes. This project will develop other multi-membership models, designed to handle non-textual data, as in the cancer genomics example above, and it will develop parallel algorithms to handle these models. The output of this project will enable researchers to handle massive collections of documents and medical data.

LDA and other multi-membership models are inherently Bayesian models, in which the topics and the topic memberships of each document are unknown parameters. Posterior distributions are generally estimated by Markov chain Monte Carlo, which has proven convergence guarantees. This project will develop grouped Gibbs samplers, which update variables in groups, where all the variables within a group can be updated simultaneously. The project will also develop practical convergence diagnostics as well as theoretical results on the rates of convergence of the new algorithms. The theoretical results will enable the user to determine how long the Markov chains need to be run in order to provide a required level of accuracy.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
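The idea of updating variables in groups can be illustrated with a small sketch. The code below is not the project's algorithm; it is the standard uncollapsed Gibbs sampler for LDA, written under assumed names (`grouped_gibbs_lda`, `sample_dirichlet`) and assumed symmetric Dirichlet hyperparameters `alpha` and `eta`. Given the current word-topic assignments, the topic-word distributions are independent across topics and the document-topic distributions are independent across documents; given those distributions, every word's topic is independent of every other word's. The loops inside each group therefore have no sequential dependence and could in principle be spread over many processors, although this sketch runs them one after another.

```python
import random


def sample_dirichlet(params, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]


def grouped_gibbs_lda(docs, n_topics, vocab_size,
                      alpha=0.5, eta=0.5, n_iters=50, seed=0):
    """Uncollapsed Gibbs sampler for LDA (illustrative sketch only).

    docs is a list of documents, each a list of integer word ids in
    range(vocab_size). alpha and eta are assumed symmetric Dirichlet
    hyperparameters for document-topic and topic-word distributions.
    """
    rng = random.Random(seed)
    topics = range(n_topics)
    # z[d][i] is the topic assigned to the i-th word of document d.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    theta, beta = [], []
    for _ in range(n_iters):
        # Group 1: topic-word distributions, independent across topics.
        word_counts = [[eta] * vocab_size for _ in topics]
        for doc, z_doc in zip(docs, z):
            for word, topic in zip(doc, z_doc):
                word_counts[topic][word] += 1
        beta = [sample_dirichlet(row, rng) for row in word_counts]

        # Group 2: document-topic distributions, independent across documents.
        theta = []
        for z_doc in z:
            topic_counts = [alpha] * n_topics
            for topic in z_doc:
                topic_counts[topic] += 1
            theta.append(sample_dirichlet(topic_counts, rng))

        # Group 3: topic of every word, independent across all words
        # once theta and beta are fixed.
        z = [
            [
                rng.choices(
                    topics,
                    weights=[theta[d][k] * beta[k][word] for k in topics],
                )[0]
                for word in doc
            ]
            for d, doc in enumerate(docs)
        ]
    return z, theta, beta


if __name__ == "__main__":
    # Toy corpus: word ids 0-1 and 2-3 tend to occur together.
    toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
    assignments, doc_topics, topic_words = grouped_gibbs_lda(
        toy_docs, n_topics=2, vocab_size=4, n_iters=20, seed=1
    )
    print(assignments)
```

By contrast, the widely used collapsed Gibbs sampler integrates out the two sets of distributions and updates one word at a time, each update depending on the counts left by the previous ones, which is the sequential bottleneck the abstract describes.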

Project outcomes

Journal articles (7)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Scalable Hyperparameter Selection for Latent Dirichlet Allocation
Change Point Estimation in a Dynamic Stochastic Block Model
  • DOI:
  • Publication date:
    2018-12
  • Journal:
  • Impact factor:
    0
  • Authors:
    M. Bhattacharjee;M. Banerjee;G. Michailidis
  • Corresponding authors:
    M. Bhattacharjee;M. Banerjee;G. Michailidis
System Identification of High-Dimensional Linear Dynamical Systems With Serially Correlated Output Noise Components
Regularized high dimension low tubal-rank tensor regression
  • DOI:
    10.1214/22-ejs2004
  • Publication date:
    2022-01
  • Journal:
  • Impact factor:
    1.1
  • Authors:
    S. Roy;G. Michailidis
  • Corresponding authors:
    S. Roy;G. Michailidis
Sequential change-point detection in high-dimensional Gaussian graphical models
  • DOI:
  • Publication date:
    2018-06
  • Journal:
  • Impact factor:
    0
  • Authors:
    Hossein Keshavarz;G. Michailidis;Y. Atchadé
  • Corresponding authors:
    Hossein Keshavarz;G. Michailidis;Y. Atchadé

Other publications by Hani Doss

Bias Reduction When There Is No Unbiased Estimate
  • DOI:
    10.1214/aos/1176347028
  • Publication date:
    1989
  • Journal:
  • Impact factor:
    4.5
  • Authors:
    Hani Doss;J. Sethuraman
  • Corresponding author:
    J. Sethuraman
Confidence Bands for the Median Survival Time as a Function of the Covariates in the Cox Model
Hyperparameter and Model Selection for Nonparametric Bayes Problems via Radon-Nikodym Derivatives
  • DOI:
    10.5705/ss.2009.259
  • Publication date:
    2012
  • Journal:
  • Impact factor:
    1.4
  • Authors:
    Hani Doss
  • Corresponding author:
    Hani Doss
Discussion on the paper by Kong, McCullagh, Meng, Nicolae and Tan
An Elementary Approach to Weak Convergence for Quantile Processes, with Applications to Censored Survival Data
  • DOI:
  • Publication date:
    1992
  • Journal:
  • Impact factor:
    0
  • Authors:
    Hani Doss;R. Gill
  • Corresponding author:
    R. Gill

Other grants held by Hani Doss

Workshop on New Directions in Monte Carlo Methods
  • Award number:
    1241502
  • Fiscal year:
    2012
  • Award amount:
    $350,000
  • Project type:
    Standard Grant
2008 Workshop on Bayesian Model Selection and Objective Methods
  • Award number:
    0742079
  • Fiscal year:
    2007
  • Award amount:
    $350,000
  • Project type:
    Standard Grant

Similar overseas grants

DMS-EPSRC: Asymptotic Analysis of Online Training Algorithms in Machine Learning: Recurrent, Graphical, and Deep Neural Networks
  • Award number:
    EP/Y029089/1
  • Fiscal year:
    2024
  • Award amount:
    $350,000
  • Project type:
    Research Grant
CAREER: Blessing of Nonconvexity in Machine Learning - Landscape Analysis and Efficient Algorithms
  • Award number:
    2337776
  • Fiscal year:
    2024
  • Award amount:
    $350,000
  • Project type:
    Continuing Grant
CAREER: From Dynamic Algorithms to Fast Optimization and Back
  • Award number:
    2338816
  • Fiscal year:
    2024
  • Award amount:
    $350,000
  • Project type:
    Continuing Grant
CAREER: Structured Minimax Optimization: Theory, Algorithms, and Applications in Robust Learning
  • Award number:
    2338846
  • Fiscal year:
    2024
  • Award amount:
    $350,000
  • Project type:
    Continuing Grant
CRII: SaTC: Reliable Hardware Architectures Against Side-Channel Attacks for Post-Quantum Cryptographic Algorithms
  • Award number:
    2348261
  • Fiscal year:
    2024
  • Award amount:
    $350,000
  • Project type:
    Standard Grant
CRII: AF: The Impact of Knowledge on the Performance of Distributed Algorithms
  • Award number:
    2348346
  • Fiscal year:
    2024
  • Award amount:
    $350,000
  • Project type:
    Standard Grant
CRII: CSR: From Bloom Filters to Noise Reduction Streaming Algorithms
  • Award number:
    2348457
  • Fiscal year:
    2024
  • Award amount:
    $350,000
  • Project type:
    Standard Grant
EAGER: Search-Accelerated Markov Chain Monte Carlo Algorithms for Bayesian Neural Networks and Trillion-Dimensional Problems
  • Award number:
    2404989
  • Fiscal year:
    2024
  • Award amount:
    $350,000
  • Project type:
    Standard Grant
CAREER: Efficient Algorithms for Modern Computer Architecture
  • Award number:
    2339310
  • Fiscal year:
    2024
  • Award amount:
    $350,000
  • Project type:
    Continuing Grant
CAREER: Improving Real-world Performance of AI Biosignal Algorithms
  • Award number:
    2339669
  • Fiscal year:
    2024
  • Award amount:
    $350,000
  • Project type:
    Continuing Grant