RI: Medium: Extreme Clustering

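Basic information

  • Award number:
    1763618
  • Principal investigator:
  • Amount:
    $1.1039 million
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Standard Grant
  • Fiscal year:
    2018
  • Funding country:
    United States
  • Project period:
    2018-09-01 to 2023-08-31
  • Project status:
    Completed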

Project abstract

Clustering is a fundamental tool for data science, hypothesis discovery, pattern discovery, and information integration. Given a collection of objects, clustering is the task of automatically grouping the objects so that objects within a group (called a cluster) are more similar to each other than to objects in other clusters. Clustering is widely used in medicine, engineering, science, and commerce. Most modern clustering methods scale well to a large number of objects, but not to a large number of clusters. Furthermore, widely used clustering methods excessively assign objects to clusters, suffer in accuracy, and do not represent uncertainty in the clustering. All of these weaknesses limit analysis capabilities in many scientific, engineering, and other high-impact applications. This project is developing new machine learning methods and algorithms for large-scale clustering that scale to both a massive number of objects and a massive number of clusters. The project will build on recent preliminary success with a family of algorithms that build hierarchical clusterings, support efficient re-assignment of data to new clusters, and naturally represent uncertainty. The new research aims to further increase accuracy and scalability. The project team will demonstrate its new research in multiple domains relevant to national priorities, including clustering chemical compounds for material science discovery, clustering single-cell genome data, and entity resolution on scientific metadata (such as paper authors, patent authors, papers, and institutions), creating tools that advance scientific discovery, collaboration, and scientific peer review. All of the software developed as part of this project will be released as open source software in order to facilitate experimentation and adoption of the methods in research and practice. The project team will develop a tutorial at the intersection of machine learning and algorithms, and will additionally teach a course on efficient clustering methods to researchers outside computer science.
This project will develop new research on machine learning and algorithms for hierarchical clustering that scales to both a massive number of input objects, N, and a massive number of clusters, K. This problem setting is termed "extreme clustering," after its similarly motivated supervised cousin, "extreme classification." The project builds on the successes of recent preliminary work on PERCH, a family of algorithms for large-scale, incremental-data, non-greedy, hierarchical clustering that has achieved remarkable new state-of-the-art results. The method efficiently routes new data points to the leaves of an incrementally built tree. Motivated by the desire for both accuracy and speed, the approach performs tree rotations both to enhance subtree purity and to encourage balanced trees. Experiments demonstrate that PERCH constructs more accurate trees than other tree-building clustering algorithms and scales well with both N and K, achieving a higher-quality clustering than the strongest flat clustering competitor in nearly half the time. The project will perform new research (a) improving flexibility through alternative clustering cost functions and data representations, (b) further improving scalability and accuracy through new tree routing functions, (c) developing new tree-cut methods for determining the best clusterings and distributions over clusterings, and (d) inventing new methods for joint clustering of multiple inter-related data instance types. Evaluation and application of the research will be conducted on multiple broad-impact, large-data domains, including biomedicine, material science, image analysis, and scientific information integration.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
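The abstract describes routing each new data point to a leaf of an incrementally built tree. The Python sketch below illustrates only that general idea and is not PERCH itself: it assumes Euclidean distance, finds the nearest leaf by brute force, and omits the tree rotations for purity and balance. The names Node and insert are hypothetical and chosen for illustration.

```python
import math
from typing import Optional, Sequence


class Node:
    """Binary cluster-tree node; only leaves carry a data point."""

    def __init__(self, point: Optional[Sequence[float]] = None):
        self.point = point
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None
        self.parent: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.point is not None

    def leaves(self) -> list:
        """Return all leaf nodes under this node, left to right."""
        if self.is_leaf():
            return [self]
        return self.left.leaves() + self.right.leaves()


def insert(root: Optional[Node], point: Sequence[float]) -> Node:
    """Attach `point` as the sibling of its nearest leaf; return the root.

    Simplification (an assumption of this sketch): brute-force search over
    all leaves, and no rotations after the insertion.
    """
    new_leaf = Node(point)
    if root is None:
        return new_leaf
    nearest = min(root.leaves(), key=lambda leaf: math.dist(leaf.point, point))
    # Replace the nearest leaf with an internal node holding both leaves.
    internal = Node()
    old_parent = nearest.parent
    internal.left, internal.right, internal.parent = nearest, new_leaf, old_parent
    nearest.parent = new_leaf.parent = internal
    if old_parent is None:
        return internal
    if old_parent.left is nearest:
        old_parent.left = internal
    else:
        old_parent.right = internal
    return root


# Build a tree one point at a time from two well-separated groups.
root = None
for p in [(0.0, 0.0), (10.0, 10.0), (0.1, 0.0), (10.0, 10.2)]:
    root = insert(root, p)

# Each subtree of the root is one cluster at the coarsest level of the tree.
for child in (root.left, root.right):
    print(sorted(leaf.point for leaf in child.leaves()))
```

In this toy run the two children of the root separate the points near the origin from the points near (10, 10). With less favorable arrival orders, greedy insertion alone can place points in poor subtrees, which is the kind of error the rotations mentioned in the abstract are meant to repair.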

Project outcomes

Journal articles (23)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Exact and Approximate Hierarchical Clustering with A*
Scalable Hierarchical Clustering with Tree Grafting
DAG-Structured Clustering by Nearest Neighbors
  • DOI:
  • Published:
    2021
  • Journal:
  • Impact factor:
    0
  • Authors:
    Nicholas Monath;M. Zaheer;Kumar Avinava Dubey;Amr Ahmed;A. McCallum
  • Corresponding author:
    Nicholas Monath;M. Zaheer;Kumar Avinava Dubey;Amr Ahmed;A. McCallum
Modeling Transitivity and Cyclicity in Directed Graphs via Binary Code Box Embeddings
  • DOI:
  • Published:
    2022
  • Journal:
  • Impact factor:
    0
  • Authors:
    Dongxu Zhang;Michael Boratko;Cameron Musco;A. McCallum
  • Corresponding author:
    Dongxu Zhang;Michael Boratko;Cameron Musco;A. McCallum
Clustering-based Inference for Zero-Shot Biomedical Entity Linking
  • DOI:
  • Published:
    2020-10
  • Journal:
  • Impact factor:
    0
  • Authors:
    Rico Angell;Nicholas Monath;Sunil Mohan;Nishant Yadav;A. McCallum
  • Corresponding author:
    Rico Angell;Nicholas Monath;Sunil Mohan;Nishant Yadav;A. McCallum

Other publications by Andrew McCallum

An Interoperable Multimedia Catalog System for Electronic Commerce.
  • DOI:
  • Published:
    2000
  • Journal:
  • Impact factor:
    0
  • Authors:
    William W. Cohen;Andrew McCallum;D. Quass
  • Corresponding author:
    D. Quass
ezCoref: A Scalable Approach for Collecting Crowdsourced Annotations for Coreference Resolution
  • DOI:
  • Published:
    2022
  • Journal:
  • Impact factor:
    0
  • Authors:
    A. Crowdsourced;David Bamman;Olivia Lewke;Rachel Bawden;Rico Sennrich;Alexandra Birch;Ari Bornstein;Arie Cattan;Ido Dagan;Hong Chen;Zhenhua Fan;Hao Lu;Alan Yuille;Eduard Hovy;Mitch Marcus;M. Palmer;Lance;Rodney Huddleston. 2002;Frédéric Landragin;T. Poibeau;Bernard Vic;Belinda Z. Li;Gabriel Stanovsky;Robert L Logan;Andrew McCallum;Sameer Singh
  • Corresponding author:
    Sameer Singh
Scaling Within Document Coreference to Long Texts
  • DOI:
  • Published:
    2021
  • Journal:
  • Impact factor:
    0
  • Authors:
    Raghuveer Thirukovalluru;Nicholas Monath;K. Shridhar;M. Zaheer;Mrinmaya Sachan;Andrew McCallum
  • Corresponding author:
    Andrew McCallum
PaRaDe: Passage Ranking using Demonstrations with Large Language Models
  • DOI:
    10.48550/arxiv.2310.14408
  • Published:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    Andrew Drozdov;Honglei Zhuang;Zhuyun Dai;Zhen Qin;Razieh Rahimi;Xuanhui Wang;Dana Alon;Mohit Iyyer;Andrew McCallum;Donald Metzler;Kai Hui
  • Corresponding author:
    Kai Hui
Every Answer Matters: Evaluating Commonsense with Probabilistic Measures
  • DOI:
  • Published:
    2024
  • Journal:
  • Impact factor:
    0
  • Authors:
    Qi Cheng;Michael Boratko;Pranay Kumar Yelugam;T. O’Gorman;Nalini Singh;Andrew McCallum;X. Li
  • Corresponding author:
    X. Li

Other grants held by Andrew McCallum

Collaborative Research: SOS-DCI / HNDS-R: Advancing Semantic Network Analysis to Better Understand How Evaluative Exchanges Shape Scientific Arguments
  • Award number:
    2244805
  • Fiscal year:
    2023
  • Award amount:
    $1.1039 million
  • Award type:
    Standard Grant
RI: Medium: Probabilistic Box Embeddings
  • Award number:
    2106391
  • Fiscal year:
    2021
  • Award amount:
    $1.1039 million
  • Award type:
    Standard Grant
DMREF: Collaborative Research: The Synthesis Genome: Data Mining for Synthesis of New Materials
  • Award number:
    1922090
  • Fiscal year:
    2019
  • Award amount:
    $1.1039 million
  • Award type:
    Standard Grant
DMREF: Collaborative Research: The Synthesis Genome: Data Mining for Synthesis of New Materials
  • Award number:
    1534431
  • Fiscal year:
    2015
  • Award amount:
    $1.1039 million
  • Award type:
    Standard Grant
III: Medium: Constructing Knowledge Bases by Extracting Entity-Relations and Meanings from Natural Language via "Universal Schema"
  • Award number:
    1514053
  • Fiscal year:
    2015
  • Award amount:
    $1.1039 million
  • Award type:
    Continuing Grant
The Fourth Northeast Student Colloquium on Artificial Intelligence
  • Award number:
    1036017
  • Fiscal year:
    2010
  • Award amount:
    $1.1039 million
  • Award type:
    Standard Grant
CI-ADDO-EN: Flexible Machine Learning for Natural Language in the MALLET Toolkit
  • Award number:
    0958392
  • Fiscal year:
    2010
  • Award amount:
    $1.1039 million
  • Award type:
    Continuing Grant
RI-Medium: Collaborative Research: Dynamically-Structured Conditional Random Fields for Complex, Natural Domains
  • Award number:
    0803847
  • Fiscal year:
    2008
  • Award amount:
    $1.1039 million
  • Award type:
    Continuing Grant
CRI: Collaborative Research: Improving Experimental Computer Science with a Searchable Web Portal for Data Sets
  • Award number:
    0551597
  • Fiscal year:
    2006
  • Award amount:
    $1.1039 million
  • Award type:
    Continuing Grant
ITR: Collaborative Research: (ACS+NHS)-(dmc+soc): Machine Learning for Sequences and Structured Data: Tools for Non-Experts
  • Award number:
    0427594
  • Fiscal year:
    2004
  • Award amount:
    $1.1039 million
  • Award type:
    Standard Grant

Similar overseas grants

Collaborative Research: SaTC: CORE: Medium: ONSET: Optics-enabled Network Defenses for Extreme Terabit DDoS Attacks
  • Award number:
    2415754
  • Fiscal year:
    2023
  • Award amount:
    $1.1039 million
  • Award type:
    Standard Grant
Collaborative Research: SaTC: CORE: Medium: ONSET: Optics-enabled Network Defenses for Extreme Terabit DDoS Attacks
  • Award number:
    2132651
  • Fiscal year:
    2022
  • Award amount:
    $1.1039 million
  • Award type:
    Standard Grant
Collaborative Research: CPS Medium: Population Games for Cyber-Physical Systems: New Theory with Tools for Transportation Management under Extreme Demand
  • Award number:
    2135561
  • Fiscal year:
    2022
  • Award amount:
    $1.1039 million
  • Award type:
    Standard Grant
Collaborative Research: SaTC: CORE: Medium: ONSET: Optics-enabled Network Defenses for Extreme Terabit DDoS Attacks
  • Award number:
    2132639
  • Fiscal year:
    2022
  • Award amount:
    $1.1039 million
  • Award type:
    Standard Grant
Collaborative Research: SaTC: CORE: Medium: ONSET: Optics-enabled Network Defenses for Extreme Terabit DDoS Attacks
  • Award number:
    2132643
  • Fiscal year:
    2022
  • Award amount:
    $1.1039 million
  • Award type:
    Standard Grant
Collaborative Research: CPS: Medium: Population Games for Cyber-Physical Systems: New Theory with Tools for Transportation Management under Extreme Demand
  • Award number:
    2135791
  • Fiscal year:
    2022
  • Award amount:
    $1.1039 million
  • Award type:
    Standard Grant
Collaborative Research: A Medium-Band K-band Survey with Gemini to Identify the First Quenching Galaxies and Extreme Episodes of Galaxy Formation
  • Award number:
    2009632
  • Fiscal year:
    2020
  • Award amount:
    $1.1039 million
  • Award type:
    Standard Grant
Collaborative Research: A Medium-Band K-band Survey with Gemini to Identify the First Quenching Galaxies and Extreme Episodes of Galaxy Formation
  • Award number:
    2009442
  • Fiscal year:
    2020
  • Award amount:
    $1.1039 million
  • Award type:
    Standard Grant
SHF: Medium: Collaborative Research: ANACIN-X: Analysis and modeling of Nondeterminism and Associated Costs in eXtreme scale applications
  • Award number:
    1900765
  • Fiscal year:
    2019
  • Award amount:
    $1.1039 million
  • Award type:
    Continuing Grant
SHF: Medium: Collaborative Research: ANACIN-X: Analysis and modeling of Nondeterminism and Associated Costs in eXtreme scale applications
  • Award number:
    1900888
  • Fiscal year:
    2019
  • Award amount:
    $1.1039 million
  • Award type:
    Continuing Grant