
CIF: CAREER: Robust, Interpretable, and Efficient Unsupervised Learning with K-set Clustering

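Basic information

Award number:
1845076
Principal investigator:
Laura Balzano
Amount:
About $596,800
Awardee institution:
Regents of the University of Michigan - Ann Arbor
Institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2019
Funding country:
United States
Start and end dates:
2019-05-01 to 2025-04-30
Project status:
Not yet concluded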

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1845076&HistoricalAwards=false
Keywords:
CIF CAREER Robust Interpretable Efficient

Project abstract

Modern machine learning techniques aim to design models and algorithms that allow computers to learn efficiently from vast amounts of previously unexplored data. These problems are called 'unsupervised' because no human-provided information about the data is available to guide the machine learning process. Arguably the two most important unsupervised machine learning tools are dimensionality reduction and clustering. In dimensionality reduction, the algorithm seeks a simple low-dimensional structure that captures the interesting behavior in the data. In clustering, the algorithm seeks to group data points together into meaningful clusters. As increasingly higher-dimensional data are collected about progressively more elaborate physical, biological, and social phenomena, algorithms that aim at both dimensionality reduction and clustering are often highly applicable. However, joint formulations in the literature are often ad hoc and fundamentally unable to operate on real data that have missing elements, corruptions, and heterogeneity, which are critical machine learning challenges for modern data problems. This research project is expected to have broad applicability in data science, and will be demonstrated in two applications: genetics and computer vision. The joint clustering and dimensionality reduction formulation used in this project, called K-set clustering, seeks K "central sets" constrained to have some low-dimensional representation, each of which represents one of K clusters in the data. The formulation is a generalization of K-means, K-subspaces, and principal component analysis, and it naturally leads to several novel problem instances. Given a defined set geometry, the corresponding problem instance is approached from two perspectives: understanding the geometry of that instance of the problem formulation, and learning those geometric models from data.
Three specific examples of the problem formulation will be studied: subspace clustering, variety clustering, and polyhedral set clustering. While each example presents intrinsic and unique challenges, these are just examples of a larger paradigm that is limited only by one's ability to define sets amenable to modeling the geometric structure in data. The formulation allows for interpretable data analysis, with a framework that can readily incorporate missing data and heterogeneous data. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
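
To make the idea of K "central sets" concrete, the following is a minimal sketch of the classical K-subspaces heuristic, which the abstract names as one special case of the K-set clustering formulation. It is not the project's own algorithm. The function name `k_subspaces`, its parameters, and the restriction to linear subspaces through the origin are assumptions made for illustration, and the sketch does not handle missing or corrupted data. It alternates between assigning each point to its nearest subspace and refitting each subspace to its assigned points with an SVD.

```python
import numpy as np


def k_subspaces(X, K, dim, n_iter=50, seed=0):
    """Illustrative K-subspaces heuristic (hypothetical helper, not from the award).

    X is an (n, d) data matrix; each cluster is modeled by a dim-dimensional
    linear subspace through the origin. Returns labels, bases, and the
    objective value recorded at every iteration.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Start from K random orthonormal bases of shape (d, dim).
    bases = [np.linalg.qr(rng.standard_normal((d, dim)))[0] for _ in range(K)]
    labels = np.full(n, -1)
    history = []
    for _ in range(n_iter):
        # Squared distance from every point to every subspace, shape (n, K).
        resid = np.stack(
            [np.sum((X - X @ U @ U.T) ** 2, axis=1) for U in bases], axis=1
        )
        new_labels = resid.argmin(axis=1)
        history.append(float(resid.min(axis=1).sum()))
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for k in range(K):
            pts = X[labels == k]
            if len(pts) >= dim:
                # The top right singular vectors span the best-fit subspace.
                _, _, Vt = np.linalg.svd(pts, full_matrices=False)
                bases[k] = Vt[:dim].T
    return labels, bases, history
```

With `K = 1` the loop reduces to a single uncentered principal component fit, which is one way to see how the formulation described above contains principal component analysis. Like K-means, this alternating scheme can stop at a local minimum, so results may depend on the random initialization.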
