Foundations of Unsupervised and Weakly Supervised Learning

Basic Information

Grant number:
RGPIN-2019-06018
Principal investigator:
Ashtiani, Hassan
Amount:
$16,800
Host institution:
McMaster University
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2022
Funding country:
Canada
Duration:
2022-01-01 to 2023-12-31
Project status:
Completed

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=750423
Keywords:
Foundations Unsupervised Weakly Supervised Learning

Project Abstract

The development of computational tools that can extract useful structures from the ever-growing sources of data is transforming the way complex systems are analyzed or engineered. The constant push to use machine learning in new applications poses new theoretical and practical challenges, as the fundamental assumptions under which the standard machine learning methods work are no longer valid. While practitioners often resort to case-specific solutions in these situations as the first line of defense, it is essential to establish more generic and reliable design principles. Our plan is to address this for a number of related unsupervised (and weakly supervised) learning problems: we aim at mathematically characterizing the success of learning methods in those settings.

The significance of unsupervised learning stems from the fact that these methods can utilize unannotated data (which is abundant in many domains). Unsupervised learning problems arise frequently in science and engineering (e.g., in exploratory data analysis, recommender systems, speech/text/image generation, and medical imaging). From the theoretical point of view, however, many areas of unsupervised learning are still under-developed compared to supervised learning, and heuristic methods are routinely adopted without offering meaningful user-level guarantees. We aim at addressing this shortcoming by formulating and analyzing a set of unsupervised learning paradigms, and providing provably efficient methods (in terms of computational and/or statistical complexities) for solving them. We will work on these directions:

Systematic Model Selection Schemes for Clustering. Clustering methods are widely used in practice, but for the basic question of "which clustering method is good for my use case?" it is hard to find an established solution beyond trial and error. To address this, we seek to exploit domain knowledge in a principled way (e.g., interactive clustering or clustering with weak supervision).

Efficient Learning and Testing of Distributions. Learning and testing an unknown distribution (given a sample generated from it) are classic problems in statistics. Fresh challenges, however, are arising from the need for handling high-dimensional data. We seek to develop methods that are not only statistically efficient but are also computationally tractable. Also, motivated by applications such as natural-language/speech generation, we study distribution learning/testing with respect to fundamentally different distance measures (e.g., adversarial distances).

Supervised Learning with Scarce Training Data. In many applications (e.g., medical diagnosis) it is costly or even impossible to annotate the training data. Therefore, we seek to develop solutions that are less demanding in terms of annotated training data (using, e.g., unsupervised and weakly supervised learning). We also study the "effective sample complexity" of learning particularly for deep neural networks.
