
CIF: Small: Learning Low-Dimensional Representations with Heteroscedastic Data Sources


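Basic information

Award number:
2331590
Principal investigator:
Laura Balzano
Amount:
$600,000
Host institution:
Regents of the University of Michigan - Ann Arbor
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2024
Funding country:
United States
Period:
2024-01-01 to 2026-12-31
Status:
Not yet concluded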

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2331590&HistoricalAwards=false
Keywords:
CIF Small Learning Low Dimensional

Abstract

As data-collection efforts continue to grow, so does heterogeneity in data. Machine-learning methods typically assume that data come from a single source or from uniform instrumentation, with noise characteristics that are the same for every data point. This project will address questions fundamental to learning low-dimensional data representations from heteroscedastic data, in which samples from different sources have additive noise of different variances. It is well known that classical linear dimensionality-reduction methods such as principal component analysis (PCA) are sensitive to outliers, so high-variance noise will degrade the representations learned by PCA. However, robust methods that simply reject outliers are suboptimal if the data do carry some signal, even if it is buried in noise. The premise of this project is therefore to use approaches that learn the best way to incorporate the contribution of every data source, whether high- or low-quality, to improve the overall learned representation. Many applications will benefit from the work, including medical imaging, environmental monitoring, astronomical data analysis, computer vision, and bioinformatics. The investigators' prior work in this area indicates that when learning is driven by heterogeneous and heteroscedastic sources (for example, in medical imaging, data from multiple scanners or acquired at varying radiation levels), a better model will be learned by actively considering and modeling the heterogeneity. How to optimize learning in the face of such heterogeneity has so far been relatively unstudied, and this research aims to fill that gap.

The technical contributions will be in three directions. First, the team of researchers will study open questions regarding how heterogeneity in data affects PCA, including establishing the required sample complexity for learning heteroscedastic models and assessing the optimization landscape of heteroscedastic PCA problems. Second, the team will extend heteroscedastic PCA methods and theory to consider union-of-subspaces models, dictionary learning models, and transform learning models. Third, the investigators will consider how nonlinear low-dimensional embedding methods are affected by heteroscedasticity in the data. The work will focus on distance-based methods and develop a foundational understanding of using distances in machine learning with heterogeneous data sources.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
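To make the setting in the abstract concrete, the following is a minimal sketch, not the investigators' method. It simulates samples from a shared low-dimensional subspace with two noise levels, then compares plain PCA against a PCA in which each sample is weighted by the inverse of its noise variance. Everything here is an assumption chosen for illustration: the dimensions, the noise variances, the inverse-variance weights, the hypothetical helper names `simulate`, `top_k_subspace`, and `subspace_error`, and the premise that the per-sample noise variances are known in advance.

```python
import numpy as np


def simulate(seed=0, d=50, k=2, n_clean=200, n_noisy=800,
             var_clean=0.01, var_noisy=100.0, signal_scale=2.0):
    """Draw samples y = U z + noise, with a different noise variance per source."""
    rng = np.random.default_rng(seed)
    # Random d x k orthonormal basis for the true subspace (illustrative).
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))
    n = n_clean + n_noisy
    # Per-sample noise variance: a small clean group, a large noisy group.
    noise_var = np.concatenate([np.full(n_clean, var_clean),
                                np.full(n_noisy, var_noisy)])
    Z = signal_scale * rng.standard_normal((n, k))            # latent coefficients
    noise = rng.standard_normal((n, d)) * np.sqrt(noise_var)[:, None]
    Y = Z @ U.T + noise                                       # n x d, zero mean by construction
    return U, Y, noise_var


def top_k_subspace(Y, k, weights=None):
    """Orthonormal basis (d x k) of the top-k eigenvectors of the
    (optionally weighted) second-moment matrix of the rows of Y."""
    n = Y.shape[0]
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    cov = (Y * w[:, None]).T @ Y / w.sum()
    _, eigvecs = np.linalg.eigh(cov)      # eigenvalues come back in ascending order
    return eigvecs[:, -k:]


def subspace_error(U_true, U_hat):
    """Normalized Frobenius distance between projectors; 0 means identical subspaces."""
    k = U_true.shape[1]
    diff = U_true @ U_true.T - U_hat @ U_hat.T
    return np.linalg.norm(diff, "fro") / np.sqrt(2 * k)


U, Y, noise_var = simulate()
err_plain = subspace_error(U, top_k_subspace(Y, 2))
# Assumption: noise variances are known, and inverse-variance weights are used.
err_weighted = subspace_error(U, top_k_subspace(Y, 2, weights=1.0 / noise_var))
print(f"plain PCA subspace error:    {err_plain:.3f}")
print(f"weighted PCA subspace error: {err_weighted:.3f}")
```

In this toy configuration the noisy group dominates the unweighted covariance, so plain PCA recovers the subspace poorly, while the weighted variant leans on the clean samples without discarding the noisy ones entirely. Inverse-variance weighting is only one plausible choice. Which weights are best, and how to proceed when the variances are unknown, are among the kinds of questions the abstract describes.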
