
Statistical Machine Learning Methods for Complex Data Sets


Basic Information

Award number: 1811315
Principal investigator: Kean Ming Tan
Amount: about $120,000
Institution: University of Minnesota-Twin Cities
Institution country: United States
Award type: Standard Grant
Fiscal year: 2018
Funding country: United States
Period: 2018-08-01 to 2019-10-31
Status: Completed

Source: https://www.nsf.gov/awardsearch/showAward?AWD_ID=1811315&HistoricalAwards=false
Keywords: Statistical Machine Learning Methods Complex

Project Abstract

Recent advances in science and technology have generated massive amounts of large-scale data with complex structures, including genomics, neuroimaging, and microbiology data. These large-scale data sets pose significant statistical and computational challenges for data analysis. First, widely used statistical methods yield unstable estimates and do not scale computationally to large data sets. Second, complex data sets often contain outliers, possibly due to measurement error or heavy-tailed random noise. For instance, in genomic studies, the distribution of gene expression levels has been observed to be generally heavy-tailed; that is, the data contain many extremely large values. Classical statistical methods will yield biased estimates and spurious scientific discoveries if these outliers are not taken into account during model estimation and inference. This project aims to develop scalable and robust multivariate statistical methods to address these problems. The investigator combines regularization and statistical optimization techniques to develop novel multivariate statistical methods for analyzing complex high-dimensional data sets. The first part of the project concerns the sparse generalized eigenvalue problem, which arises naturally in many statistical models such as partial least squares, canonical correlation analysis, sufficient dimension reduction, and Fisher's discriminant analysis. The investigator will develop a general framework for solving the sparse generalized eigenvalue problem and make available a wide range of statistical models for analyzing high-dimensional data. Furthermore, the investigator will study the theoretical properties of the sparse generalized eigenvalue problem, which will lead to a better understanding of various statistical models that were previously not well understood in the high-dimensional setting.
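To make the generalized eigenvalue formulation concrete, the sketch below writes Fisher's discriminant analysis as the problem A v = lambda B v, with A the between-class scatter and B the pooled within-class covariance. It is only an illustration under stated assumptions: the data are simulated, the function name top_generalized_eigvec is hypothetical, and the solver is the ordinary dense one. The sparsity-inducing regularization that the project proposes is not shown.

```python
import numpy as np

def top_generalized_eigvec(A, B):
    """Leading solution of A v = lam * B v for symmetric A and positive-definite B.

    Hypothetical helper for illustration; dense and unregularized.
    """
    L = np.linalg.cholesky(B)          # B = L L^T
    Linv = np.linalg.inv(L)
    C = Linv @ A @ Linv.T              # equivalent standard symmetric eigenproblem
    C = (C + C.T) / 2                  # remove round-off asymmetry
    vals, vecs = np.linalg.eigh(C)     # eigenvalues in ascending order
    v = Linv.T @ vecs[:, -1]           # map back: v = L^{-T} u, so v^T B v = 1
    return vals[-1], v

# Simulated two-class data (an assumption): classes differ only in coordinate 0.
rng = np.random.default_rng(0)
n, p = 200, 5
X0 = rng.normal(size=(n, p))
X1 = rng.normal(size=(n, p))
X1[:, 0] += 3.0

d = (X1.mean(axis=0) - X0.mean(axis=0))[:, None]
A = d @ d.T                                                     # between-class scatter
B = (np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)) / 2   # pooled within-class covariance

lam, v = top_generalized_eigvec(A, B)
residual = np.linalg.norm(A @ v - lam * (B @ v))
print("discriminant direction:", np.round(v, 3))
print("residual of A v = lam B v:", residual)
```

In the high-dimensional setting the abstract refers to, B is typically singular or ill-conditioned when p exceeds n, so this Cholesky-based approach breaks down; that is one motivation for a sparse formulation.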
The second part of the research project focuses on a class of robust sparse reduced rank regression models. The investigator will develop efficient algorithms and high-dimensional asymptotic analysis for the resulting estimators under the Huber loss function, and quantify the bias-robustness tradeoff between using the Huber loss and the squared error loss. This research project will also deliver easy-to-use software packages for fitting the developed methods. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
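The contrast between the Huber loss and the squared error loss mentioned in the abstract can be illustrated with a small simulation. The sketch below fits an ordinary linear regression (not the sparse reduced rank model of the project) by least squares and by Huber regression using iteratively reweighted least squares. The simulated data, the contamination scheme, the threshold tau = 1.345, and the function names are all assumptions made for illustration.

```python
import numpy as np

def huber_loss(r, tau):
    """Quadratic for |r| <= tau, linear beyond tau."""
    a = np.abs(r)
    return np.where(a <= tau, 0.5 * r**2, tau * a - 0.5 * tau**2)

def huber_regression(X, y, tau, n_iter=100):
    """Huber regression by iteratively reweighted least squares (illustrative)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]     # start from least squares
    for _ in range(n_iter):
        r = y - X @ beta
        # Weight 1 for small residuals, tau/|r| for large ones.
        w = np.minimum(1.0, tau / np.maximum(np.abs(r), 1e-12))
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)  # weighted normal equations
    return beta

# Simulated data (an assumption): 10% of responses are shifted by a large amount.
rng = np.random.default_rng(1)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=n)
y[:30] += 100.0                                     # gross outliers

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_huber = huber_regression(X, y, tau=1.345)

err_ols = np.linalg.norm(beta_ols - beta_true)
err_huber = np.linalg.norm(beta_huber - beta_true)
print("least squares error:", err_ols)
print("Huber error:", err_huber)
```

A smaller tau makes the fit more robust to outliers but introduces more bias when the noise is asymmetric, while a very large tau recovers least squares; this is the kind of tradeoff the abstract proposes to quantify.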
