
Random Matrices in Multivariate Statistics: Theoretical Developments and Applications

Basic information

Award number:
0605169
Principal investigator:
Noureddine El Karoui
Amount:
About $240,000
Institution:
University of California-Berkeley
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2006
Funding country:
United States
Project period:
2006-07-01 to 2010-06-30
Project status:
Completed

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=0605169&HistoricalAwards=false
Keywords:
Random Matrices; Multivariate Statistics; Theoretical

Abstract

This research program focuses on developing data analysis methods for the new paradigm of high-dimensional problems. The associated theoretical problems concern the eigenvalues of large-dimensional random matrices. Three related directions are of particular interest: 1) furthering our understanding of the spectral properties of the relevant random matrices; 2) making practical use of the results obtained, combined with more classical results from random matrix theory; and 3) finding and contributing to areas of application where this framework is relevant. More specifically, statisticians are now very often faced with "n times p" data matrices X for which p is of the same order of magnitude as n, and p and n are both large. The sample covariance matrix computed from such data is of great importance in a number of applications, as it underlies widely used methods such as principal components analysis. However, the theoretical results that underlie these methods fail to apply in the "large n, large p" setting just described. Hence, a thorough study of sample covariance matrices in this setting is needed. The eigenvalues of such large-dimensional matrices are of particular interest, and the largest and smallest eigenvalues are especially interesting from the point of view of applications. The aim of the study is to obtain central limit-type theorems for these extreme eigenvalues and to use them in statistics, for instance in hypothesis testing and in obtaining a notion of power. A more applied part of this work concerns the efficient use of results from random matrix theory, new and old, to better estimate the eigenvalues of the population covariance, with the ultimate aim of better estimating the whole covariance matrix when p and n are both large.
Technological progress allows us to store and use massive amounts of data about many aspects of our daily lives. An interesting problem is to use this data to understand how certain traits depend on each other. In the stock market, we might be interested in how the behavior of one stock affects the behavior of another; understanding all these interrelationships leads to a measure of the risk taken by investing in portfolios that hold the corresponding stocks. Statisticians have a number of tools for dealing with such interrelationships. We can discover ways to look at the data so that, even if all interrelationships are small or weak, so that each trait "should" not tell us much about any other trait, we might still find combinations of the traits that carry enormous amounts of information. We also know the typical values of these combinations, so we might be able to detect unusual things in the data by looking at it the right way. These statistical techniques have very wide applications in various fields of science, ranging from climatology to genetics and image recognition. Thousands of research papers that use these techniques are published each year. However, the theory that underlies these techniques was created in an era when massive datasets did not exist, because such data could not be stored. This research project focuses on theories, and their applications, that are better suited to handling our current massive datasets. The applications should allow us to see structure where the classical tools fail to see any, and tell us that there is no structure where the classical tools report some. There is also increasing evidence that our standard tools often give very inaccurate results for our standard measures of risk or of the amount of information carried by combinations of traits. It seems that risks might be underestimated and the amount of information overestimated. Part of this research program will be dedicated to measuring how inaccurate the classical results are for large datasets and how a more relevant theory can be used to correct these inaccuracies.
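
The "large n, large p" effect described in the abstract can be seen in a small simulation. The sketch below is an illustration only and is not taken from the project. It assumes i.i.d. standard Gaussian data, so every population eigenvalue equals 1, and it compares the sample eigenvalues with the edges of the Marchenko-Pastur law, a classical random matrix result chosen here as the reference. The function names and the values of n and p are hypothetical.

```python
import numpy as np


def sample_covariance_spectrum(n, p, seed=0):
    """Eigenvalues of the sample covariance of an n x p standard Gaussian matrix.

    Assumption for this sketch: the true mean is known to be zero, so the
    data are not centered, and the population covariance is the identity.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    S = X.T @ X / n                      # p x p sample covariance matrix
    return np.linalg.eigvalsh(S)         # eigenvalues in ascending order


def marchenko_pastur_edges(n, p):
    """Limiting support edges of the spectrum when p/n stays fixed (p <= n)."""
    gamma = p / n
    return (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2


n, p = 400, 200                          # p is of the same order as n
eigs = sample_covariance_spectrum(n, p)
lower_edge, upper_edge = marchenko_pastur_edges(n, p)

print(f"population eigenvalues: all equal to 1")
print(f"sample eigenvalues: min={eigs[0]:.3f}, max={eigs[-1]:.3f}")
print(f"Marchenko-Pastur edges: {lower_edge:.3f}, {upper_edge:.3f}")
```

Although every population eigenvalue is 1, the sample eigenvalues spread over a wide interval, so the largest sample eigenvalue overstates the variance along its direction and the smallest understates it. This is one concrete way in which the classical tools can overestimate information and underestimate risk when p and n are both large.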
