III: Small: RUI: Scalable and Iterative Statistical Testing of Multiple Hypotheses on Massive Datasets

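Basic information

Award number:
2006765
Principal investigator:
Matteo Riondato
Amount:
approx. $373,400
Host institution:
Amherst College
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2020
Funding country:
United States
Start and end dates:
2020-10-01 to 2024-09-30
Project status:
Completed

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2006765&HistoricalAwards=false
Keywords:
III Small RUI Scalable Iterative

Project abstract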

Modern scientific practice is rooted in the statistical testing of hypotheses on data. To limit the risk of false discoveries, the tests must offer strict statistical guarantees. The task is very challenging due to the sheer amount of rich data available today, and to the ever-increasing number of complex hypotheses that scientists want to test on the same data. For science to advance, and thereby advance society and human well-being, it is of foremost importance that scientists are given tools that overcome these challenges. This project will design novel computational methods for statistical hypothesis testing that tackle all the above challenges by combining modern statistical results with recent approaches from the area of knowledge discovery and data mining, a field of computer science dealing with the efficient analysis of data. As part of the educational activities, this project will develop materials for college-level courses to ensure that the next generation of scientists and computer scientists possess the intellectual and practical knowledge needed for a statistically sound analysis of data and testing of hypotheses, by using and extending the methods developed in the project. A diverse cohort of undergraduate students will be involved in the research and educational components of the project.

The team of researchers in this project will design and mathematically analyze algorithms to make statistical hypothesis testing iterative and scalable along multiple dimensions. Many existing statistical procedures are already computationally expensive when testing a single hypothesis on moderate-size datasets, and become even more inefficient as the amount of data or the number of hypotheses grows. Along the dimension of data complexity, available tests often lack scalability because they are limited to simple types of data (e.g., binary tables), while fewer methods are available for rich data such as attributed graphs or panel time-series. The lack of scalable methods may be due in part to the requirement that hypothesis tests satisfy stringent statistical guarantees (e.g., control of the Family-Wise Error Rate (FWER) or the False Discovery Rate (FDR)) to ensure that the subsequent inference is sound. Additionally, the iterative aspect of the practice of data analysis has been ignored for statistical tests, but considering it is crucial to ensure that these guarantees are satisfied. This project will develop algorithms for the scalable and iterative statistical testing of multiple complex hypotheses on massive rich datasets, while imposing only weak assumptions on the data generation process, and controlling the FWER and the FDR. These results will be achieved by bringing together two areas of computer science research that have had, until now, only very limited points of contact: statistical learning theory and data mining. The novel methods developed in this project will use concepts from the former, such as (local) Rademacher averages, covering numbers, and pseudodimension, to exploit the structure of the class of hypotheses being tested and achieve better sample complexity bounds, which translate to higher statistical power and improved control of the FWER/FDR, even in an iterative data analysis setting. These concepts will be adapted to statistical hypothesis testing and strengthened to fully exploit their practical usefulness, especially on rich datasets and in the presence of dependencies between the data points. The project team will use techniques from the knowledge discovery task of pattern mining to efficiently explore the space of hypotheses and filter out those that are definitively not significant. To reach this goal, the project team will develop novel bounds for the p-value functions of different tests and adapt these techniques to rich datasets such as attributed graphs.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
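
The abstract names the Family-Wise Error Rate (FWER) and the False Discovery Rate (FDR) as the guarantees that the tests must control. As a point of reference only, the following minimal sketch shows the two classical baseline procedures for these guarantees: the Bonferroni correction for the FWER and the Benjamini-Hochberg step-up procedure for the FDR. These are not the methods developed in the project; the function names and the example p-values are illustrative assumptions.

```python
from typing import List


def bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Reject H_i when p_i <= alpha / m; controls the FWER at level alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Step-up procedure; controls the FDR at level alpha for independent tests."""
    m = len(p_values)
    # Indices of the hypotheses, sorted by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k such that p_(k) <= k * alpha / m.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff = rank
    # Reject the cutoff hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, made up for illustration.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(sum(bonferroni(p)))          # 1 rejection
print(sum(benjamini_hochberg(p)))  # 2 rejections
```

On this example the FDR-controlling procedure rejects one more hypothesis than the FWER-controlling one, which illustrates the usual trade-off: the FDR is a weaker guarantee and therefore allows higher statistical power.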

Project outcomes

Journal papers (13)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Sharp uniform convergence bounds through empirical centralization

DOI:
Publication date:
2020
Journal:
Impact factor:
0
Authors:
Cyrus Cousins; Matteo Riondato
Corresponding authors:
Cyrus Cousins; Matteo Riondato

Alice and the Caterpillar: A more descriptive null model for assessing data mining results

DOI:
10.1109/icdm54844.2022.00052
Publication date:
2022-11
Journal:
Knowledge and Information Systems
Impact factor:
2.7
Authors:
Giulia Preti; G. D. F. Morales; Matteo Riondato
Corresponding authors:
Giulia Preti; G. D. F. Morales; Matteo Riondato

MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining

DOI:
10.1145/3532187
Publication date:
2022
Journal:
ACM Transactions on Knowledge Discovery from Data
Impact factor:
3.6
Authors:
Pellegrina, Leonardo; Cousins, Cyrus; Vandin, Fabio; Riondato, Matteo
Corresponding author:
Riondato, Matteo

Bavarian: Betweenness Centrality Approximation with Variance-Aware Rademacher Averages

DOI:
10.1145/3447548.3467354
Publication date:
2021-08
Journal:
Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining
Impact factor:
0
Authors:
Cyrus Cousins; Chloe Wohlgemuth; Matteo Riondato
Corresponding authors:
Cyrus Cousins; Chloe Wohlgemuth; Matteo Riondato

Statistically-sound Knowledge Discovery from Data
