III: Small: RUI: Scalable and Iterative Statistical Testing of Multiple Hypotheses on Massive Datasets


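Basic information

  • Grant number:
    2006765
  • Principal investigator:
  • Amount:
    $373,400
  • Host institution:
  • Host institution country:
    United States
  • Project category:
    Standard Grant
  • Fiscal year:
    2020
  • Funding country:
    United States
  • Project period:
    2020-10-01 to 2024-09-30
  • Project status:
    Completed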

Project abstract

Modern scientific practice is rooted in the statistical testing of hypotheses on data. To limit the risk of false discoveries, the tests must offer strict statistical guarantees. The task is very challenging due to the sheer amount of rich data available today, and to the ever-increasing number of complex hypotheses that scientists want to test on the same data. For science to advance, and thereby advance society and human well-being, it is of the foremost importance that scientists are given tools that overcome these challenges. This project will design novel computational methods for statistical hypothesis testing that tackle all the above challenges by combining modern statistical results with recent approaches from the area of knowledge discovery and data mining, a field of computer science dealing with the efficient analysis of data. As part of the educational activities, this project will develop materials for college-level courses to ensure that the next generation of scientists and computer scientists possess the intellectual and practical knowledge needed for statistically sound data analysis and hypothesis testing, using and extending the methods developed in the project. A diverse cohort of undergraduate students will be involved in the research and educational components of the project.

The team of researchers in this project will design and mathematically analyze algorithms to make statistical hypothesis testing iterative and scalable along multiple dimensions. Many existing statistical procedures are already computationally expensive when testing a single hypothesis on moderate-size datasets, and become even more inefficient as the amount of data or the number of hypotheses grows. Along the dimension of data complexity, available tests often lack scalability because they are limited to simple types of data (e.g., binary tables), while fewer methods are available for rich data such as attributed graphs or panel time series. The lack of scalable methods may be due in part to the requirement that hypothesis tests satisfy stringent statistical guarantees (e.g., control of the Family-Wise Error Rate (FWER) and the False Discovery Rate (FDR)) to ensure that the subsequent inference is sound. Additionally, the iterative nature of data analysis practice has been ignored for statistical tests, but considering it is crucial to ensure that these guarantees are satisfied. This project will develop algorithms for the scalable and iterative statistical testing of multiple complex hypotheses on massive rich datasets, while imposing only weak assumptions on the data generation process, and controlling the FWER and the FDR. These results will be achieved by bringing together two areas of computer science research that have had, until now, only very limited points of contact: statistical learning theory and data mining. The novel methods developed in this project will use concepts from the former, such as (local) Rademacher averages, covering numbers, and pseudodimension, to exploit the structure of the class of hypotheses being tested and achieve better sample complexity bounds, which translate to higher statistical power and improved control of the FWER/FDR, even in an iterative data analysis setting. These concepts will be adapted to statistical hypothesis testing and strengthened to fully exploit their practical usefulness, especially on rich datasets and in the presence of dependencies between the data points. The project team will use techniques from the knowledge discovery task of pattern mining to efficiently explore the space of hypotheses and filter out those that are definitely not significant. To reach this goal, the project team will develop novel bounds for the p-value functions of different tests and adapt these techniques to rich datasets such as attributed graphs.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project outcomes

Journal articles (13)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Sharp uniform convergence bounds through empirical centralization
  • DOI:
  • Published:
    2020
  • Journal:
  • Impact factor:
    0
  • Authors:
    Cyrus Cousins; Matteo Riondato
  • Corresponding authors:
    Cyrus Cousins; Matteo Riondato
Alice and the Caterpillar: A more descriptive null model for assessing data mining results
  • DOI:
    10.1109/icdm54844.2022.00052
  • Published:
    2022-11
  • Journal:
  • Impact factor:
    2.7
  • Authors:
    Giulia Preti; G. D. F. Morales; Matteo Riondato
  • Corresponding authors:
    Giulia Preti; G. D. F. Morales; Matteo Riondato
MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining
Bavarian: Betweenness Centrality Approximation with Variance-Aware Rademacher Averages
Statistically-sound Knowledge Discovery from Data

Other publications by Matteo Riondato

The VC-Dimension of SQL Queries and Selectivity Estimation through Sampling
Sampling-Based Data Mining Algorithms: Modern Techniques and Case Studies
  • DOI:
  • Published:
    2014
  • Journal:
  • Impact factor:
    0
  • Authors:
    Matteo Riondato
  • Corresponding author:
    Matteo Riondato
Sharpe Ratio: Estimation, Confidence Intervals, and Hypothesis Testing
  • DOI:
  • Published:
    2018
  • Journal:
  • Impact factor:
    0
  • Authors:
    Matteo Riondato
  • Corresponding author:
    Matteo Riondato
MiSoSouP
MiSoSouP: Mining Interesting Subgroups with Sampling and Pseudodimension

Other grants held by Matteo Riondato

CAREER: Statistically-Sound Knowledge Discovery from Data
  • Grant number:
    2238693
  • Fiscal year:
    2023
  • Funding amount:
    $373,400
  • Project category:
    Continuing Grant
NSF Student Travel Grant for 2019 SIAM International Conference on Data Mining (SDM)
  • Grant number:
    1918446
  • Fiscal year:
    2019
  • Funding amount:
    $373,400
  • Project category:
    Standard Grant

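Similar grants from the National Natural Science Foundation of China (NSFC)

Forensic application of circadian-rhythm small RNAs in estimating bloodstain deposition time
  • Grant number:
  • Approval year:
    2024
  • Funding amount:
    CNY 0
  • Project category:
    Provincial/municipal project
Mechanism by which tRNA-derived small RNA upregulates the YBX1/CCL5 pathway in bortezomib-induced chronic pain
  • Grant number:
    n/a
  • Approval year:
    2022
  • Funding amount:
    CNY 100,000
  • Project category:
    Provincial/municipal project
Small RNA regulation of the type I-F CRISPR-Cas adaptive immune response and its molecular mechanism
  • Grant number:
    32000033
  • Approval year:
    2020
  • Funding amount:
    CNY 240,000
  • Project category:
    Young Scientists Fund
Mechanisms by which small RNAs regulate the biocontrol function of Bacillus amyloliquefaciens FZB42
  • Grant number:
    31972324
  • Approval year:
    2019
  • Funding amount:
    CNY 580,000
  • Project category:
    General Program
Mechanisms by which Streptococcus mutans small RNAs link LuxS quorum sensing to biofilm formation
  • Grant number:
    81900988
  • Approval year:
    2019
  • Funding amount:
    CNY 210,000
  • Project category:
    Young Scientists Fund
Molecular mechanisms of pigeon milk secretion analyzed by small RNA sequencing
  • Grant number:
    31802058
  • Approval year:
    2018
  • Funding amount:
    CNY 260,000
  • Project category:
    Young Scientists Fund
Functions and mechanisms of key gut bacterial small RNAs in the onset and progression of Crohn's disease
  • Grant number:
    31870821
  • Approval year:
    2018
  • Funding amount:
    CNY 560,000
  • Project category:
    General Program
Pathogenic mechanism of rice grassy stunt virus regulated by small RNA-mediated DNA methylation
  • Grant number:
    31772128
  • Approval year:
    2017
  • Funding amount:
    CNY 600,000
  • Project category:
    General Program
Immune regulatory mechanisms of acupuncture treatment for Hashimoto's thyroiditis based on small RNA-seq
  • Grant number:
    81704176
  • Approval year:
    2017
  • Funding amount:
    CNY 200,000
  • Project category:
    Young Scientists Fund
Regulation of small RNA biogenesis by rice OsSGS3 and OsHEN1 and its effect on disease resistance
  • Grant number:
    91640114
  • Approval year:
    2016
  • Funding amount:
    CNY 850,000
  • Project category:
    Major Research Plan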

Similar overseas grants

III: Small: RUI: Designing Structure-Phenotype Query-Retrieval and Analysis Systems for Microscopy-Based Whole Organism Studies
  • Grant number:
    2401096
  • Fiscal year:
    2023
  • Funding amount:
    $373,400
  • Project category:
    Standard Grant
III: Small: RUI: A Fairness Auditing Framework for Predictive Mobility Models
  • Grant number:
    2304213
  • Fiscal year:
    2023
  • Funding amount:
    $373,400
  • Project category:
    Standard Grant
III: Small: RUI: Finding Best Representative Phylogenetic Tree Reconciliations
  • Grant number:
    2231150
  • Fiscal year:
    2022
  • Funding amount:
    $373,400
  • Project category:
    Standard Grant
III: Small: RUI: Collaborative Research: Modeling Pre- and Post- Conditions for Understanding Events
  • Grant number:
    2007128
  • Fiscal year:
    2020
  • Funding amount:
    $373,400
  • Project category:
    Interagency Agreement
III: Small: RUI: Investigating Fragmentation Rules and Improving Metabolite Identification Using Graph Grammar and Statistical Methods
  • Grant number:
    2053286
  • Fiscal year:
    2020
  • Funding amount:
    $373,400
  • Project category:
    Standard Grant
III: Small: RUI: Finding Best Representative Phylogenetic Tree Reconciliations
  • Grant number:
    1905885
  • Fiscal year:
    2019
  • Funding amount:
    $373,400
  • Project category:
    Standard Grant
III: Small: RUI: Investigating Fragmentation Rules and Improving Metabolite Identification Using Graph Grammar and Statistical Methods
  • Grant number:
    1813252
  • Fiscal year:
    2019
  • Funding amount:
    $373,400
  • Project category:
    Standard Grant
III: Small: RUI: Designing Structure-Phenotype Query-Retrieval and Analysis Systems for Microscopy-Based Whole Organism Studies
  • Grant number:
    1817239
  • Fiscal year:
    2018
  • Funding amount:
    $373,400
  • Project category:
    Standard Grant
III: Small: Collaborative Research: RUI: Scalable Schema-Based Event Extraction
  • Grant number:
    1617952
  • Fiscal year:
    2016
  • Funding amount:
    $373,400
  • Project category:
    Interagency Agreement
III: Small: RUI: Efficient Search, Comparison, and Annotation for Biological Sequences
  • Grant number:
    1528027
  • Fiscal year:
    2015
  • Funding amount:
    $373,400
  • Project category:
    Standard Grant