AF: MEDIUM: Collaborative Research: Foundations of Adaptive Data Analysis

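Basic Information

Award number:
1763314
Principal investigator:
AARON ROTH
Amount:
Approx. $378,000
Host institution:
University of Pennsylvania
Host institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2018
Funding country:
United States
Start and end dates:
2018-03-01 to 2022-02-28
Project status:
Completed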

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1763314&HistoricalAwards=false
Keywords:
AF MEDIUM Collaborative Research Foundations

Project Abstract

Classical tools for rigorously analyzing data make the assumption that the analysis is static: the models and the hypotheses to be tested are fixed independently of the data, and preliminary analysis of the data does not feed back into the data gathering procedure. On the other hand, modern data analysis is highly adaptive. Large parts of modern machine learning perform model selection as a function of the data by iteratively tuning hyper-parameters, and exploratory data analysis is conducted to suggest hypotheses, which are then validated on the same data sets used to discover them. This kind of adaptivity is often referred to as p-hacking, and is blamed in part for the surprising prevalence of non-reproducible science in some empirical fields. This project aims to develop rigorous tools and methodologies to perform statistically valid data analysis in the adaptive setting, drawing on techniques from statistics, information theory, differential privacy, and stable algorithm design. The technical goals of this project include developing: 1) information-theoretic measures that characterize the degree to which a worst-case data analysis can over-fit, given an interaction with a dataset; 2) models for data analysts that move beyond the worst-case setting; and 3) empirical investigations that bridge the gap between theory and practice. The problem of adaptive data analysis (also called post-selection inference, or selective inference) has attracted attention in both computer science and statistics over the past several years, but from relatively disjoint communities. Part of the aim of this project is to integrate these two lines of work. The team of researchers on this project spans departments of computer science, statistics, and biomedical data science. In addition to attempting to unify these two areas, the broader impacts of this research will be to make science more reliable, and reduce the prevalence of "over-fitting" and "false discovery."
The project also has a significant outreach and education component, and will educate graduate students, organize workshops, and produce expository materials. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
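
The over-fitting risk described in the abstract, where hypotheses are validated on the same data used to discover them, can be illustrated with a small simulation. The sketch below is a toy example and not the project's method: the data are pure noise, and every name and parameter in it (`make_noise_data`, `run_experiment`, the sample and feature counts) is a hypothetical choice made for illustration. An "analyst" picks the features that look most correlated with the labels and then scores the resulting classifier twice, once on the data used for selection and once on fresh data.

```python
import random


def make_noise_data(rng, n_samples, n_features):
    """Features and labels are independent fair coin flips in {-1, +1}."""
    X = [[rng.choice((-1, 1)) for _ in range(n_features)]
         for _ in range(n_samples)]
    y = [rng.choice((-1, 1)) for _ in range(n_samples)]
    return X, y


def accuracy(X, y, selected):
    """Accuracy of a signed majority vote over the selected features.

    `selected` is a list of (feature_index, sign) pairs. With an odd
    number of features the vote is never zero, so there are no ties.
    """
    correct = 0
    for row, label in zip(X, y):
        vote = sum(sign * row[j] for j, sign in selected)
        correct += (vote > 0) == (label > 0)
    return correct / len(y)


def run_experiment(seed=0, n_samples=200, n_features=500, n_selected=21):
    """Return (accuracy on the selection data, accuracy on fresh data)."""
    rng = random.Random(seed)
    X, y = make_noise_data(rng, n_samples, n_features)

    # Adaptive step: use the data itself to decide which features to keep.
    corr = [
        sum(row[j] * label for row, label in zip(X, y)) / n_samples
        for j in range(n_features)
    ]
    top = sorted(range(n_features), key=lambda j: abs(corr[j]),
                 reverse=True)[:n_selected]
    selected = [(j, 1 if corr[j] >= 0 else -1) for j in top]

    # "Validation" on the same data that suggested the hypothesis.
    same_data_acc = accuracy(X, y, selected)

    # Validation on fresh data drawn from the same (pure noise) source.
    X_new, y_new = make_noise_data(rng, 1000, n_features)
    fresh_data_acc = accuracy(X_new, y_new, selected)
    return same_data_acc, fresh_data_acc


if __name__ == "__main__":
    same_acc, fresh_acc = run_experiment()
    print(f"accuracy on the data used for selection: {same_acc:.2f}")
    print(f"accuracy on fresh data:                  {fresh_acc:.2f}")
```

Because the labels are independent of every feature, no classifier can truly beat chance here. Reusing the selection data nonetheless tends to report an accuracy well above 50%, while the fresh-data score stays near 50%; the gap between the two is the kind of adaptive over-fitting the project sets out to quantify and control.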
