BIGDATA: F: Reliable Inference with Big Data: Reproducibility, Data Sharing, Heterogeneity

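Basic information

Award number:
1741162
Principal investigator:
Andrea Montanari
Amount:
About $650,000
Host institution:
Stanford University
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2017
Funding country:
United States
Start and end dates:
2017-09-01 to 2021-08-31
Project status:
Completed

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1741162&HistoricalAwards=false
Keywords:
BIGDATA Reliable Inference Big Data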

Project abstract

Over the last decade, 'big data' technologies have allowed the acquisition of vast amounts of data (e.g. through smartphones) and their accumulation into large-scale databases. Powerful hardware and software systems have been developed to crunch these data and extract statistical models. For instance, the outcome of a certain medical procedure can be modeled in terms of the features of the patient, thus in principle providing a personalized risk score for that procedure. Unfortunately, the increasing complexity of these data and of the algorithms used has made statistical models significantly less transparent. How certain are we of these statistical predictions? What is their limit of validity? How biased is the resulting model?

This project focuses on four main challenges that are ubiquitous in big data and are crucial to extracting reliable insights: reproducibility; data sharing; missing data; data heterogeneity.

(1) Reproducibility requires being able to compare two models extracted from different data sets (e.g. after additional data have been accumulated). This is in turn impossible unless we have reliable procedures to quantify uncertainty and confidence in complex high-dimensional models. Recently proposed ideas in this direction are still insufficient to cope with realistic large-scale applications.

(2) Data sharing is a key feature of modern data analysis, whereby a single massive data set is studied by hundreds of independent researchers. Unguarded statistical inference by such a population of researchers unavoidably leads to large numbers of false discoveries. The project builds on false discovery rate-controlling methods to propose safe approaches for decentralized data analysis.

(3) Missing data are ubiquitous in big data. While several methods have been developed in the past to deal with missing data, it is unclear to what extent they are applicable to modern scenarios. The project aims at developing principled guidelines based on a rigorous comparison of various approaches, and at developing new algorithms based on maximum likelihood.

(4) Data heterogeneity. Big data are often produced by the aggregation of multiple data sources. How can we prevent standard statistical procedures from being critically affected by such heterogeneities? The project uses new regularization schemes to fuse information across multiple sources.
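As background for item (2), the sketch below shows the classical Benjamini-Hochberg step-up rule, one well-known false discovery rate-controlling method. The abstract does not say which procedures the project builds on, so this is only an illustration of the general idea, not the project's method. The function name `benjamini_hochberg` and the example p-values are made up for this sketch.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return one reject decision per hypothesis (True = reject the null).

    Illustrative implementation of the Benjamini-Hochberg step-up rule:
    sort the p-values, find the largest rank k with p_(k) <= k * alpha / m,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest rank whose p-value falls under its rank-dependent threshold.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank

    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Hypothetical p-values, e.g. one per analysis run on a shared data set.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(p_values, alpha=0.05)
```

With these made-up inputs only the two smallest p-values pass their thresholds. A naive rule that rejects every p-value below 0.05 would flag five of the ten hypotheses, which illustrates how uncorrected inference by many analysts inflates the number of discoveries.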

Project outputs

Journal articles (28)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Learning with invariances in random features and kernel models

DOI:
Published:
2021-02
Journal:
ArXiv
Impact factor:
0
Authors:
Song Mei;Theodor Misiakiewicz;A. Montanari
Corresponding author(s):
Song Mei;Theodor Misiakiewicz;A. Montanari

Discussion of: “Nonparametric regression using deep neural networks with ReLU activation function”

DOI:
10.1214/19-aos1910
Published:
2020-08
Journal:
The Annals of Statistics
Impact factor:
0
Authors:
B. Ghorbani;Song Mei;Theodor Misiakiewicz;A. Montanari
Corresponding author(s):
B. Ghorbani;Song Mei;Theodor Misiakiewicz;A. Montanari

When do neural networks outperform kernel methods?

DOI:
10.1088/1742-5468/ac3a81
Published:
2020-06
Journal:
Journal of Statistical Mechanics: Theory and Experiment
Impact factor:
0
Authors:
B. Ghorbani;Song Mei;Theodor Misiakiewicz;A. Montanari
Corresponding author(s):
B. Ghorbani;Song Mei;Theodor Misiakiewicz;A. Montanari

Streaming Belief Propagation for Community Detection

DOI:
Published:
2021-06
Journal:
ArXiv
Impact factor:
0
Authors:
Yuchen Wu;M. Bateni;André Linhares;Filipe Almeida;A. Montanari;A. Norouzi-Fard;Jakab Tardos
Corresponding author(s):
Yuchen Wu;M. Bateni;André Linhares;Filipe Almeida;A. Montanari;A. Norouzi-Fard;Jakab Tardos

Optimization of the Sherrington--Kirkpatrick Hamiltonian

DOI:
10.1137/20m132016x
Published:
2021
Journal:
SIAM Journal on Computing
Impact factor:
1.6
Authors:
Montanari, Andrea
Corresponding author(s):
Montanari, Andrea

Other publications by Andrea Montanari

A sensor-based study on the environmental determinants of sleep in older adults

DOI:
10.1016/j.envres.2025.120874
Published:
2025-06-01
Journal:
ENVIRONMENTAL RESEARCH
Impact factor:
7.700
Authors:
Andrea Montanari;Giovanna Fancello;Cédric Sueur;Yan Kestens;Frank J. van Lenthe;Basile Chaix
Corresponding author(s):
Basile Chaix

Understanding Inverse Scaling and Emergence in Multitask Representation Learning

DOI:
Published:
2024
Journal:
International Conference on Artificial Intelligence and Statistics
Impact factor:
0
Authors:
M. E. Ildiz;Zhe Zhao;Samet Oymak;Xiangyu Chang;Yingcong Li;Christos Thrampoulidis;Lin Chen;Yifei Min;Mikhail Belkin;Aakanksha Chowdhery;Sharan Narang;Jacob Devlin;Maarten Bosma;Gaurav Mishra;Adam Roberts;Liam Collins;Hamed Hassani;M. Soltanolkotabi;Aryan Mokhtari;Sanjay Shakkottai;Provable;Simon S. Du;Wei Hu;S. Kakade;Chelsea Finn;A. Rajeswaran;Deep Ganguli;Danny Hernandez;Liane Lovitt;Amanda Askell;Yu Bai;Anna Chen;Tom Conerly;Nova Dassarma;Dawn Drain;Sheer Nelson El;El Showk;Stanislav Fort;Zac Hatfield;T. Henighan;Scott Johnston;Andy Jones;Nicholas Joseph;Jackson Kernian;Shauna Kravec;Benjamin Mann;Neel Nanda;Kamal Ndousse;Catherine Olsson;D. Amodei;Tom Brown;Jared Ka;Sam McCandlish;Chris Olah;Dario Amodei;Trevor Hastie;Andrea Montanari;Saharon Rosset;Jordan Hoffmann;Sebastian Borgeaud;A. Mensch;Elena Buchatskaya;Trevor Cai;Eliza Rutherford;Diego de;Las Casas;Lisa Anne Hendricks;Johannes Welbl;Aidan Clark;Tom Hennigan;Eric Noland;Katie Millican;George van den Driessche;Bogdan Damoc;Aurelia Guy;Simon Osindero;Karen Si;Erich Elsen;Jack W. Rae;O. Vinyals;Jared Kaplan;B. Chess;R. Child;S. Gray;Alec Radford;Jeffrey Wu;I. R. McKenzie;Alexander Lyzhov;Michael Pieler;Alicia Parrish;Aaron Mueller;Ameya Prabhu;Euan McLean;Aaron Kirtland;Alexis Ross;Alisa Liu;Andrew Gritsevskiy;Daniel Wurgaft;Derik Kauff;Gabriel Recchia;Jiacheng Liu;Joe Cavanagh;Tom Tseng;Xudong Korbak;Yuhui Shen;Zhengping Zhang;Najoung Zhou;Samuel R Kim;Bowman Ethan;Perez;Feng Ruan;Youngtak Sohn
Corresponding author(s):
Youngtak Sohn