Robust and Efficient Statistical Inference in Large Scale Semi-Supervised Settings

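Basic information

  • Award number:
    2113768
  • Principal investigator:
  • Amount:
    $170,000
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Continuing Grant
  • Fiscal year:
    2021
  • Funding country:
    United States
  • Period:
    2021-08-01 to 2024-07-31
  • Project status:
    Completed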

Project abstract

This project will develop methods for robust statistical inference in semi-supervised settings. Unlike more traditional data settings, semi-supervised settings are characterized by two types of available data: 1) a typically small or moderately sized labeled (or supervised) dataset containing observations for a response (or outcome) and a set of covariates (or predictors), and 2) a much larger unlabeled (or unsupervised) dataset having observations only for the covariates. Such settings arise naturally whenever the covariates are easily available for a large cohort, while the response may be difficult and/or expensive to obtain due to practical constraints. These are increasingly relevant in modern studies in the big data era, with large unlabeled databases (often electronically recorded) becoming easily available (and tractable) on top of labeled data. Examples are ubiquitous across many disciplines, including computer science, machine learning, econometrics, and biomedical applications like electronic health records and integrative genomics. Statistical inference in semi-supervised settings is therefore of substantial interest. The ultimate question here is to investigate when and how one can use the extra information available from the large unlabeled data to “improve” upon a corresponding supervised approach, where improvement could be in terms of efficiency or robustness or both. This project aims to provide answers to such questions by developing a class of novel, provable and scalable semi-supervised inference methods for a range of fundamental problems in two fairly distinct and active research areas: 1) causal inference in semi-supervised settings, and 2) semi-supervised inference in the presence of selection bias in labeling. The research outlined in the project will lead to advances in bridging some major gaps in the existing literature and providing a much-needed unified understanding of semi-supervised inference and its subtleties. The methods will also have wide applicability to various domain areas, e.g. biomedical studies for precision medicine and causal inference. The project also has a significant education component, including mentoring of graduate students and curriculum development via short courses to raise awareness about these exciting new areas in modern statistics.

In the first part of the project, the PI will consider causal inference in semi-supervised settings under the potential outcome framework, and explore semi-supervised inference for popular causal parameters, e.g. the average treatment effect and the quantile treatment effect, both of which have been widely studied in supervised settings but rarely so under semi-supervised settings. The PI will aim to develop semi-supervised methods for so-called doubly robust estimation of such parameters that can lead to improved (if not optimal) efficiency, as well as much stronger robustness properties than their best achievable supervised counterparts. The second part of the project will consider semi-supervised inference where the labeling mechanism has inherent selection bias, thus making the labeled and unlabeled data unequally distributed. Such settings, while of great practical relevance, have rarely been addressed so far, partly because their analysis is quite challenging: the labeling fraction decays to zero, leading to a natural violation of the so-called positivity/overlap assumption. Under this setting, the PI will explore efficient and rate-optimal semi-supervised inference for various parameters, e.g. the mean response and the average treatment effect (under a causal framework), via doubly robust estimation methods, as well as modeling strategies for estimating the decaying propensity score, which arises as an inevitable challenge and is of independent interest. Throughout, the PI's emphasis will be on developing methods with rigorous theoretical guarantees as well as efficient implementation that meets the scalability demanded by the intended applications on large modern datasets. The proposed methods will also bring together a synergy of tools and ideas from classical semi-parametric inference and modern high dimensional statistics theory.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project outcomes

Journal articles (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Double robust semi-supervised inference for the mean: selection bias under MAR labeling with decaying overlap

Other publications by Abhishek Chakrabortty

A Poisson regression model for association mapping of count phenotypes
  • DOI:
    10.1186/1755-8166-7-s1-o1
  • Published:
    2014-01-21
  • Journal:
  • Impact factor:
    1.400
  • Authors:
    Saurabh Ghosh; Abhishek Chakrabortty
  • Corresponding author:
    Abhishek Chakrabortty
Surrogate Aided Unsupervised Recovery of Sparse Signals in Single Index Models for Binary Outcomes
  • DOI:
  • Published:
    2017
  • Journal:
  • Impact factor:
    0
  • Authors:
    Abhishek Chakrabortty; Matey Neykov; Ray Carroll; T. Cai
  • Corresponding author:
    T. Cai
Robust Semi-Parametric Inference in Semi-Supervised Settings
  • DOI:
  • Published:
    2016-05
  • Journal:
  • Impact factor:
    0
  • Authors:
    Abhishek Chakrabortty
  • Corresponding author:
    Abhishek Chakrabortty
Semi-supervised estimation of covariance with application to phenome-wide association studies with electronic medical records data

Similar overseas grants

Robust and efficient statistical learning algorithms with applications in actuarial science
  • Award number:
    RGPIN-2020-07064
  • Fiscal year:
    2022
  • Funding amount:
    $170,000
  • Project type:
    Discovery Grants Program - Individual
Robust and efficient statistical learning algorithms with applications in actuarial science
  • Award number:
    RGPIN-2020-07064
  • Fiscal year:
    2021
  • Funding amount:
    $170,000
  • Project type:
    Discovery Grants Program - Individual
CAREER: Robust and Efficient Algorithms for Statistical Estimation and Inference
  • Award number:
    2045068
  • Fiscal year:
    2021
  • Funding amount:
    $170,000
  • Project type:
    Continuing Grant
A Robust and Efficient Statistical Framework for Handling Missing-Not-At-Random Data in Patient Reported Outcomes and Beyond
  • Award number:
    2122074
  • Fiscal year:
    2021
  • Funding amount:
    $170,000
  • Project type:
    Continuing Grant
Robust and efficient statistical learning algorithms with applications in actuarial science
  • Award number:
    RGPIN-2020-07064
  • Fiscal year:
    2020
  • Funding amount:
    $170,000
  • Project type:
    Discovery Grants Program - Individual
A Robust and Efficient Statistical Framework for Handling Missing-Not-At-Random Data in Patient Reported Outcomes and Beyond
  • Award number:
    1953526
  • Fiscal year:
    2020
  • Funding amount:
    $170,000
  • Project type:
    Continuing Grant
CRII: III: Efficient and Robust Statistical Estimation from Nonlinear Compressed Measurements
  • Award number:
    1948133
  • Fiscal year:
    2020
  • Funding amount:
    $170,000
  • Project type:
    Standard Grant
Robust and efficient statistical learning algorithms with applications in actuarial science
  • Award number:
    DGECR-2020-00372
  • Fiscal year:
    2020
  • Funding amount:
    $170,000
  • Project type:
    Discovery Launch Supplement
Robust and efficient statistical inference methods for genomics
  • Award number:
    10308395
  • Fiscal year:
    2019
  • Funding amount:
    $170,000
  • Project type:
Robust and efficient statistical inference methods for genomics
  • Award number:
    10526429
  • Fiscal year:
    2019
  • Funding amount:
    $170,000
  • Project type: