
Robust and Efficient Statistical Inference in Large Scale Semi-Supervised Settings


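Basic Information

Award number:
2113768
Principal investigator:
Abhishek Chakrabortty
Amount:
$170,000
Institution:
Texas A&M University
Institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2021
Funding country:
United States
Period:
2021-08-01 to 2024-07-31
Status:
Completed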

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2113768&HistoricalAwards=false
Keywords:
Robust Efficient Statistical Inference Large

Abstract

This project will develop methods for robust statistical inference in semi-supervised settings. Unlike more traditional data settings, semi-supervised settings are characterized by two types of available data: 1) a typically small or moderate-sized labeled (or supervised) dataset containing observations for a response (or outcome) and a set of covariates (or predictors), and 2) a much larger unlabeled (or unsupervised) dataset having observations only for the covariates. Such settings arise naturally whenever the covariates are easily available for a large cohort, while the response may be difficult and/or expensive to obtain due to practical constraints. These settings are increasingly relevant in modern studies in the big data era, with large unlabeled databases (often electronically recorded) becoming easily available (and tractable) alongside labeled data. Examples are ubiquitous across many disciplines, including computer science, machine learning, econometrics, and biomedical applications like electronic health records and integrative genomics. Statistical inference in semi-supervised settings is therefore of substantial interest. The central question is when and how one can use the extra information available from the large unlabeled data to “improve” upon a corresponding supervised approach, where improvement could be in terms of efficiency or robustness or both. This project aims to provide answers to such questions by developing a class of novel, provable and scalable semi-supervised inference methods for a range of fundamental problems in two fairly distinct and active research areas: 1) causal inference in semi-supervised settings, and 2) semi-supervised inference in the presence of selection bias in labeling. The research outlined in the project will lead to advances in bridging some major gaps in the existing literature and providing a much-needed unified understanding of semi-supervised inference and its subtleties. 
The methods will also have wide applicability to various domain areas, e.g. biomedical studies for precision medicine and causal inference. The project also has a significant education component, including mentoring of graduate students and curriculum development via short courses to raise awareness about these exciting new areas in modern statistics.

In the first part of the project, the PI will consider causal inference in semi-supervised settings under the potential outcome framework, and explore semi-supervised inference for popular causal parameters, e.g. the average treatment effect and the quantile treatment effect, both of which have been widely studied in supervised settings but rarely so under semi-supervised settings. The PI will aim to develop semi-supervised methods for so-called doubly robust estimation of such parameters that can lead to improved (if not optimal) efficiency, as well as much stronger robustness properties than their best achievable supervised counterparts. The second part of the project will consider semi-supervised inference where the labeling mechanism has inherent selection bias, thus making the labeled and unlabeled data unequally distributed. Such settings, while of great practical relevance, have rarely been addressed so far, partly because their analysis is quite challenging since the labeling fraction decays to zero leading to a natural violation of the so-called positivity/overlap assumption. Under this setting, the PI will explore efficient and rate-optimal semi-supervised inference for various parameters, e.g. the mean response and the average treatment effect (under a causal framework), via doubly robust estimation methods, as well as modeling strategies for estimating the decaying propensity score which arises as an inevitable challenge and is of independent interest. 
Throughout, the PI's emphasis will be on developing methods with rigorous theoretical guarantees as well as efficient implementation that meets the scalability demanded by the intended applications on large modern datasets. The proposed methods will also bring together a synergy of tools and ideas from classical semi-parametric inference and modern high dimensional statistics theory.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
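
As a rough illustration of the doubly robust estimation idea mentioned in the abstract, the following Python sketch estimates a mean response when the response is observed only for a labeled subset selected with bias. It is not the project's method. It assumes a fixed (non-decaying) labeling propensity, a low-dimensional linear outcome model, a logistic propensity model, simulated data, and no cross-fitting. The names `fit_linear`, `fit_logistic`, and `dr_mean` are hypothetical helpers introduced only for this example.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def fit_logistic(X, r, n_iter=25):
    """Logistic regression with an intercept, fitted by Newton's method."""
    A = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        grad = A.T @ (r - p)
        hess = (A * (p * (1.0 - p))[:, None]).T @ A
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def dr_mean(X, y, r):
    """Doubly robust (AIPW-type) estimate of E[Y]; y is used only where r == 1."""
    A = np.column_stack([np.ones(len(X)), X])
    labeled = r == 1
    # Outcome model: fitted on labeled rows, predicted for every row.
    m_hat = A @ fit_linear(X[labeled], y[labeled])
    # Propensity model: probability of being labeled, fitted on all rows.
    pi_hat = 1.0 / (1.0 + np.exp(-A @ fit_logistic(X, r)))
    # Inverse-propensity-weighted residual correction on labeled rows only.
    resid = np.where(labeled, y - m_hat, 0.0)
    return np.mean(m_hat + resid / pi_hat)


# Simulated data (an assumption for illustration): the true mean response is 1.
rng = np.random.default_rng(0)
N = 20000
X = rng.normal(size=(N, 2))
y_full = 1.0 + 2.0 * X[:, 0] - X[:, 1] + rng.normal(size=N)

# Labeling depends on the first covariate, so the labeled subset is biased.
pi_true = 1.0 / (1.0 + np.exp(-(-1.0 + 0.5 * X[:, 0])))
r = (rng.uniform(size=N) < pi_true).astype(int)
y = np.where(r == 1, y_full, np.nan)  # responses hidden for unlabeled rows

naive = y[r == 1].mean()  # labeled-only average, ignores selection bias
dr = dr_mean(X, y, r)     # uses covariates from all rows
print(f"labeled-only mean: {naive:.3f}")
print(f"doubly robust estimate: {dr:.3f}")
```

Because the simulated responses have mean 1 by construction, the two printed numbers can be compared directly against that target. The estimator is called doubly robust because it remains consistent if either the outcome model or the propensity model is correctly specified; in this sketch both happen to be correct.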
