Empirical Process Theory for Complex Statistical Data Integration

Basic Information

Award number:
2014971
Principal investigator:
Takumi Saegusa
Amount:
Approx. $204,400
Institution:
University of Maryland, College Park
Institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2020
Funding country:
United States
Project period:
2020-07-01 to 2024-06-30
Project status:
Completed

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2014971&HistoricalAwards=false
Keywords:
Empirical Process Theory Complex Statistical

Project Abstract

Nowadays, every organization collects various data sets from numerous sources. If these data sets are combined, improved quality of inference will accelerate scientific discovery. Statistical analysis of merged data is, however, challenging because each data set often represents only a part of the entire target population and because combined data contain unidentified duplicated records from data sets that partially share data sources. This research provides theoretical and methodological foundations to address the issue of unavoidable bias in data integration arising from heterogeneity and duplication in merged data. With the proposed data integration technique, findings previously limited to smaller populations can be combined and generalized to a broader population. The proposed methodology serves well for privacy protection by avoiding record linkage that identifies duplication through private information. Another benefit is to overcome the shortage of relevant information in individual data sources without collecting costly (and possibly small) independent and identically distributed data all over again. Expected outcomes from this project will encourage the efficient and socially proper use of massive data in modern data analysis. The graduate student support will be used on interdisciplinary activities and writing code.

The project delves into the intersection of empirical process theory, semi- and non-parametric inference, and sampling theory. Existing theory and methods fail to provide sufficient tools to study complex data integration problems characterized by bias and dependence due to heterogeneity and duplication. Inverse probability-weighted empirical process theory requires a special independence structure on weights and variables. Semi- and non-parametric inference often relies on the availability of an independent and identically distributed sample. Sampling theory handles dependence in a specific design but focuses on a parametric model without accounting for randomness in collected variables in a finite population framework. To address the paucity of probabilistic tools and techniques, the PI will develop a unified framework in connection with a weighted empirical process motivated by multiple frame surveys. This weighted empirical process is computable without identifying duplicated selections. The proposed tools and techniques will play a critical role in studying general sample selection and missing data mechanisms such as a convenience sample, semiparametric estimation with misspecified models, and multiple observations for duplicated subjects in overlapping data sources. The particular problems under investigation include (a) uniform limit theorems under general missingness mechanisms, (b) robust M-estimation under model misspecification for data integration, and (c) general theory to integrate multiple probability measures that correspond to heterogeneous data sources.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
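
The idea of weighting merged data without identifying duplicated records can be illustrated with a standard device from multiple frame surveys, the multiplicity-adjusted estimator. The following is a minimal Python sketch and not the estimator developed in this project. It assumes that each sampled record carries a known selection probability within its own source and a known count of how many sources cover its unit. The function names, the two simulated sources, and all numeric settings are hypothetical.

```python
import numpy as np


def multiplicity_weighted_total(y, incl_prob, multiplicity):
    """Estimate a population total from stacked samples of overlapping sources.

    y            : values of sampled records, stacked over sources (duplicates kept)
    incl_prob    : selection probability of each record within its own source
    multiplicity : number of sources that cover the record's unit
    """
    y = np.asarray(y, dtype=float)
    # Weight = 1 / (selection probability * number of covering sources).
    w = 1.0 / (np.asarray(incl_prob, dtype=float) * np.asarray(multiplicity, dtype=float))
    return float(np.sum(w * y))


def simulate(seed=0, n_pop=200_000, p_a=0.3, p_b=0.1):
    """Toy population with two overlapping sources (all settings are assumptions)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=n_pop)
    y = 2.0 + 3.0 * x + rng.normal(size=n_pop)

    in_a = x < 0.7            # source A covers units with x < 0.7
    in_b = x > 0.4            # source B covers units with x > 0.4
    m = in_a.astype(int) + in_b.astype(int)   # 1 or 2 for every unit

    # Independent Bernoulli selection within each source.
    take_a = in_a & (rng.uniform(size=n_pop) < p_a)
    take_b = in_b & (rng.uniform(size=n_pop) < p_b)

    # Stack the two samples; units drawn by both sources appear twice
    # and are never matched to each other.
    y_s = np.concatenate([y[take_a], y[take_b]])
    pi_s = np.concatenate([np.full(take_a.sum(), p_a), np.full(take_b.sum(), p_b)])
    m_s = np.concatenate([m[take_a], m[take_b]])

    weighted_mean = multiplicity_weighted_total(y_s, pi_s, m_s) / n_pop
    naive_mean = float(y_s.mean())
    return float(y.mean()), weighted_mean, naive_mean


if __name__ == "__main__":
    truth, weighted, naive = simulate()
    print(f"population mean {truth:.3f}, weighted {weighted:.3f}, naive pooled {naive:.3f}")
```

In this toy setup the unweighted pooled mean is pulled toward the region that source A samples more heavily, while the weighted mean targets the whole-population mean. No step matches records across sources; only the per-record selection probability and source count are used.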
