Empirical Process Theory for Complex Statistical Data Integration
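Basic information
- Award number: 2014971
- Principal investigator:
- Amount: $204,400
- Host institution:
- Host institution country: United States
- Project type: Continuing Grant
- Fiscal year: 2020
- Funding country: United States
- Duration: 2020-07-01 to 2024-06-30
- Project status: Completed
- Source:
- Keywords:
Project abstract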
Nowadays, every organization collects various data sets from numerous sources. If these data sets are combined, the improved quality of inference will accelerate scientific discovery. Statistical analysis of merged data is, however, challenging because each data set often represents only a part of the entire target population and because combined data contain unidentified duplicated records from data sets that partially share data sources. This research provides theoretical and methodological foundations to address the unavoidable bias in data integration that arises from heterogeneity and duplication in merged data. With the proposed data integration technique, findings previously limited to smaller populations can be combined and generalized to a broader population. The proposed methodology also protects privacy because it avoids record linkage, which identifies duplication through private information. Another benefit is that it overcomes the shortage of relevant information in individual data sources without collecting costly (and possibly small) independent and identically distributed data all over again. Expected outcomes from this project will encourage the efficient and socially proper use of massive data in modern data analysis. Graduate student support will be used for interdisciplinary activities and writing code.
The project delves into the intersection of empirical process theory, semi- and non-parametric inference, and sampling theory. Existing theory and methods fail to provide sufficient tools to study complex data integration problems characterized by bias and dependence due to heterogeneity and duplication. Inverse probability-weighted empirical process theory requires a special independence structure on weights and variables. Semi- and non-parametric inference often relies on the availability of an independent and identically distributed sample. Sampling theory handles dependence in a specific design but focuses on a parametric model without accounting for randomness in collected variables in a finite population framework. To address the paucity of probabilistic tools and techniques, the PI will develop a unified framework in connection with a weighted empirical process motivated by multiple frame surveys. This weighted empirical process is computable without identifying duplicated selections. The proposed tools and techniques will play a critical role in studying general sample selection and missing data mechanisms, such as convenience samples, semiparametric estimation with misspecified models, and multiple observations for duplicated subjects in overlapping data sources. The particular problems under investigation include (a) uniform limit theorems under general missingness mechanisms, (b) robust M-estimation under model misspecification for data integration, and (c) general theory to integrate multiple probability measures that correspond to heterogeneous data sources.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project outcomes
Journal articles (5)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Semiparametric inference for merged data from multiple data sources
- DOI: 10.1016/j.jspi.2021.05.002
- Published: 2022
- Journal:
- Impact factor: 0.9
- Authors: Saegusa, Takumi
- Corresponding author: Saegusa, Takumi
Parametric Bootstrap Confidence Intervals for the Multivariate Fay–Herriot Model
- DOI: 10.1093/jssam/smaa038
- Published: 2022
- Journal:
- Impact factor: 2.1
- Authors: Saegusa, T.
- Corresponding author: Saegusa, T.
Mann–Whitney test for two‐phase stratified sampling
- DOI: 10.1002/sta4.321
- Published: 2021
- Journal:
- Impact factor: 1.7
- Authors: Saegusa, Takumi
- Corresponding author: Saegusa, Takumi
Nonparametric inference for distribution functions with stratified samples
- DOI: 10.1016/j.jspi.2021.05.001
- Published: 2021
- Journal:
- Impact factor: 0.9
- Authors: Saegusa, Takumi
- Corresponding author: Saegusa, Takumi
Other publications by Takumi Saegusa
Confidence bands for a distribution function with merged data from multiple sources
- DOI:
- Published: 2020
- Journal:
- Impact factor: 0
- Authors: Takumi Saegusa
- Corresponding author: Takumi Saegusa
Supplementary Material for "Weighted Likelihood Estimation under Two-phase Sampling"
- DOI:
- Published: 2012
- Journal:
- Impact factor: 0
- Authors: Takumi Saegusa; J. Wellner
- Corresponding author: J. Wellner
Large sample theory for merged data from multiple sources
- DOI: 10.1214/18-aos1727
- Published: 2018-05
- Journal:
- Impact factor: 0
- Authors: Takumi Saegusa
- Corresponding author: Takumi Saegusa
Variance Estimation under Two‐Phase Sampling
- DOI:
- Published: 2015
- Journal:
- Impact factor: 0
- Authors: Takumi Saegusa
- Corresponding author: Takumi Saegusa
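Similar NSFC (National Natural Science Foundation of China) grants
Research on diverse, high-fidelity techniques for Neural Process models
- Award number: 62306326
- Approval year: 2023
- Funding amount: CNY 300,000
- Project type: Young Scientists Fund
Key nuclear reactions of the weak r-process in magnetorotational supernova explosions
- Award number: 12375145
- Approval year: 2023
- Funding amount: CNY 520,000
- Project type: General Program
Bayesian nonparametric methods for multi-armed bandit processes
- Award number: 71771089
- Approval year: 2017
- Funding amount: CNY 480,000
- Project type: General Program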
Similar overseas grants
CAREER: Integrating Graph Theory based Networks with Machine Learning for Enhanced Process Synthesis and Design
- Award number: 2339588
- Fiscal year: 2024
- Funding amount: $204,400
- Project type: Continuing Grant
Development of Fabrication Process for Hard Ceramic Coatings in Liquid Phase Based on Chemical Equilibrium Theory and Their Structural Control
- Award number: 23K04433
- Fiscal year: 2023
- Funding amount: $204,400
- Project type: Grant-in-Aid for Scientific Research (C)
The Construction Process of Media Memory Based on Media Experience and Media Practice Theory: Focusing on Asia-Pacific War Memory
- Award number: 23K18840
- Fiscal year: 2023
- Funding amount: $204,400
- Project type: Grant-in-Aid for Research Activity Start-up
Developing a theory of general quantum process manipulation and its application
- Award number: 23K19028
- Fiscal year: 2023
- Funding amount: $204,400
- Project type: Grant-in-Aid for Research Activity Start-up
Development of infiltration theory and data-driven process to realize melt-infiltration additive manufacturing
- Award number: 22K18285
- Fiscal year: 2022
- Funding amount: $204,400
- Project type: Grant-in-Aid for Challenging Research (Pioneering)
An exploratory research about Taiwanese EMS companies' growth process: Toward further expansion of Penrose's growth theory
- Award number: 22K01632
- Fiscal year: 2022
- Funding amount: $204,400
- Project type: Grant-in-Aid for Scientific Research (C)
Design cognition at hackathons - a dual process theory approach
- Award number: 568950-2022
- Fiscal year: 2022
- Funding amount: $204,400
- Project type: Alexander Graham Bell Canada Graduate Scholarships - Doctoral
An automated, fully auditable company revenue analysis generator, using natural language processing and argumentation theory to replace a currently manual process
- Award number: 10017815
- Fiscal year: 2022
- Funding amount: $204,400
- Project type: Collaborative R&D
A study of the filmmaking process and body theory in Shusaku Arakawa + Madeline Gins through the construction of an archive of film materials
- Award number: 22H00616
- Fiscal year: 2022
- Funding amount: $204,400
- Project type: Grant-in-Aid for Scientific Research (B)
Realization of completely distortion-free processing by plasma nano-manufacturing process and exploration of its theory
- Award number: 21H05005
- Fiscal year: 2021
- Funding amount: $204,400
- Project type: Grant-in-Aid for Scientific Research (S)