Enhancing Synthetic Data Techniques for Practical Applications
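Basic information
- Award number: 2217456
- Principal investigator:
- Amount: $400,000
- Host institution:
- Host institution country: United States
- Project category: Standard Grant
- Fiscal year: 2022
- Funding country: United States
- Period: 2022-08-15 to 2025-07-31
- Project status: Ongoing
- Source:
- Keywords: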
Project abstract
This research project will advance statistical and computational methods for releasing high quality synthetic data as public use files. In the face of high and expanding risks of unintended and/or illegal disclosures, many data stewards are considering synthetic public use files. These comprise simulated records, with values generated from statistical models estimated with the confidential data. This can reduce disclosure risks, since it can be difficult to re-identify individuals and their sensitive attributes when the released values are simulated. Despite growing interest in synthetic data solutions for data dissemination, there are significant gaps in the theory and methods of synthetic data that complicate and hinder practical implementations. This project will address three critical yet unresolved topics in synthetic data, namely (1) assessing data subjects' disclosure risks, (2) facilitating data analysts' evaluation of their synthetic data inferences, and (3) generating synthetic datasets in surveys with complex designs. The results of this research will offer federal agencies, survey organizations, research centers, and other data producers the means to create safer and more analytically useful synthetic data products. In turn, this will help data stewards to better meet the challenges of public use data dissemination. The project will train Ph.D. and undergraduate students to become researchers in data privacy protection methods, thereby contributing to the pipeline of experts in data privacy and in statistics and data science more broadly. The project also will develop and disseminate software code that implements the various approaches.

This research project will address three main questions. First, the project will develop computational techniques for estimating Bayesian posterior probabilities of disclosures; that is, probabilities that sensitive values can be learned from the synthetic data releases. These techniques facilitate disclosure risk assessment on datasets with many observations and many variables, thereby allowing agencies to replace current ad hoc assessments with principled and quantifiable measures of disclosure risk. Second, the project will develop novel verification measures that data stewards can use to provide feedback to secondary data analysts on the quality of their particular inferences without leaking too much information about the confidential data. The new measures will enable formally private verification of common survey-weighted estimation tasks. Third, the project will develop new synthesis and inferential methods and recommend best practices for incorporating complex survey designs in synthetic data. These new methods will adapt Bayesian bootstraps, multiple imputation, and multilevel regression and poststratification for data synthesis, while also enabling the use of popular techniques from machine learning for generating synthetic data in complex samples.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
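The abstract describes synthetic public use files as simulated records whose values are generated from statistical models estimated with the confidential data. The following is a minimal sketch of that general idea, not the project's own method or software. It assumes a toy dataset with one public variable (age) and one sensitive variable (income), a simple normal linear regression as the synthesis model, and plug-in parameter estimates; a fuller approach would also draw the model parameters from their posterior distribution. All names and values here are hypothetical and for illustration only.

```python
import random
import statistics


def fit_linear_model(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, residual_sd).
    """
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    resid_sd = (sum(r * r for r in residuals) / (len(x) - 2)) ** 0.5
    return intercept, slope, resid_sd


def synthesize(x, model, rng):
    """Replace the sensitive variable with draws from the fitted model."""
    intercept, slope, resid_sd = model
    return [rng.gauss(intercept + slope * xi, resid_sd) for xi in x]


def make_synthetic_datasets(x, y, m=3, seed=0):
    """Produce m synthetic copies; x is kept as-is, y is simulated."""
    rng = random.Random(seed)
    model = fit_linear_model(x, y)  # estimated on the confidential data
    return [list(zip(x, synthesize(x, model, rng))) for _ in range(m)]


# Toy "confidential" data (hypothetical values, for illustration only).
ages = [25, 32, 41, 47, 53, 60]
incomes = [31.0, 38.5, 52.0, 55.5, 61.0, 70.5]

synthetic = make_synthetic_datasets(ages, incomes, m=3, seed=42)
for i, dataset in enumerate(synthetic, start=1):
    print(f"synthetic dataset {i}:", [(a, round(v, 1)) for a, v in dataset])
```

Releasing several synthetic copies rather than one mirrors the multiple imputation setting mentioned in the abstract, where analysts combine estimates across copies. This sketch does not address disclosure risk assessment or survey weights, which are the open problems the project targets.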
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Jerome Reiter
The impact of lead and other exposures on early school performance
- DOI: 10.1016/j.ntt.2008.03.018
- Published: 2008-05-01
- Journal:
- Impact factor:
- Authors: Jerome Reiter; Dohyeong Kim; Andy Hull; Marie Lynn Miranda
- Corresponding author: Marie Lynn Miranda
Other grants by Jerome Reiter
Leveraging Auxiliary Information on Marginal Distributions in Multiple Imputation for Survey Nonresponse
- Award number: 1733835
- Fiscal year: 2017
- Amount: $400,000
- Project category: Standard Grant
CIF21 DIBBs: An Integrated System for Public/Private Access to Large-Scale, Confidential Social Science Data
- Award number: 1443014
- Fiscal year: 2015
- Amount: $400,000
- Project category: Standard Grant
NCRN-MN: Triangle Census Research Network
- Award number: 1131897
- Fiscal year: 2011
- Amount: $400,000
- Project category: Standard Grant
Multiple Imputation Methods for Handling Missing Data in Longitudinal Studies with Refreshment Samples
- Award number: 1061241
- Fiscal year: 2011
- Amount: $400,000
- Project category: Standard Grant
TC: Large: Collaborative Research: Practical Privacy: Metrics and Methods for Protecting Record-level and Relational Data
- Award number: 1012141
- Fiscal year: 2010
- Amount: $400,000
- Project category: Continuing Grant
Methodology for Improving Public Use Data Dissemination Via Multiply-Imputed, Partially Synthetic Data
- Award number: 0751671
- Fiscal year: 2008
- Amount: $400,000
- Project category: Continuing Grant
Related international grants
CAREER: New Frontiers of Private Learning and Synthetic Data
- Award number: 2339775
- Fiscal year: 2024
- Amount: $400,000
- Project category: Continuing Grant
Conference: UCLA Synthetic Data Workshop
- Award number: 2309349
- Fiscal year: 2023
- Amount: $400,000
- Project category: Standard Grant
Pediatric Hospitals as European drivers for multi-party computation and synthetic data generation capabilities across clinical specialties and data types
- Award number: 10103799
- Fiscal year: 2023
- Amount: $400,000
- Project category: EU-Funded
Conference: DMR-NIBIB Planning Workshop: Leveraging data-driven design and synthetic biology to enable next-generation active biomaterials
- Award number: 2335176
- Fiscal year: 2023
- Amount: $400,000
- Project category: Standard Grant
A synthetic data and generative A.I approach to verifying and validating A.I
- Award number: 10065801
- Fiscal year: 2023
- Amount: $400,000
- Project category: Collaborative R&D
PHEMS - Pediatric Hospitals as European drivers for multi-party computation and synthetic data generation capabilities across clinical specialties and data types
- Award number: 10103155
- Fiscal year: 2023
- Amount: $400,000
- Project category: EU-Funded
INSAFEDARE: INNOVATIVE APPLICATIONS OF ASSESSMENT AND ASSURANCE OF DATA AND SYNTHETIC DATA FOR REGULATORY DECISION SUPPORT
- Award number: 10066712
- Fiscal year: 2023
- Amount: $400,000
- Project category: EU-Funded
Mitigating bias for underrepresented groups with multimorbidity and frailty through AI-based synthetic data generation
- Award number: 2897596
- Fiscal year: 2023
- Amount: $400,000
- Project category: Studentship
Collaborative Research: SCH: Therapeutic and Diagnostic System for Inflammatory Bowel Diseases: Integrating Data Science, Synthetic Biology, and Additive Manufacturing
- Award number: 2306740
- Fiscal year: 2023
- Amount: $400,000
- Project category: Standard Grant
I-Corps: Trustworthy Synthetic Data Generation
- Award number: 2317549
- Fiscal year: 2023
- Amount: $400,000
- Project category: Standard Grant