Enhancing Synthetic Data Techniques for Practical Applications

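Basic Information

Award number:
2217456
Principal investigator:
Jerome Reiter
Amount:
Approximately $400,000
Host institution:
Duke University
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2022
Funding country:
United States
Period:
2022-08-15 to 2025-07-31
Status:
Ongoing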

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2217456&HistoricalAwards=false
Keywords:
Enhancing Synthetic Data Techniques Practical

Project Abstract

This research project will advance statistical and computational methods for releasing high-quality synthetic data as public use files. In the face of high and expanding risks of unintended and/or illegal disclosures, many data stewards are considering synthetic public use files. These comprise simulated records, with values generated from statistical models estimated with the confidential data. This can reduce disclosure risks, since it can be difficult to re-identify individuals and their sensitive attributes when the released values are simulated. Despite growing interest in synthetic data solutions for data dissemination, there are significant gaps in the theory and methods of synthetic data that complicate and hinder practical implementations. This project will address three critical yet unresolved topics in synthetic data, namely (1) assessing data subjects' disclosure risks, (2) facilitating data analysts' evaluation of their synthetic data inferences, and (3) generating synthetic datasets in surveys with complex designs. The results of this research will offer federal agencies, survey organizations, research centers, and other data producers the means to create safer and more analytically useful synthetic data products. In turn, this will help data stewards to better meet the challenges of public use data dissemination. The project will train Ph.D. and undergraduate students to become researchers in data privacy protection methods, thereby contributing to the pipeline of experts in data privacy and in statistics and data science more broadly. The project also will develop and disseminate software code that implements the various approaches.

This research project will address three main questions. First, the project will develop computational techniques for estimating Bayesian posterior probabilities of disclosures; that is, probabilities that sensitive values can be learned from the synthetic data releases. These techniques facilitate disclosure risk assessment on datasets with many observations and many variables, thereby allowing agencies to replace current ad hoc assessments with principled and quantifiable measures of disclosure risk. Second, the project will develop novel verification measures that data stewards can use to provide feedback to secondary data analysts on the quality of their particular inferences without leaking too much information about the confidential data. The new measures will enable formally private verification of common survey-weighted estimation tasks. Third, the project will develop new synthesis and inferential methods and recommend best practices for incorporating complex survey designs in synthetic data. These new methods will adapt Bayesian bootstraps, multiple imputation, and multilevel regression and poststratification for data synthesis, while also enabling the use of popular techniques from machine learning for generating synthetic data in complex samples.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
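
To make the first research question in the abstract concrete, here is a toy sketch of a Bayesian posterior probability of disclosure. It is an illustration under strong simplifying assumptions, not the project's method: the data hold a single categorical variable, the synthesizer is assumed to draw cell probabilities from a Dirichlet posterior (prior parameter `alpha`) and then simulate records, and the intruder is assumed to know every confidential record except the target's. The function names `log_dirmult` and `disclosure_posterior` are hypothetical.

```python
import math


def log_dirmult(synth_counts, params):
    """Log Dirichlet-multinomial likelihood of the synthetic counts,
    omitting the multinomial coefficient (it does not depend on params)."""
    total_param = sum(params)
    total_synth = sum(synth_counts)
    value = math.lgamma(total_param) - math.lgamma(total_param + total_synth)
    for m, a in zip(synth_counts, params):
        value += math.lgamma(a + m) - math.lgamma(a)
    return value


def disclosure_posterior(known_counts, synth_counts, prior=None, alpha=1.0):
    """Posterior over the target record's category, given the synthetic data.

    known_counts: category counts of all confidential records except the target
                  (assumed known to the intruder, a worst-case assumption).
    synth_counts: category counts in the released synthetic file.
    prior:        intruder's prior over the target's category (uniform if None;
                  every entry must be positive).
    alpha:        assumed Dirichlet prior parameter used by the synthesizer.
    """
    k = len(known_counts)
    prior = prior or [1.0 / k] * k
    log_post = []
    for c in range(k):
        counts = list(known_counts)
        counts[c] += 1  # hypothesize that the target record has category c
        params = [alpha + n for n in counts]
        log_post.append(math.log(prior[c]) + log_dirmult(synth_counts, params))
    # Normalize on the log scale for numerical stability.
    top = max(log_post)
    unnorm = [math.exp(v - top) for v in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Ten known records fall in categories 0 and 1 only, yet the synthetic file
# contains two records in category 2, which shifts belief toward category 2.
posterior = disclosure_posterior(known_counts=[5, 5, 0], synth_counts=[4, 4, 2])
print(posterior)
```

In this toy setting the posterior can be computed in closed form. With many records and many variables the space of candidate values grows quickly, which is the scaling problem the abstract says the project's computational techniques are meant to address.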
