Creating longitudinal datasets for linked administrative data research using synthetic data
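Basic information
- Grant number: ES/V005448/1
- Principal investigator:
- Amount: $205,600
- Host institution:
- Host institution country: United Kingdom
- Project type: Research Grant
- Fiscal year: 2021
- Funding country: United Kingdom
- Start and end dates: 2021 to (no data)
- Project status: Completed
- Source:
- Keywords:
Project abstract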
Administrative data hold great potential for informing public policy. However, this potential is not yet being realised because of restrictions around data access, linkage, and privacy protection. Governance procedures and approvals lead to long timescales and tight restrictions on data access, which can jeopardise publicly funded research.

One solution is to generate synthetic data that preserve the statistical properties of the original sources but do not correspond to any real individuals or pose privacy risks. These data could be widely shared, allowing researchers to understand the data structures, develop analysis plans and algorithms, and test different models. This could be done in parallel with applying for access to linked administrative datasets, streamlining the research process. Final refinements and analyses would be conducted on the real data.

Our study will test the feasibility of approaches for creating synthetic linked administrative datasets. We will compare two existing methods, 'Synthpop' (used to create synthetic versions of the Scottish Longitudinal Study) and 'Simulacrum' (used to create synthetic versions of the National Cancer Registry), with a new approach, 'Jomo', based on recent methodological developments for the imputation of missing data. We will evaluate these approaches using an exemplar of linking the third National Survey of Sexual Attitudes and Lifestyles (Natsal-3) to two administrative datasets: Hospital Episode Statistics (HES) and the National Pupil Database (NPD).

Natsal-3 is one of the largest population-based surveys of sexual behaviour in the world and collected data from 15,000 participants during 2010-2012. HES contains information on attendances at all NHS hospitals in England, allowing detailed analysis of procedures and diagnoses. NPD contains information on pupils attending state schools in England, including school achievement, absences, and special educational needs. Linkage between Natsal-3, HES and NPD will provide a unique opportunity to gain a deeper understanding of the social, behavioural and biological aspects of sexual and reproductive health, and to generate evidence to inform implementation of sexual health interventions.

We will first compare different methods for generating synthetic versions of the three datasets separately (since all have different structures and characteristics), based on how well the data generated by these methods represent the original data. We will also apply for approvals to link the data, to i) explore whether any additional considerations are needed when synthesising complex, linked data, and ii) generate synthetic versions of the linked data that can be shared with researchers more widely.

The quality and usability of synthetic data are highly dependent on the data generation model and the purpose of analysis. However, identifying all relevant variables and the possible dependencies or interactions between them is highly resource intensive. One of the challenges for synthetic data generation is therefore understanding whether there are situations in which generic versions of synthetic data may be sufficient for some purposes, or whether bespoke synthetic datasets (tailored to a specific research problem) are always required. We will explore this balance by engaging with data providers and researchers and determining the nature and practicality of the communication between the two that is required to produce acceptable outputs. We will also engage with the public to seek their views on the use of synthetic data.

Based on a set of exemplar research questions, we will generate synthetic data and compare feasibility and outputs from different approaches. To evaluate how well the synthetic data represent the real data, we will compare characteristics and statistical inferences from the synthetic data with those from the real data. Based on our findings, we will generate guidelines on the appropriate use of synthetic data.
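
The evaluation idea in the abstract (comparing characteristics and statistical inferences between synthetic and real data) can be illustrated with a toy sketch. The code below is not Synthpop, Simulacrum, or Jomo, and it is not the project's method. It assumes a hypothetical two-variable dataset, a deliberately simple parametric synthesiser (a normal model for `x` and a linear regression for `y` given `x`), and hypothetical helper names (`fit_line`, `synthesise`, `compare`), all chosen only for illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    return a, b, statistics.pstdev(resid)


def synthesise(xs, ys, n, rng):
    """Draw n synthetic (x, y) pairs from simple models fitted to the real data.

    Assumption: x is roughly normal and y depends linearly on x. Real
    synthesisers model many variables and far richer structure.
    """
    mu, sd = statistics.fmean(xs), statistics.stdev(xs)
    a, b, resid_sd = fit_line(xs, ys)
    syn_x = [rng.gauss(mu, sd) for _ in range(n)]
    syn_y = [a + b * x + rng.gauss(0, resid_sd) for x in syn_x]
    return syn_x, syn_y


def compare(real, synthetic):
    """Differences (synthetic minus real) in means and in the regression slope."""
    (rx, ry), (sx, sy) = real, synthetic
    return {
        "mean_x_diff": statistics.fmean(sx) - statistics.fmean(rx),
        "mean_y_diff": statistics.fmean(sy) - statistics.fmean(ry),
        "slope_diff": fit_line(sx, sy)[1] - fit_line(rx, ry)[1],
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Toy stand-in for "real" data: a hypothetical covariate and outcome.
real_x = [rng.gauss(40, 10) for _ in range(2000)]
real_y = [5 + 0.3 * x + rng.gauss(0, 2) for x in real_x]

syn_x, syn_y = synthesise(real_x, real_y, 2000, rng)
report = compare((real_x, real_y), (syn_x, syn_y))
for name, value in report.items():
    print(f"{name}: {value:+.3f}")
```

Small differences in the report suggest that, for this one analysis, the synthetic data reproduce the real data's characteristics and the inference of interest. A synthesiser that omitted the dependence of `y` on `x` would still match the means but not the slope, which is one way to see why the abstract asks whether generic synthetic data suffice or bespoke datasets are needed.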
Project outputs
Journal articles (3)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Synthetic data in medical research.
- DOI: 10.1136/bmjmed-2022-000167
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: Kokosi, Theodora; Harron, Katie
- Corresponding author: Harron, Katie
Other publications by Katie Harron
How has local authority expenditure on public health services for 0 to 5-year-olds in England changed over time? An analysis of national administrative data from 2016/17 to 2022/23
- DOI: 10.1016/s0140-6736(24)01982-2
- Published: 2024-11-01
- Journal:
- Impact factor: 88.500
- Authors: Louise McGrath-Lone; Anjali Raman Middleton; Amanda Clery; Catherine Bunting; Eirini-Christina Saloniki; Jenny Woodman; Katie Harron
- Corresponding author: Katie Harron
La déclaration RECORD (Reporting of Studies Conducted Using Observational Routinely Collected Health Data) : directives pour la communication des études réalisées à partir de données de santé collectées en routine
(English: The RECORD statement (Reporting of Studies Conducted Using Observational Routinely Collected Health Data): guidelines for reporting studies conducted using routinely collected health data)
- DOI: 10.1503/cmaj.181309
- Published: 2019
- Journal:
- Impact factor: 14.6
- Authors: Eric I. Benchimol; L. Smeeth; A. Guttmann; Katie Harron; D. Moher; I. Petersen; H. T. Sørensen; J. Januel; E. von Elm; Sinéad M. Langan
- Corresponding author: Sinéad M. Langan
The relationship between early life course air pollution exposure and general health in adolescence in the United Kingdom
- DOI: 10.1038/s41598-025-94107-w
- Published: 2025-05-14
- Journal:
- Impact factor: 3.900
- Authors: Gergő Baranyi; Katie Harron; Youchen Shen; Kees de Hoogh; Emla Fitzsimons
- Corresponding author: Emla Fitzsimons
Using administrative data to assess early-life policies
- DOI: 10.1016/s2468-2667(23)00127-5
- Published: 2023-07-01
- Journal:
- Impact factor: 25.200
- Authors: Katie Harron; Jenny Woodman
- Corresponding author: Jenny Woodman
Primary school attainment outcomes in children with neurodisability: Protocol for a population-based cohort study using linked education and hospital data from England
- DOI:
- Published: 2024
- Journal:
- Impact factor: 0
- Authors: Ayana Cant; A. Zylbersztejn; Laura Gimeno; R. Gilbert; Katie Harron
- Corresponding author: Katie Harron
Other grants held by Katie Harron
Enhancement of ECHILD with a mother-child and Unique Property Reference number link
- Grant number: ES/X000427/1
- Fiscal year: 2022
- Amount: $205,600
- Project type: Research Grant
Linking health and education data for research to improve outcomes for children in England - Supplement
- Grant number: ES/X003663/1
- Fiscal year: 2022
- Amount: $205,600
- Project type: Research Grant
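Similar grants from the National Natural Science Foundation of China
A magnetic resonance imaging study of asymmetric active brain structural changes during the course of schizophrenia
- Grant number: 81171275
- Year approved: 2011
- Amount: 140,000 yuan
- Project type: General Program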
Similar overseas grants
Understanding Risk Heterogeneity Following Child Maltreatment: An Integrative Data Analysis Approach.
- Grant number: 10721233
- Fiscal year: 2023
- Amount: $205,600
- Project type:
Impact of the expiration of temporary pandemic SNAP benefits on the healthfulness of supermarket food purchases
- Grant number: 10835393
- Fiscal year: 2023
- Amount: $205,600
- Project type:
Applying Computational Phenotypes To Assess Mental Health Disorders Among Transgender Patients in the United States
- Grant number: 10604723
- Fiscal year: 2023
- Amount: $205,600
- Project type:
Personality and Mortality Risk in Adulthood: Behavioral and Physiological Mechanisms
- Grant number: 10645631
- Fiscal year: 2023
- Amount: $205,600
- Project type:
Psychobiological Mechanisms Underlying the Association Between Early Life Stress and Depression Across Adolescence
- Grant number: 10749429
- Fiscal year: 2023
- Amount: $205,600
- Project type:
Substance Use and Firearm Injuries among Medicaid-enrolled Youth
- Grant number: 10811094
- Fiscal year: 2023
- Amount: $205,600
- Project type:
Using real-world evidence to define safe pain management strategies in cirrhosis
- Grant number: 10808794
- Fiscal year: 2023
- Amount: $205,600
- Project type:
Low Dose Computed Tomography (LDCT) Eligibility and Outcome differences between Sexual and Gender Minorities and their Sexual and Gender Majority Counterparts
- Grant number: 10605428
- Fiscal year: 2023
- Amount: $205,600
- Project type:
Evolution and resolution of ARDS molecular phenotypes
- Grant number: 10592022
- Fiscal year: 2023
- Amount: $205,600
- Project type:
Real time relapse risk scoring for Opioid Use Disorder (OUD) from clinical trial datasets
- Grant number: 10585452
- Fiscal year: 2023
- Amount: $205,600
- Project type: