Privacy-protecting distributed analysis of biomedical big data
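Basic information
- Grant number: 9159815
- Principal investigator:
- Amount: $500,100
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year: 2016
- Funding country: United States
- Project period: 2016-09-30 to 2019-06-30
- Project status: Completed
- Source: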
- Keywords: Agreement; Big Data; Bioinformatics; Biomedical Research; Clinical Research; Code; Complex; Computer software; Confidentiality of Patient Information; Data; Data Analyses; Data Protection; Data Science; Data Set; Data Sources; Databases; Development; Distant; Electronic Health Record; Environment; Funding; Health; Healthcare Systems; Housing; Individual; Insurance; Linear Regressions; Link; Logistic Regressions; Methods; Multicenter Studies; Patients; Performance; Privacy; Process; Programming Languages; Public Health; Registries; Regression Analysis; Research; Research Personnel; Security; Sentinel; Site; Software Tools; Source; Statistical Computing; Statistical Data Interpretation; Statistical Models; System; Technology; Testing; United States National Institutes of Health; Work; base; big biomedical data; collaboratory; data sharing; data structure; design; distributed data; experience; handheld mobile device; improved; multidisciplinary; open data; open source; patient oriented; precision medicine; programs; real world application; social media; statistics; systems research; tool
Project abstract
Advances in technology, bioinformatics, and data science have made it possible to analyze large and complex
databases to generate evidence that improves public health and accelerates the development of precision
medicine. However, the advent of big data has also raised concerns about privacy and confidentiality. This
application is focused on data privacy in vertically partitioned data, a data environment where information
about an individual is available in two or more data sources. This type of data structure is common in
biomedical research and is expected to grow exponentially as information from the same individual is
increasingly collected in multiple sources, such as insurance claims databases, electronic health records,
registries, social media, wearables, and mobile devices. Combining multiple databases provides a more
complete health profile about the patient and generates more robust evidence. However, concerns about data
privacy, confidentiality, and security, and constraints in governance and institutional agreements make it highly
challenging or sometimes impossible to physically pool different data sources. We propose to develop an
open-source, freely available software tool that will employ a cutting-edge method – distributed regression – to
analyze vertically partitioned datasets. The method does not require data to be combined physically, but
produces results statistically equivalent to those obtained if the datasets were linked and pooled centrally at one site.
Instead of sharing patient-level information, participating sites will transfer only a non-identifiable information
matrix (a design matrix used in fitting statistical models) and other summary-level statistics needed in the
statistical modeling process. This approach offers much greater protection for data privacy while allowing one
to perform sophisticated statistical analysis. The software tool will be developed, tested, and fine-tuned using
both simulated datasets and the real-world data from Optum Labs, which houses one of the largest vertically
partitioned datasets in the U.S. with claims and electronic health record data from over 5 million patients. The
tool will be made compatible with PopMedNet™, an open-source data-sharing platform currently used by
several large national initiatives such as the NIH Health Care Systems Research Collaboratory Distributed
Research Network, the PCORI-funded National Patient-Centered Clinical Research Network (PCORnet), and
the FDA-funded Sentinel program. The tool is therefore highly scalable and can have immediate impacts on
real-world big data analysis. The multidisciplinary study team includes researchers who pioneered some of the
distributed regression approaches and experts who have extensive experience in multi-center studies. The
distributed regression method has great potential to shift the paradigm of multi-center big biomedical
research from transferring potentially identifiable patient-level data to sharing non-identifiable
summary-level information. The proposed software tool will be a major step towards real-world application
of this state-of-the-art privacy-protecting analytic approach.
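
The claim that summary-level statistics can reproduce the pooled result can be illustrated with a minimal sketch for ordinary linear regression. This is not the project's software or its protocol: the site names, variable names, and simulated data are all hypothetical, and the cross-site products (marked in the comments) are computed directly here only for illustration. In a real vertically partitioned setting they would have to come from a privacy-protecting protocol, which this sketch does not implement.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # the same individuals at both sites; rows are assumed already aligned

# Hypothetical site A (e.g. claims) holds an intercept and two covariates.
X_a = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
# Hypothetical site B (e.g. EHR) holds one covariate and the outcome.
X_b = rng.normal(size=(n, 1))
y = X_a @ np.array([1.0, 0.5, -0.3]) + 2.0 * X_b[:, 0] + rng.normal(scale=0.1, size=n)

# Within-site summary matrices: each site can compute these on its own.
g_aa = X_a.T @ X_a
g_bb = X_b.T @ X_b
xty_b = X_b.T @ y

# Cross-site products: ASSUMPTION, computed directly only for this sketch.
# A real deployment would obtain them through a secure protocol (not shown).
g_ab = X_a.T @ X_b
xty_a = X_a.T @ y

# The analysis center sees only the small summary matrices, never the rows.
gram = np.block([[g_aa, g_ab], [g_ab.T, g_bb]])
xty = np.concatenate([xty_a, xty_b])
beta_distributed = np.linalg.solve(gram, xty)

# Reference fit on physically pooled data, for comparison only.
beta_pooled = np.linalg.lstsq(np.hstack([X_a, X_b]), y, rcond=None)[0]

print(np.allclose(beta_distributed, beta_pooled))
```

The summary matrices have dimensions set by the number of covariates rather than the number of patients, which is why the coefficients can be recovered without moving row-level records. Logistic regression would need an iterative version of the same idea, which is beyond this sketch.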
Project outcomes
Journal articles: 0
Monographs: 0
Research awards: 0
Conference papers: 0
Patents: 0
Other grants by Darren Toh
Identifying treatment-resistant depression in automated databases
- Grant number: 8110228
- Fiscal year: 2011
- Funding amount: $500,100
- Project category:
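Similar National Natural Science Foundation of China (NSFC) grants
Scalable Learning and Optimization: High-dimensional Models and Online Decision-Making Strategies for Big Data Analysis
- Grant number:
- Approval year: 2024
- Funding amount: (blank), in units of 10,000 yuan
- Project category: Collaborative Innovation Research Team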
Similar overseas grants
Bioinformatics and Big Data Analytics
- Grant number: CRC-2021-00259
- Fiscal year: 2022
- Funding amount: $500,100
- Project category: Canada Research Chairs
Bioinformatics And Big Data Analytics
- Grant number: CRC-2016-00137
- Fiscal year: 2021
- Funding amount: $500,100
- Project category: Canada Research Chairs
Byte-sized bioinformatics: introducing Big Data through computational biology
- Grant number: ST/T000872/1
- Fiscal year: 2020
- Funding amount: $500,100
- Project category: Research Grant
Bioinformatics and big data analytics
- Grant number: CRC-2016-00137
- Fiscal year: 2020
- Funding amount: $500,100
- Project category: Canada Research Chairs
CLIMB-BIG-DATA: A Cloud Infrastructure for Big-Data Microbial Bioinformatics
- Grant number: MR/T030062/1
- Fiscal year: 2020
- Funding amount: $500,100
- Project category: Research Grant
Mining and Processing Big Data in Bioinformatics: Mouse Phenotyping using Flow Cytometry
- Grant number: 505116-2017
- Fiscal year: 2019
- Funding amount: $500,100
- Project category: Postgraduate Scholarships - Doctoral
Bioinformatics and big data analytics
- Grant number: CRC-2016-00137
- Fiscal year: 2019
- Funding amount: $500,100
- Project category: Canada Research Chairs
Bioinformatics and big data analytics
- Grant number: CRC-2016-00137
- Fiscal year: 2018
- Funding amount: $500,100
- Project category: Canada Research Chairs
Mining and Processing Big Data in Bioinformatics: Mouse Phenotyping using Flow Cytometry
- Grant number: 505116-2017
- Fiscal year: 2018
- Funding amount: $500,100
- Project category: Postgraduate Scholarships - Doctoral
Bioinformatics and big data analytics
- Grant number: CRC-2016-00137
- Fiscal year: 2017
- Funding amount: $500,100
- Project category: Canada Research Chairs