Privacy-protecting distributed analysis of biomedical big data

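Basic information

Grant number:
9159815
Principal investigator:
Darren Toh
Amount:
Approx. $500,100
Host institution:
HARVARD PILGRIM HEALTH CARE, INC.
Host institution country:
United States
Project category:
Fiscal year:
2016
Funding country:
United States
Project period:
2016-09-30 to 2019-06-30
Project status:
Completed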

Source:
https://reporter.nih.gov/project-details/9159815
Keywords:
Agreement; Big Data; Bioinformatics; Biomedical Research; Clinical Research; Code; Complex; Computer software; Confidentiality of Patient Information; Data; Data Analyses; Data Protection; Data Science; Data Set; Data Sources; Databases; Development; Distant; Electronic Health Record; Environment; Funding; Health; Healthcare Systems; Housing; Individual; Insurance; Linear Regressions; Link; Logistic Regressions; Methods; Multicenter Studies; Patients; Performance; Privacy; Process; Programming Languages; Public Health; Registries; Regression Analysis; Research; Research Personnel; Security; Sentinel; Site; Software Tools; Source; Statistical Computing; Statistical Data Interpretation; Statistical Models; System; Technology; Testing; United States National Institutes of Health; Work; base; big biomedical data; collaboratory; data sharing; data structure; design; distributed data; experience; handheld mobile device; improved; multidisciplinary; open data; open source; patient oriented; precision medicine; programs; real world application; social media; statistics; systems research; tool

Project abstract

Advances in technology, bioinformatics, and data science have made it possible to analyze large and complex databases to generate evidence that improves public health and accelerates the development of precision medicine. However, the advent of big data has also raised concerns about privacy and confidentiality. This application focuses on data privacy in vertically partitioned data, a data environment in which information about an individual is spread across two or more data sources. This type of data structure is common in biomedical research and is expected to grow exponentially as information from the same individual is increasingly collected in multiple sources, such as insurance claims databases, electronic health records, registries, social media, wearables, and mobile devices. Combining multiple databases provides a more complete health profile of the patient and generates more robust evidence. However, concerns about data privacy, confidentiality, and security, together with constraints in governance and institutional agreements, make it highly challenging or sometimes impossible to physically pool different data sources. We propose to develop an open-source, freely available software tool that will employ a cutting-edge method, distributed regression, to analyze vertically partitioned datasets. The method does not require data to be combined physically, but produces results statistically equivalent to those that would be obtained if the datasets were linked and pooled centrally at one site. Instead of sharing patient-level information, participating sites will transfer only a non-identifiable information matrix (a design matrix used in fitting statistical models) and other summary-level statistics needed in the statistical modeling process. This approach offers much greater protection for data privacy while still allowing sophisticated statistical analysis.
The software tool will be developed, tested, and fine-tuned using both simulated datasets and real-world data from Optum Labs, which houses one of the largest vertically partitioned datasets in the U.S., with claims and electronic health record data from over 5 million patients. The tool will be made compatible with PopMedNet™, an open-source data-sharing platform currently used by several large national initiatives such as the NIH Health Care Systems Research Collaboratory Distributed Research Network, the PCORI-funded National Patient-Centered Clinical Research Network (PCORnet), and the FDA-funded Sentinel program. The tool is therefore highly scalable and can have an immediate impact on real-world big data analysis. The multidisciplinary study team includes researchers who pioneered some of the distributed regression approaches and experts who have extensive experience in multi-center studies. The distributed regression method has great potential to shift the paradigm of multi-center big biomedical research from transferring potentially identifiable patient-level data to sharing non-identifiable summary-level information. The proposed software tool will be a major step towards real-world application of this state-of-the-art privacy-protecting analytic approach.
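
The abstract's central claim, that summary-level matrices can reproduce the pooled regression fit without exchanging patient-level rows, can be illustrated for ordinary least squares. The following is a minimal sketch, not the project's actual algorithm or software. It assumes two sites hold different covariate columns for the same patients in the same row order. All function names (`matmul_t`, `solve`, `distributed_ols`, `pooled_ols`) are hypothetical. The cross-site products are computed in the clear here purely as a stand-in; a real deployment would obtain them through a privacy-preserving protocol, which this sketch does not implement.

```python
import random


def matmul_t(a, b):
    """Return a^T b for row-major matrices a (n x p) and b (n x q)."""
    p, q = len(a[0]), len(b[0])
    out = [[0.0] * q for _ in range(p)]
    for row_a, row_b in zip(a, b):
        for i in range(p):
            for j in range(q):
                out[i][j] += row_a[i] * row_b[j]
    return out


def solve(m, v):
    """Solve m x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [list(m[i]) + [v[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def distributed_ols(x_a, x_b, y):
    """OLS coefficients assembled from block-wise summary matrices only."""
    y_col = [[v] for v in y]
    g_aa = matmul_t(x_a, x_a)  # computable locally at site A
    g_bb = matmul_t(x_b, x_b)  # computable locally at site B
    # ASSUMPTION: stand-in for a secure cross-product protocol.
    # Computing X_A^T X_B directly, as done here, is NOT privacy-preserving.
    g_ab = matmul_t(x_a, x_b)
    g_ba = [list(col) for col in zip(*g_ab)]  # transpose of the cross block
    # The outcome lives at one site, so one of these two products is also
    # a cross-site quantity in practice.
    xty = [r[0] for r in matmul_t(x_a, y_col)]
    xty += [r[0] for r in matmul_t(x_b, y_col)]
    # Stack the blocks into the full Gram matrix X^T X.
    gram = [ra + rab for ra, rab in zip(g_aa, g_ab)]
    gram += [rba + rb for rba, rb in zip(g_ba, g_bb)]
    return solve(gram, xty)


def pooled_ols(x_a, x_b, y):
    """Reference fit on physically pooled data, for comparison only."""
    x = [ra + rb for ra, rb in zip(x_a, x_b)]
    y_col = [[v] for v in y]
    return solve(matmul_t(x, x), [r[0] for r in matmul_t(x, y_col)])


if __name__ == "__main__":
    random.seed(0)
    n = 200
    # Simulated data: site A holds an intercept and one covariate,
    # site B holds two further covariates for the same patients.
    x_a = [[1.0, random.gauss(50, 10)] for _ in range(n)]
    x_b = [[random.gauss(0, 1), random.gauss(5, 2)] for _ in range(n)]
    y = [0.5 + 0.02 * a[1] + 1.5 * b[0] - 0.3 * b[1] + random.gauss(0, 1)
         for a, b in zip(x_a, x_b)]
    print("distributed:", distributed_ols(x_a, x_b, y))
    print("pooled:     ", pooled_ols(x_a, x_b, y))
```

In this sketch the two fits agree because both solve the same normal equations; only the way the Gram matrix is assembled differs. Logistic regression, which the project keywords also list, would need an iterative fit with summary statistics exchanged at each step and is not covered here.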