
Balancing Disclosure Risk with Inferential Power: Software for Intervalized Data


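Basic Information

Grant number:
8517848
Principal investigator:
SCOTT D FERSON
Amount:
$236,100
Host institution:
APPLIED BIOMATHEMATICS, INC.
Host institution country:
United States
Project category:
Fiscal year:
2012
Funding country:
United States
Project period:
2012-08-01 to 2014-05-31
Project status:
Completed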

Project Abstract

DESCRIPTION (provided by applicant): Patient data collected during health care delivery and public health surveys contain a great deal of information that could be used in biomedical and epidemiological research. Access to these data, however, is usually limited because of the private nature of most personal health records. Before these data can be used to improve public health, methods are needed that balance the informativeness of data for research against the information loss required to minimize disclosure risk. Current methods focus primarily on protecting privacy, but protecting privacy alone is inadequate. In statistical disclosure control techniques, information truthfulness is not well preserved, so unreliable results may be released. In generalization-based anonymization approaches, attribute generalization causes information loss, and existing techniques do not provide sufficient control for maintaining data utility. What is needed are methods that protect both the privacy of individuals represented in the data and the integrity of the relationships studied by researchers.

The problem is that there is an inherent tradeoff between protecting the privacy of individuals and protecting the informativeness of the data set. Protecting the privacy of individuals always results in a loss of information, and it is the information contained in the data set that affects the power of a statistical test. For a given anonymization strategy, however, there are often multiple ways of masking the data that meet the disclosure risk criteria provided. This can be exploited to choose the solution that best preserves statistical information while still meeting those criteria.

This project will develop the first integrated software system that provides solutions for problems faced in all three stages in the release of sensitive health care data:
1. anonymize a data set by intervalizing/generalizing data to satisfy currently available anonymization strategies,
2. provide sufficient controls within anonymization procedures to satisfy constraints on statistical usefulness of the data, and
3. compute statistical tests for the anonymized data intervals.

There are two main challenges facing this effort. The first is that, based on existing research results, integrating our proposed new control processes into anonymization procedures is expected to be computationally difficult. We will overcome this challenge by developing efficient and practically useful greedy algorithms, approximation algorithms, or algorithms that work for realistic situations (if not for general cases). The other primary challenge is that statistical calculations with interval data sets are known to be computationally difficult, and these calculations are necessary both for control processes within anonymization procedures and for subsequent statistical computation and tests. We will overcome this challenge with efficient algorithms that exploit the structure present in data sets intervalized for privacy. The software will be tested on medical data sets of various sizes and structures to demonstrate the feasibility of the approach and to characterize the scalability of the algorithms with data set size.
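
To make the idea of intervalized data concrete, the following is a minimal Python sketch of stages 1 and 3 for a single numeric attribute. It is an illustration under stated assumptions, not the project's software. The function names `intervalize` and `mean_bounds`, the group-size parameter `k`, and the sample ages are all hypothetical. The minimum-group-size criterion (in the style of k-anonymity) is assumed here only for illustration, since the abstract does not name a specific anonymization strategy.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def intervalize(values: List[float], k: int) -> List[Interval]:
    """Greedy sketch: replace each value by the [min, max] range of a
    group of at least k records (assumed disclosure criterion).

    Records are sorted, cut into consecutive groups of k, and any
    remainder smaller than k is folded into the last group.
    """
    n = len(values)
    if k < 1 or n < k:
        raise ValueError("need 1 <= k <= number of records")

    # Indices of the records ordered by value, so results can be
    # written back in the original record order.
    order = sorted(range(n), key=lambda i: values[i])
    result: List[Interval] = [(0.0, 0.0)] * n

    start = 0
    while start < n:
        end = start + k
        if n - end < k:  # too few records left to form another group
            end = n
        group = order[start:end]
        lo, hi = values[group[0]], values[group[-1]]
        for i in group:
            result[i] = (lo, hi)
        start = end
    return result


def mean_bounds(intervals: List[Interval]) -> Interval:
    """Tightest bounds on the sample mean of interval-valued data.

    The mean is monotone in every observation, so its extremes are
    reached at all lower endpoints and at all upper endpoints.
    """
    n = len(intervals)
    lower = sum(lo for lo, _ in intervals) / n
    upper = sum(hi for _, hi in intervals) / n
    return lower, upper


if __name__ == "__main__":
    ages = [34, 29, 41, 52, 47, 38, 30]  # hypothetical sample data
    masked = intervalize(ages, k=3)
    print(masked)
    print(mean_bounds(masked))
```

The mean is the easy case because it is monotone in each observation. Statistics that are not monotone in the data, such as the variance, are generally harder to bound over intervals, which is the kind of computational difficulty the abstract refers to. The sketch also ignores stage 2: it does not try to choose, among the groupings that satisfy `k`, the one that best preserves statistical information.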
