
A privacy-preserving socio-demographic enrichment framework for big data and its empirical application


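Basic information

Grant number:
ES/W005352/1
Principal investigator:
Yuanying Zhao
Amount:
$151,300
Host institution:
Imperial College London
Host institution country:
United Kingdom
Project category:
Fellowship
Fiscal year:
2021
Funding country:
United Kingdom
Start and end dates:
2021 to (no data)
Project status:
Completed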

Source:
https://gtr.ukri.org/projects?ref=ES%2FW005352%2F1
Keywords:
privacy preserving socio demographic enrichment

Project abstract

Recent decades have seen substantial growth in the awareness of, and demand for, privacy preservation, both in legislation and among the public. This can be partly attributed to the proliferation of information and communications technologies, which generate large amounts of data (i.e. big data) when used. In parallel, the growing availability of big data has created opportunities for their application, drawing upon the insights they can provide into people's daily behaviour patterns. However, the applicability of anonymous big data has been limited in behaviour-based analysis because of the essential role of socio-demographic information as exogenous determinants of human behaviour. Therefore, a plethora of studies have emerged to predict the absent socio-demographic attributes of respondents in various big data sources, termed the socio-demographic enrichment of big data.
Existing socio-demographic enrichment approaches use either performance-based data mining and machine learning methods or statistical-fit-oriented models, which typically lack theoretical underpinnings that can explain or justify the postulated relationship between respondents' behaviour patterns (input features) and their socio-demographic attributes (output of the enrichment). A theoretical underpinning is, however, crucial because microeconomic consumer theory suggests that people's behaviour is driven by their socio-demographic attributes. One immediate consequence of neglecting the underpinning microeconomic and/or sociological behavioural theories is that existing methods can neither predict the quality of enrichment nor interpret changes in their performance due to variation in data distributions. This motivates my PhD research, in which I propose and formalise a new enrichment framework, called the Inverse Discrete Choice Modelling (IDCM) framework.
The IDCM socio-demographic enrichment framework makes it possible to understand quantitatively the trade-offs between enrichment accuracy and privacy preservation. Specifically, the IDCM approach performs statistical inversion of a discrete choice model (DCM), which is a well-established modelling technique relying on explicit behavioural assumptions grounded in social science, including microeconomics, sociology and psychology. The IDCM performance theory is established to estimate the IDCM enrichment performance based on known information about the data distribution in the enriched sample. This is enabled by drawing an analogy between human behaviour and information theory, i.e. an observed individual is treated as a 'message' transmitted over an information communication channel, which allows the use of several powerful information-theoretic concepts to link mathematically how well we can predict who a person is with his/her privacy.
So far, the IDCM performance theory has been developed for socio-demographic enrichment based on observation of a single, binary behaviour feature. To improve the empirical enrichment performance, the aim of the proposed research project is to extend the current IDCM approach by including multiple behaviour patterns as the input features. This can be achieved by using several DCMs that respectively capture the relationship between each behaviour feature and the enriched attribute, and then finding the value of the socio-demographic attribute that is most likely to result in the joint behaviour patterns. The proposed extension of the IDCM approach involves the incorporation of machine learning or deep learning algorithms, applied to extract meaningful behaviour patterns from raw big data, which can then be employed as the input features for the subsequent IDCM enrichment.
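The idea of inverting several DCMs to find the most likely attribute value can be illustrated with a minimal sketch. It is not the project's implementation: it assumes binary logit models, a known prior over attribute values, and conditional independence of the behaviour features given the attribute, none of which the abstract specifies. The function names, attribute labels and utility values are hypothetical.

```python
import math


def logit_choice_prob(utility):
    """Binary logit: probability of choosing the alternative, given its utility."""
    return 1.0 / (1.0 + math.exp(-utility))


def invert_dcms(observed, utilities, prior):
    """Posterior over attribute values given observed binary behaviours.

    observed:  list of 0/1 choices, one per behaviour feature
    utilities: utilities[k][a] is the utility of choosing 1 on feature k
               for an individual with attribute value a (assumed already estimated)
    prior:     dict mapping attribute value -> prior probability
    """
    scores = {}
    for a, p_a in prior.items():
        likelihood = 1.0
        # Assumption: features are independent given the attribute,
        # so the joint likelihood is a product over the per-feature DCMs.
        for k, y in enumerate(observed):
            p1 = logit_choice_prob(utilities[k][a])
            likelihood *= p1 if y == 1 else 1.0 - p1
        scores[a] = p_a * likelihood
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()}


def enrich(observed, utilities, prior):
    """Return the attribute value most likely to produce the observed behaviours."""
    posterior = invert_dcms(observed, utilities, prior)
    return max(posterior, key=posterior.get)


# Hypothetical example: two behaviour features, two attribute values.
prior = {"low_income": 0.5, "high_income": 0.5}
utilities = [
    {"low_income": -1.0, "high_income": 1.0},
    {"low_income": 0.5, "high_income": -0.5},
]
posterior = invert_dcms([1, 0], utilities, prior)
predicted = enrich([1, 0], utilities, prior)
```

In this toy case both observed behaviours are more probable under `high_income`, so that value receives the larger posterior probability. The posterior itself is also informative, since it shows how confidently the attribute can be inferred.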
Correspondingly, the accompanying IDCM performance theory will be extended to accommodate the estimation of enrichment performance based on multiple behaviour features, so as to retain the transferability of the proposed extension of the IDCM methodology.
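The channel analogy behind the performance theory can be made concrete with a small sketch, offered only as an illustration and not as the IDCM performance theory itself. It assumes a single binary behaviour, treats the choice probabilities P(Y=1 | A=a) as the channel, and computes two standard quantities: the mutual information between attribute and behaviour, and the best achievable accuracy when guessing the attribute from the behaviour. All names and numbers are hypothetical.

```python
import math


def mutual_information(prior, p_choose):
    """I(A;Y) in bits for attribute A and a binary behaviour Y.

    prior:    dict mapping attribute value a -> P(A=a)
    p_choose: dict mapping attribute value a -> P(Y=1 | A=a)
    """
    p_y1 = sum(prior[a] * p_choose[a] for a in prior)
    mi = 0.0
    for a in prior:
        for y, p_y in ((1, p_y1), (0, 1.0 - p_y1)):
            p_y_given_a = p_choose[a] if y == 1 else 1.0 - p_choose[a]
            joint = prior[a] * p_y_given_a
            if joint > 0.0:  # zero-probability cells contribute nothing
                mi += joint * math.log2(p_y_given_a / p_y)
    return mi


def bayes_accuracy(prior, p_choose):
    """Highest achievable accuracy when guessing A from Y (MAP decision)."""
    acc = 0.0
    for y in (0, 1):
        acc += max(
            prior[a] * (p_choose[a] if y == 1 else 1.0 - p_choose[a])
            for a in prior
        )
    return acc


prior = {"a": 0.5, "b": 0.5}
# Behaviour that does not depend on the attribute: nothing can be inferred.
uninformative = {"a": 0.3, "b": 0.3}
# Behaviour fully determined by the attribute: the attribute is fully revealed.
revealing = {"a": 1.0, "b": 0.0}

mi_none = mutual_information(prior, uninformative)
mi_full = mutual_information(prior, revealing)
```

The two extremes show the trade-off described in the abstract: when the behaviour carries no information about the attribute, enrichment is no better than guessing from the prior and privacy is preserved, whereas a fully revealing behaviour allows perfect enrichment and no privacy.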
