Bias and Representativeness in Linked Data
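Basic Information
- Grant number: RGPIN-2020-05948
- Principal investigator:
- Amount: $17,500
- Host institution:
- Host institution country: Canada
- Program: Discovery Grants Program - Individual
- Fiscal year: 2021
- Funding country: Canada
- Period: 2021-01-01 to 2022-12-31
- Status: Completed
- Source:
- Keywords:
Project Abstract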
Companies and organizations in many domains collect and generate data at increasingly fast rates. Data describing a single entity (e.g., a person or product) can be collected by different organizations or by different units within the same organization. Not only are the data collected by different organizations, but they may also be collected in different ways and at different granularity (e.g., one company may record a customer's city, while another records only the province). It is now well accepted that integrating data from different sources enriches the knowledge about entities of interest, making the data more valuable for analysis and prediction. Although data integration is a challenging problem, recent technological advances have made it possible to create large data lakes where data from multiple sources are unified. However, we are still far from integrating data fully (i.e., finding all the entities across the different data sources). This is because unique identifiers shared across data collections do not exist, so one must use characteristics common to all of the databases and compare their values to determine similarity. Further challenges arise from differing database schemas, typographical errors, and missing data. What is the impact of false positives (mismatches) and false negatives (missed matches) when data integrated from multiple sources are used for analysis or as training data for artificial intelligence-based methods? The main objective of the proposed research program is to investigate, understand, and mitigate the bias created through data integration. Understanding the bias, representativeness, and quality of data is critical, especially when data are used in circumstances that could affect society at large (e.g., healthcare, policy making).
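To make the linkage problem concrete, the following is a minimal sketch of similarity-based matching when no shared identifier exists. It is an illustration only, not the method used in the funded project: the field names, the sample records, the averaging rule, and the 0.85 threshold are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_score(r1: dict, r2: dict, fields: list[str]) -> float:
    """Average similarity over the shared fields.

    A field missing or empty in either record contributes 0 (an assumption;
    other policies, such as skipping the field, are equally plausible).
    """
    scores = []
    for f in fields:
        v1, v2 = r1.get(f), r2.get(f)
        scores.append(similarity(v1, v2) if v1 and v2 else 0.0)
    return sum(scores) / len(scores)


def link(source_a, source_b, fields, threshold=0.85):
    """Return index pairs (i, j) whose score reaches the threshold."""
    return {
        (i, j)
        for i, ra in enumerate(source_a)
        for j, rb in enumerate(source_b)
        if record_score(ra, rb, fields) >= threshold
    }


# Hypothetical records: a typo in one name, a missing city in another.
source_a = [
    {"name": "Jon Smith", "city": "Guelph"},
    {"name": "Mary Brown", "city": "Toronto"},
]
source_b = [
    {"name": "John Smith", "city": "Guelph"},
    {"name": "Marie Browne", "city": ""},
]
true_links = {(0, 0), (1, 1)}  # assumed ground truth for the example

predicted = link(source_a, source_b, ["name", "city"])
false_positives = predicted - true_links   # mismatches
false_negatives = true_links - predicted   # missed matches
print(predicted, false_positives, false_negatives)
```

In this toy case the typographical error is tolerated, but the record with a missing city is never linked, so it silently drops out of the integrated data. Errors of this kind are what can make a linked data set unrepresentative.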
To achieve the goal of the proposed program, the research plan is divided into a set of short-term goals and a long-term goal. The short-term goals are as follows: (1) investigating the state-of-the-art systems currently employed in data integration and record linkage; (2) developing a comparison framework and proposing new methods to investigate representativeness and bias in data generated by unifying multiple data sources; and (3) testing and evaluating the techniques and comparison methods in diverse applications (e.g., healthcare, retail) with different types of entities (e.g., persons, products). The long-term goal is to develop new methods and technologies that mitigate the bias and improve the quality and value of the data generated through data integration techniques.
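One simple way to examine representativeness, sketched here under stated assumptions, is to compare each subgroup's share in the linked data with its share in the full source population. The abstract does not specify the comparison framework, so the function names, the `region` attribute, and the sample counts below are hypothetical.

```python
from collections import Counter


def group_shares(records, key):
    """Fraction of records falling in each value of `key`."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def representativeness_gap(population, linked, key):
    """Linked-sample share minus population share, per population group.

    Positive values mean the group is over-represented after linkage;
    negative values mean it is under-represented.
    """
    pop = group_shares(population, key)
    lnk = group_shares(linked, key)
    return {group: lnk.get(group, 0.0) - share for group, share in pop.items()}


# Hypothetical data: the population is half urban and half rural,
# but linkage succeeded mostly for urban records.
population = [{"region": "urban"}] * 5 + [{"region": "rural"}] * 5
linked = [{"region": "urban"}] * 4 + [{"region": "rural"}] * 1

gap = representativeness_gap(population, linked, "region")
print(gap)
```

Here urban records are over-represented and rural records under-represented by the same margin, which is the kind of imbalance an analysis built on the linked data would inherit.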
Project Outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Antonie, Luiza
Selection Bias Encountered in the Systematic Linking of Historical Census Records
- DOI: 10.1017/ssh.2020.15
- Published: 2020-01-01
- Journal:
- Impact factor: 0.8
- Authors: Antonie, Luiza; Inwood, Kris; Summerfield, Fraser
- Corresponding author: Summerfield, Fraser
Tracking people over time in 19th century Canada for longitudinal analysis
- DOI: 10.1007/s10994-013-5421-0
- Published: 2014-04-01
- Journal:
- Impact factor: 7.5
- Authors: Antonie, Luiza; Inwood, Kris; Ross, J. Andrew
- Corresponding author: Ross, J. Andrew
Full-Time and Part-Time Work and the Gender Wage Gap
- DOI: 10.1007/s11293-020-09677-z
- Published: 2020-08-13
- Journal:
- Impact factor: 0.6
- Authors: Antonie, Luiza; Gatto, Laura; Plesca, Miana
- Corresponding author: Plesca, Miana
Other grants held by Antonie, Luiza
Bias and Representativeness in Linked Data
- Grant number: RGPIN-2020-05948
- Fiscal year: 2022
- Amount: $17,500
- Program: Discovery Grants Program - Individual
Bias and Representativeness in Linked Data
- Grant number: RGPIN-2020-05948
- Fiscal year: 2020
- Amount: $17,500
- Program: Discovery Grants Program - Individual
Data unification for customer profile generation
- Grant number: 543346-2019
- Fiscal year: 2019
- Amount: $17,500
- Program: Engage Grants Program
Record Linkage Across Heterogeneous Data Sources
- Grant number: RGPIN-2014-05304
- Fiscal year: 2019
- Amount: $17,500
- Program: Discovery Grants Program - Individual
Record Linkage Across Heterogeneous Data Sources
- Grant number: RGPIN-2014-05304
- Fiscal year: 2018
- Amount: $17,500
- Program: Discovery Grants Program - Individual
Record Linkage Across Heterogeneous Data Sources
- Grant number: RGPIN-2014-05304
- Fiscal year: 2017
- Amount: $17,500
- Program: Discovery Grants Program - Individual
Record Linkage Across Heterogeneous Data Sources
- Grant number: RGPIN-2014-05304
- Fiscal year: 2016
- Amount: $17,500
- Program: Discovery Grants Program - Individual
Record Linkage Across Heterogeneous Data Sources
- Grant number: RGPIN-2014-05304
- Fiscal year: 2015
- Amount: $17,500
- Program: Discovery Grants Program - Individual
Record Linkage Across Heterogeneous Data Sources
- Grant number: RGPIN-2014-05304
- Fiscal year: 2014
- Amount: $17,500
- Program: Discovery Grants Program - Individual
Similar overseas grants
Bias and Representativeness in Linked Data
- Grant number: RGPIN-2020-05948
- Fiscal year: 2022
- Amount: $17,500
- Program: Discovery Grants Program - Individual
Optimizing the Population Representativeness of Older Adults in Cancer Trials
- Grant number: 10180066
- Fiscal year: 2021
- Amount: $17,500
- Program:
Representativeness of mobile phone location data for accurate measurement of population movement
- Grant number: 2573226
- Fiscal year: 2021
- Amount: $17,500
- Program: Studentship
Improving representativeness in non-probability surveys and causal inference with regularized regression and post-stratification
- Grant number: 10219956
- Fiscal year: 2020
- Amount: $17,500
- Program:
Improving the representativeness of American Indian Tribal Behavioral Risk Factor Surveillance System (TBRFSS) by machine learning and propensity score based data integration approach A1
- Grant number: 10063407
- Fiscal year: 2020
- Amount: $17,500
- Program:
Improving representativeness in non-probability surveys and causal inference with regularized regression and post-stratification
- Grant number: 10400107
- Fiscal year: 2020
- Amount: $17,500
- Program:
Broad Appeal Strategies, Perceptual Disagreements and the Representativeness of Party Democracy in Europe.
- Grant number: 2435022
- Fiscal year: 2020
- Amount: $17,500
- Program: Studentship
Bias and Representativeness in Linked Data
- Grant number: RGPIN-2020-05948
- Fiscal year: 2020
- Amount: $17,500
- Program: Discovery Grants Program - Individual
Optimizing the Population Representativeness of Older Adults in Alzheimer's Disease and Related Dementia Clinical Trials
- Grant number: 10220844
- Fiscal year: 2020
- Amount: $17,500
- Program:
Optimizing the Population Representativeness of Older Adults in Alzheimer's Disease and Related Dementia Clinical Trials
- Grant number: 10041303
- Fiscal year: 2020
- Amount: $17,500
- Program: