Entity augmentation and data cleaning for machine learning
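Basic information
- Grant number: 508081-2016
- Principal investigator:
- Amount: $43,700
- Host institution:
- Host institution country: Canada
- Project category: Collaborative Research and Development Grants
- Fiscal year: 2018
- Funding country: Canada
- Duration: 2018-01-01 to 2019-12-31
- Project status: Completed
- Source:
- Keywords:
Project abstract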
With the rise of Big Data, companies and organizations are increasingly eager to use machine learning to extract value from their data and to enable data-driven decision making. However, machine learning often assumes that data has been well prepared, and focuses mainly on learning from the data and making predictions. In reality, data often comes from multiple sources, so a lot of time is spent on data integration; real-world data is also often dirty, and data cleaning is an extremely time-consuming and expensive process. According to interviews with data scientists, they can spend 80% of their time on data preparation. This problem will be further exacerbated in emerging Big Data scenarios, where data volumes are increasing or data comes from a larger variety of sources.

To this end, this project studies how to reduce the cost of data preparation for machine learning. We will focus on two challenging research topics: (1) "Entity augmentation" studies how to efficiently augment entities (e.g., restaurants, persons) with new attributes (e.g., location, occupation) from external data sources. (2) "Data cleaning for machine learning" studies how to reduce the cost by cleaning only the data that are most beneficial to predictions. This project benefits the Canadian economy in multiple ways. First, more and more companies in Canada rely on machine learning to make critical business decisions (e.g., churn prediction, fraud detection). The techniques developed in this project can reduce the time they spend preparing data for machine learning, helping them to improve prediction accuracy and grow revenue. Second, the outcome of the project will further boost the development of data science technologies, democratize machine learning for small companies, and help create more data-science-related jobs in Canada.
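To make the notion of entity augmentation in the abstract concrete, here is a minimal Python sketch. It is not the project's method: it assumes the simplest possible setting, where entities and the external source share a name-like key and can be matched after light normalization. The function names (`normalize`, `augment_entities`) and the sample records are hypothetical, chosen only for illustration.

```python
from typing import Any, Dict, Iterable, List


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so near-identical keys match."""
    return " ".join(value.lower().split())


def augment_entities(
    entities: Iterable[Dict[str, Any]],
    external_rows: Iterable[Dict[str, Any]],
    key: str,
    new_attributes: List[str],
) -> List[Dict[str, Any]]:
    """Return copies of `entities` extended with `new_attributes`.

    Assumption: an exact match on the normalized `key` identifies the
    same real-world entity. Unmatched entities get None for each
    new attribute.
    """
    # Index the external source once so each lookup is constant time.
    index = {normalize(row[key]): row for row in external_rows if key in row}

    augmented = []
    for entity in entities:
        match = index.get(normalize(entity[key]))
        enriched = dict(entity)  # copy, so the input is left untouched
        for attribute in new_attributes:
            enriched[attribute] = match.get(attribute) if match else None
        augmented.append(enriched)
    return augmented


# Hypothetical sample data: restaurants augmented with a location.
restaurants = [{"name": "Joe's  Diner"}, {"name": "Sushi Bar"}]
external_source = [{"name": "joe's diner", "location": "Vancouver"}]

for row in augment_entities(restaurants, external_source, "name", ["location"]):
    print(row)
```

Matching on a normalized key is the cheapest option; real external sources usually need fuzzy or learned matching, which is where most of the cost the abstract refers to tends to arise.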
Project outputs
Journal papers (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Wang, Jiannan
Analysis of factors influencing overheating failure of molten salt receivers in solar tower thermal power plants
- DOI:
- Published:
- Journal:
- Impact factor: 0
- Authors: Wang, Jiannan; Li, Xin; Chang, Chun
- Corresponding author: Chang, Chun
Optimization of Material for Key Components and Parameters of Peanut Sheller Based on Hertz Theory and Box-Behnken Design
- DOI: 10.3390/agriculture12020146
- Published: 2022-02-01
- Journal:
- Impact factor: 3.6
- Authors: Wang, Jiannan; Xie, Huanxiong; Ma, Chenbin
- Corresponding author: Ma, Chenbin
Motility and function of smooth muscle cells in a silk small-caliber tubular scaffold after replacement of rabbit common carotid artery
- DOI: 10.1016/j.msec.2020.110977
- Published: 2020-09-01
- Journal:
- Impact factor: 7.9
- Authors: Li, Helei; Song, Guangzhou; Wang, Jiannan
- Corresponding author: Wang, Jiannan
Steady-State Behavior and Endothelialization of a Silk-Based Small-Caliber Scaffold In Vivo Transplantation
- DOI: 10.3390/polym11081303
- Published: 2019-08-01
- Journal:
- Impact factor: 5
- Authors: Li, Helei; Wang, Yining; Wang, Jiannan
- Corresponding author: Wang, Jiannan
Cytocompatibility of a silk fibroin tubular scaffold
- DOI: 10.1016/j.msec.2013.09.039
- Published: 2014-01-01
- Journal:
- Impact factor: 7.9
- Authors: Wang, Jiannan; Wei, Yali; Zhao, Huanrong
- Corresponding author: Zhao, Huanrong
Other grants held by Wang, Jiannan
DataPrep: Human-in-the-Loop Data Preparation
- Grant number: RGPIN-2021-03995
- Fiscal year: 2022
- Funding amount: $43,700
- Project category: Discovery Grants Program - Individual
DataPrep: Human-in-the-Loop Data Preparation
- Grant number: RGPIN-2021-03995
- Fiscal year: 2021
- Funding amount: $43,700
- Project category: Discovery Grants Program - Individual
Crowdsourced Data Cleaning
- Grant number: RGPIN-2016-05555
- Fiscal year: 2020
- Funding amount: $43,700
- Project category: Discovery Grants Program - Individual
Entity augmentation and data cleaning for machine learning
- Grant number: 508081-2016
- Fiscal year: 2019
- Funding amount: $43,700
- Project category: Collaborative Research and Development Grants
Crowdsourced Data Cleaning
- Grant number: RGPIN-2016-05555
- Fiscal year: 2019
- Funding amount: $43,700
- Project category: Discovery Grants Program - Individual
Crowdsourced Data Cleaning
- Grant number: RGPIN-2016-05555
- Fiscal year: 2018
- Funding amount: $43,700
- Project category: Discovery Grants Program - Individual
Crowdsourced Data Cleaning
- Grant number: RGPIN-2016-05555
- Fiscal year: 2017
- Funding amount: $43,700
- Project category: Discovery Grants Program - Individual
Approximate Query Processing over Secure Key/Value Stores
- Grant number: 517430-2017
- Fiscal year: 2017
- Funding amount: $43,700
- Project category: Engage Grants Program
Entity augmentation and data cleaning for machine learning
- Grant number: 508081-2016
- Fiscal year: 2017
- Funding amount: $43,700
- Project category: Collaborative Research and Development Grants
A unified access server for SQL-on-Hadoop systems
- Grant number: 501015-2016
- Fiscal year: 2016
- Funding amount: $43,700
- Project category: Engage Grants Program
Similar overseas grants
All for data, data for all: Improving accessibility of healthcare data through a co-designed augmentation of an existing online rehabilitation application.
- Grant number: 10054277
- Fiscal year: 2023
- Funding amount: $43,700
- Project category: Grant for R&D
Genome Editing Therapy for Usher Syndrome Type 3
- Grant number: 10759804
- Fiscal year: 2023
- Funding amount: $43,700
- Project category:
Prime editing for Crumbs homologue 1 (CRB1) Inherited Retinal Dystrophies
- Grant number: 10636325
- Fiscal year: 2023
- Funding amount: $43,700
- Project category:
Identifying mechanistic pathways underlying RPE pathogenesis in models of pattern dystrophy
- Grant number: 10636678
- Fiscal year: 2023
- Funding amount: $43,700
- Project category:
Precision genome editing in vivo to treat retinal diseases
- Grant number: 10565189
- Fiscal year: 2023
- Funding amount: $43,700
- Project category:
Novel Implementation of Microporous Annealed Particle HydroGel for Next-generation Posterior Pharyngeal Wall Augmentation
- Grant number: 10727361
- Fiscal year: 2023
- Funding amount: $43,700
- Project category:
Biomechanical Treatment of CTS Via Carpal Arch Space Augmentation: A Pilot Clinical Trial
- Grant number: 10725257
- Fiscal year: 2023
- Funding amount: $43,700
- Project category:
Expressive data augmentation in deep learning
- Grant number: RGPIN-2022-04651
- Fiscal year: 2022
- Funding amount: $43,700
- Project category: Discovery Grants Program - Individual
Bayesian Data Augmentation for Recurrent Events in Electronic Medical Records of Patients with Cancer
- Grant number: 10436083
- Fiscal year: 2022
- Funding amount: $43,700
- Project category:
Using data augmentation, active learning, and visual analytics for learning with limited examples on mobility data sets
- Grant number: DGECR-2022-00386
- Fiscal year: 2022
- Funding amount: $43,700
- Project category: Discovery Launch Supplement