Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
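Basic Information
- Grant number: RGPIN-2019-04068
- Principal investigator:
- Amount: $29,900
- Host institution:
- Host institution country: Canada
- Project category: Discovery Grants Program - Individual
- Fiscal year: 2019
- Funding country: Canada
- Duration: 2019-01-01 to 2020-12-31
- Project status: Completed
- Source:
- Keywords:
Project Abstract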
Enterprises in all verticals (e.g., healthcare, financial services, manufacturing, and insurance) have been aggressively collecting data from a variety of sources, including customers, transactions, sensors, and social data, to build the ultimate data asset. The hope is that, with appropriate analysis techniques, this data can provide insights, directions, and findings that increase customer satisfaction, achieve higher profit margins, or even inspire new lines of business or enable new discoveries. Unfortunately, what prevents this vision from becoming a pervasive reality is the data itself: dirty and siloed data is the norm rather than the exception. Consequently, data curation, cleaning, and integration become key enablers of the big promise of effective data science. An article in the New York Times (August 2014) indicated that, for data scientists, "cleaning" is a key hurdle to insights. Large-scale data cleaning to enable data science is the main goal of this proposal.
Data cleaning is often described as a set of activities that includes finding and fixing anomalies and outliers, imputing missing values, and deduplicating records that represent the same entity. The main objective is to prepare data to be mined and analyzed by a variety of tools to produce high-quality aggregates and insights. The task of curating and integrating large amounts of data presents real theoretical and engineering challenges. Most current proposals suffer from fundamental problems that hinder their deployment in practical industry and business settings.
I propose to conduct fundamental research in data quality leading to solutions (new technologies, methods, and algorithms) that can be deployed in real environments. The main objective is to enable quality-aware analytics on, and retrieval from, large-scale inconsistent and dirty data sources, unleashing the potential of data science. Some of the fundamental challenges in achieving this objective, which we intend to investigate, include: (1) developing efficient profiling and repair solutions that scale to large data sets; (2) addressing the privacy concerns around sensitive data by developing privacy-aware exploration, error detection, and repair frameworks; (3) modelling data cleaning as a large-scale statistical inference problem that takes into account all available signals, including business rules, master data, and various statistical properties; (4) studying practical variants of the outlier detection problem; and (5) investigating the quality issues in integrating unstructured data (such as text) with structured relational data, including revisiting information extraction systems to include quality constraints. The proposed techniques will be implemented and tested in multiple open-source system prototypes, including HoloClean, our recent system for machine learning-based data cleaning.
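As a rough illustration of the three cleaning activities named in the abstract (imputing missing values, flagging outliers, and deduplicating records), the following is a minimal sketch using only the Python standard library. It is not the method proposed in the grant and not how HoloClean works; the function names, the median-based imputation, the modified z-score rule (constant 0.6745, threshold 3.5, a common convention), and the key-normalization scheme are all assumptions chosen for simplicity.

```python
from statistics import median


def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.

    The score is based on the median absolute deviation (MAD), which is
    less sensitive to extreme values than the mean and standard deviation.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: nothing can be scored, so flag nothing.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


def deduplicate(records, key_fields):
    """Keep the first record for each key, ignoring case and extra whitespace."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(" ".join(str(rec[f]).lower().split()) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


if __name__ == "__main__":
    print(impute_missing([1, None, 3]))
    print(flag_outliers([10, 11, 9, 10, 12, 100]))
    people = [
        {"name": "Ann  Lee", "city": "Waterloo"},
        {"name": "ann lee", "city": "waterloo"},
        {"name": "Bob", "city": "Toronto"},
    ]
    print(deduplicate(people, ["name", "city"]))
```

Rules this simple treat each column and each record in isolation. The abstract's third challenge, modelling cleaning as statistical inference over business rules, master data, and statistical properties together, is aimed at exactly what such per-column heuristics miss.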
Project Outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other Grants by Ilyas, Ihab
Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
- Grant number: RGPIN-2019-04068
- Fiscal year: 2022
- Funding amount: $29,900
- Project category: Discovery Grants Program - Individual
Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
- Grant number: RGPIN-2019-04068
- Fiscal year: 2021
- Funding amount: $29,900
- Project category: Discovery Grants Program - Individual
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
- Grant number: 534011-2017
- Fiscal year: 2021
- Funding amount: $29,900
- Project category: Industrial Research Chairs
End-to-end Extraction and Curation of Large RDF Repositories
- Grant number: 543961-2019
- Fiscal year: 2020
- Funding amount: $29,900
- Project category: Collaborative Research and Development Grants
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
- Grant number: 534011-2017
- Fiscal year: 2020
- Funding amount: $29,900
- Project category: Industrial Research Chairs
Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
- Grant number: RGPIN-2019-04068
- Fiscal year: 2020
- Funding amount: $29,900
- Project category: Discovery Grants Program - Individual
End-to-end Extraction and Curation of Large RDF Repositories
- Grant number: 543961-2019
- Fiscal year: 2019
- Funding amount: $29,900
- Project category: Collaborative Research and Development Grants
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
- Grant number: 534011-2017
- Fiscal year: 2019
- Funding amount: $29,900
- Project category: Industrial Research Chairs
Cleaning and Analysis of Large Uncertain and Inconsistent Data Sources
- Grant number: RGPIN-2014-06143
- Fiscal year: 2018
- Funding amount: $29,900
- Project category: Discovery Grants Program - Individual
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
- Grant number: 534011-2017
- Fiscal year: 2018
- Funding amount: $29,900
- Project category: Industrial Research Chairs
Similar Overseas Grants
SBIR Phase II: Increasing energy yield from dusty solar panels with a new generation of an electrostatic self-cleaning technology
- Grant number: 2322204
- Fiscal year: 2024
- Funding amount: $29,900
- Project category: Cooperative Agreement
RADWIPES - Performance and Cleaning Efficacy
- Grant number: 10089486
- Fiscal year: 2024
- Funding amount: $29,900
- Project category: Collaborative R&D
Simultaneous treatment technology for high-concentration NOx, SOx, and particulate matter from ships by plasma hybrid cleaning method
- Grant number: 23H01626
- Fiscal year: 2023
- Funding amount: $29,900
- Project category: Grant-in-Aid for Scientific Research (B)
Development of multifunctional soft denture liners with self-cleaning function and drug delivery function for sustained release of physiologically active substances
- Grant number: 23H03094
- Fiscal year: 2023
- Funding amount: $29,900
- Project category: Grant-in-Aid for Scientific Research (B)
Microwave assisted photocatalysis and heterogeneous catalysis for cleaning with air
- Grant number: 2882933
- Fiscal year: 2023
- Funding amount: $29,900
- Project category: Studentship
SBIR Phase I: Optimization of a Novel Compliant Mechanisms-Based Laparoscope Cleaning Device
- Grant number: 2213695
- Fiscal year: 2023
- Funding amount: $29,900
- Project category: Standard Grant
Data Collection, Linkages, Cleaning and Sharing Core
- Grant number: 10774555
- Fiscal year: 2023
- Funding amount: $29,900
- Project category:
DEVELOPMENT OF SNOW CLEANING WITH SUPERCRITICAL CO2
- Grant number: 10948128
- Fiscal year: 2023
- Funding amount: $29,900
- Project category:
Novel Surfactants and Cleaning Technologies
- Grant number: 2889941
- Fiscal year: 2023
- Funding amount: $29,900
- Project category: Studentship
An Injectable Glucose Biosensor Based on a Self-cleaning Membrane & NIR FRET Assay
- Grant number: 2314639
- Fiscal year: 2023
- Funding amount: $29,900
- Project category: Standard Grant