Cleaning and Analysis of Large Uncertain and Inconsistent Data Sources

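Basic Information

  • Grant number:
    RGPIN-2014-06143
  • Principal investigator:
  • Amount:
    $55,400
  • Host institution:
  • Host institution country:
    Canada
  • Program:
    Discovery Grants Program - Individual
  • Fiscal year:
    2018
  • Funding country:
    Canada
  • Duration:
    2018-01-01 to 2019-12-31
  • Status:
    Completed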

Project Abstract

Data generated by modern applications such as object tracking, sensor networks, health record management, and Web data integration involves uncertainty and various anomalies, such as missing values and duplication. While many research efforts have focused on data cleaning and on dealing with inconsistent databases, very little of this research has been adopted in real settings, owing to technical and practical challenges around increasing and extracting value from large dirty data sets. To list a few of these challenges: (1) the protection and sensitivity of data, where data custodians and guardians prevent automatic repairing algorithms from changing the underlying data; (2) the heterogeneity of integrity constraints, which makes proposed techniques that tackle a single type of error inapplicable or ineffective in practice; (3) the lack of ground truth to validate repairing strategies, given that quality metrics such as minimal repairs have not shown great results in practice; and (4) the lack of interactive data quality tools that allow users and experts to reason about the problematic parts of the data and to explain the reasons behind these errors.

In this proposal, we focus on enabling data quality analytics and retrieval on large-scale inconsistent and dirty databases. The proposal pursues a set of research directions including (1) non-destructive data cleaning, which represents and queries possible data repairs without changing the underlying data; (2) holistic data cleaning, which addresses the violations of multiple heterogeneous integrity constraints; (3) high-fidelity data repairing, which depends more on trusted data sources and experts and less on heuristic quality metrics such as minimal repairs; and (4) descriptive and prescriptive data quality analytics in practical dashboards that go beyond describing errors in the data to recommending ways to prevent future errors.

The proposed techniques will be implemented and tested in our previously developed system prototypes: UClean, a probabilistic and quality-aware database engine prototype based on an open-source database management system; and NADEEF, an open-source extensible data cleaning system. The goal is to build a generic framework that encapsulates efficient query processing algorithms to allow users to effectively query, analyze, and explore large volumes of inconsistent and uncertain data.

The developed algorithms and dashboard will enable both the research community and industry to reason about the quality of available data sets, and will provide guidance on how to clean or enhance the quality of this data with respect to target applications or use cases.
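
Research direction (1) in the abstract, non-destructive data cleaning, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy records, the `zip -> city` functional dependency, the helper names, and the frequency-based weighting of candidate values. It does not reproduce the proposal's algorithms or the UClean or NADEEF interfaces. It only shows the general idea of detecting constraint violations and keeping possible repairs alongside the data rather than overwriting it.

```python
from collections import defaultdict

# Hypothetical toy records; "zip -> city" is an assumed functional dependency.
ROWS = [
    {"id": 1, "zip": "10001", "city": "New York"},
    {"id": 2, "zip": "10001", "city": "NYC"},
    {"id": 3, "zip": "60601", "city": "Chicago"},
    {"id": 4, "zip": "10001", "city": "New York"},
]


def fd_violations(rows, lhs, rhs):
    """Group rows by `lhs` and return the groups whose `rhs` values disagree."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[lhs]].append(row)
    return {key: grp for key, grp in groups.items()
            if len({r[rhs] for r in grp}) > 1}


def candidate_repairs(rows, lhs, rhs):
    """Map each violating row id to candidate `rhs` values with weights.

    Weights are relative frequencies within the violating group (an assumed
    heuristic). The input rows are never modified.
    """
    repairs = {}
    for grp in fd_violations(rows, lhs, rhs).values():
        counts = defaultdict(int)
        for r in grp:
            counts[r[rhs]] += 1
        weights = {value: n / len(grp) for value, n in counts.items()}
        for r in grp:
            repairs[r["id"]] = weights
    return repairs


def possible_answers(rows, repairs, rhs, row_id):
    """Return the stored value for a clean row, else its weighted candidates."""
    for r in rows:
        if r["id"] == row_id:
            return repairs.get(row_id, {r[rhs]: 1.0})
    raise KeyError(row_id)


repairs = candidate_repairs(ROWS, "zip", "city")
print(possible_answers(ROWS, repairs, "city", 2))
print(possible_answers(ROWS, repairs, "city", 3))
```

Because the repairs live in a separate mapping, the stored value `"NYC"` in row 2 is left untouched, which matches the constraint in challenge (1) that data custodians may forbid automatic changes. Frequency weighting is itself a heuristic; direction (3) in the abstract argues for relying instead on trusted sources and experts.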

Project Outputs

Journal papers (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants held by Ilyas, Ihab

Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
  • Grant number:
    RGPIN-2019-04068
  • Fiscal year:
    2022
  • Amount:
    $55,400
  • Program:
    Discovery Grants Program - Individual
Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
  • Grant number:
    RGPIN-2019-04068
  • Fiscal year:
    2021
  • Amount:
    $55,400
  • Program:
    Discovery Grants Program - Individual
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
  • Grant number:
    534011-2017
  • Fiscal year:
    2021
  • Amount:
    $55,400
  • Program:
    Industrial Research Chairs
End-to-end Extraction and Curation of Large RDF Repositories
  • Grant number:
    543961-2019
  • Fiscal year:
    2020
  • Amount:
    $55,400
  • Program:
    Collaborative Research and Development Grants
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
  • Grant number:
    534011-2017
  • Fiscal year:
    2020
  • Amount:
    $55,400
  • Program:
    Industrial Research Chairs
Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
  • Grant number:
    RGPIN-2019-04068
  • Fiscal year:
    2020
  • Amount:
    $55,400
  • Program:
    Discovery Grants Program - Individual
Scalable Cleaning, Integration and Analysis of Structured and Semi-Structured Inconsistent Data
  • Grant number:
    RGPIN-2019-04068
  • Fiscal year:
    2019
  • Amount:
    $55,400
  • Program:
    Discovery Grants Program - Individual
End-to-end Extraction and Curation of Large RDF Repositories
  • Grant number:
    543961-2019
  • Fiscal year:
    2019
  • Amount:
    $55,400
  • Program:
    Collaborative Research and Development Grants
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
  • Grant number:
    534011-2017
  • Fiscal year:
    2019
  • Amount:
    $55,400
  • Program:
    Industrial Research Chairs
NSERC/Thomson Reuters Industrial Research Chair in Data Cleaning
  • Grant number:
    534011-2017
  • Fiscal year:
    2018
  • Amount:
    $55,400
  • Program:
    Industrial Research Chairs

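Similar grants from the National Natural Science Foundation of China (NSFC)

Scalable Learning and Optimization: High-dimensional Models and Online Decision-Making Strategies for Big Data Analysis
  • Grant number:
  • Approval year:
    2024
  • Amount:
  • Program:
    Collaborative Innovation Research Team
Intelligent Patent Analysis for Optimized Technology Stack Selection: Blockchain Business Registry Case Demonstration
  • Grant number:
  • Approval year:
    2024
  • Amount:
  • Program:
    Research Fund for International Scientists
A meta-analysis-based model of irrigation-driven yield increase for cotton in Xinjiang
  • Grant number:
    41601604
  • Approval year:
    2016
  • Amount:
    CNY 220,000
  • Program:
    Young Scientists Fund
Meta-analysis methods for large-scale microarray data sets
  • Grant number:
    31100958
  • Approval year:
    2011
  • Amount:
    CNY 200,000
  • Program:
    Young Scientists Fund
Elucidating the artemisinin biosynthetic pathway using retrobiosynthetic NMR analysis
  • Grant number:
    30470153
  • Approval year:
    2004
  • Amount:
    CNY 220,000
  • Program:
    General Program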

Similar overseas grants

Exploring Hotel Customer Experiences in Japan via Big Data and Large Language Model Analysis
  • Grant number:
    24K21025
  • Fiscal year:
    2024
  • Amount:
    $55,400
  • Program:
    Grant-in-Aid for Early-Career Scientists
Uncovering Sex-Specific Biological Mechanisms of Depression: Insights from Large-Scale Data Analysis
  • Grant number:
    MR/Y011112/1
  • Fiscal year:
    2024
  • Amount:
    $55,400
  • Program:
    Fellowship
Collaborative Research: CISE-ANR: CIF: Small: Learning from Large Datasets - Application to Multi-Subject fMRI Analysis
  • Grant number:
    2316421
  • Fiscal year:
    2023
  • Amount:
    $55,400
  • Program:
    Standard Grant
RII Track-4: NSF: DyG-MAP: Fast Algorithms for Mining and Analysis of Evolving Patterns in Large Dynamic Graphs
  • Grant number:
    2323533
  • Fiscal year:
    2023
  • Amount:
    $55,400
  • Program:
    Standard Grant
Collaborative Research: NeTS: Medium: Large Scale Analysis of Configurations and Management Practices in the Domain Name System
  • Grant number:
    2312711
  • Fiscal year:
    2023
  • Amount:
    $55,400
  • Program:
    Standard Grant
Non-linear large signal network analysis
  • Grant number:
    512477106
  • Fiscal year:
    2023
  • Amount:
    $55,400
  • Program:
    Major Research Instrumentation
Development of "ultra" large displacement dynamic analysis algorithm using machine learning
  • Grant number:
    23K04007
  • Fiscal year:
    2023
  • Amount:
    $55,400
  • Program:
    Grant-in-Aid for Scientific Research (C)
Identification of coexistence relationships and phenotypic traits of virulence and resistance genes by large-scale E. coli genome analysis
  • Grant number:
    23K07947
  • Fiscal year:
    2023
  • Amount:
    $55,400
  • Program:
    Grant-in-Aid for Scientific Research (C)
Elucidation of a novel subgroup of anaplastic thyroid carcinoma by large-scale cohort and comprehensive genetic analysis
  • Grant number:
    23K14493
  • Fiscal year:
    2023
  • Amount:
    $55,400
  • Program:
    Grant-in-Aid for Early-Career Scientists
Large-scale data analysis for improving the process of personalized remote lifestyle intervention
  • Grant number:
    23K16769
  • Fiscal year:
    2023
  • Amount:
    $55,400
  • Program:
    Grant-in-Aid for Early-Career Scientists