Large-Scale Data Analytics: Methodologies and Applications

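Basic information

  • Grant number:
    RGPIN-2014-05721
  • Principal investigator:
  • Amount:
    $14,600
  • Host institution:
  • Host institution country:
    Canada
  • Project type:
    Discovery Grants Program - Individual
  • Fiscal year:
    2017
  • Funding country:
    Canada
  • Project period:
    2017-01-01 to 2018-12-31
  • Project status:
    Completed

Project abstract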

Recent years have witnessed the rise of the big data era in computing and storage systems. With the great advances in information and communication technology, hundreds of petabytes of data are generated, transferred, processed and stored every day. The availability of this overwhelming amount of structured and unstructured data creates an acute need for fast and accurate algorithms that discover the useful information hidden in big data. One of the crucial problems in the big data era is representing the data and its underlying information in a succinct and interpretable format. Although different algorithms for clustering and dimensionality reduction can be used to summarize big data, these algorithms tend to learn representations whose meanings are difficult to interpret. For instance, traditional clustering algorithms such as k-means tend to produce centroids that encode information about thousands of data instances, but the meanings of these centroids are hard to interpret. Even clustering methods that use data instances as prototypes, such as k-medoids, learn only one representative for each of the resulting clusters; a single representative is not sufficient to capture the insights of the data instances in its cluster. In addition, using medoids as representatives implicitly assumes that the data points are distributed as clusters and that the number of those clusters is known ahead of time. This assumption does not hold for all data sets. On the other hand, traditional dimensionality reduction algorithms such as Latent Semantic Analysis (LSA) tend to learn a few latent concepts in the feature space. Each of these concepts is represented by a dense vector that combines thousands of features with positive and negative weights. This makes it difficult for the data analyst to understand the meaning of these concepts.
Even if the goal of representative selection is to learn a low-dimensional embedding of data instances, learning dimensions whose meanings are easy to interpret helps in understanding the results of data mining and machine learning algorithms, such as the meanings of data clusters in the low-dimensional space. The acute need to summarize big data in a format that is informative for data analysts motivates the development of new algorithms that directly select a few representative data instances and/or features. This problem can be generally formulated as the selection of a subset of columns from a data matrix, which is formally known as the Column Subset Selection (CSS) problem. Although many algorithms have been proposed for tackling the CSS problem, most of them focus on randomly selecting a subset of columns with the goal of using these columns to obtain a low-rank approximation of the data matrix. In this case, these algorithms tend to select a relatively large number of columns. When the goal is to select only a few columns to be directly presented to a data analyst or indirectly used to interpret the results of other algorithms, the randomized CSS methods do not produce a meaningful subset of columns. On the other hand, deterministic algorithms for CSS, although more accurate, do not scale to big matrices with massively distributed columns. We propose to address these limitations by developing a new framework that we call Data Downdate. A large variety of important problems in machine learning and data mining, such as variable selection, mining representative patterns, and most notably sparse approximation, are all special cases of this framework.
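
To make the CSS formulation above concrete, the following is a minimal sketch of one simple deterministic approach: a greedy selection that, at each step, picks the column whose addition most reduces the Frobenius-norm error of reconstructing the whole matrix from the selected columns. This is an illustration only and is not the Data Downdate framework proposed in the abstract; the function names `greedy_css` and `reconstruction_error`, the tolerance `tol`, and the toy matrix are all assumptions introduced here.

```python
import numpy as np


def greedy_css(A, k, tol=1e-12):
    """Greedily choose up to k column indices of A (illustrative sketch).

    At each step, pick the column whose direction removes the most
    squared Frobenius norm from the current residual matrix.
    """
    E = np.array(A, dtype=float)  # residual; starts as a copy of A
    selected = []
    for _ in range(k):
        norms = np.sum(E * E, axis=0)   # squared norm of each residual column
        G = E.T @ E                     # Gram matrix of the residual
        gains = np.sum(G * G, axis=0)   # ||E^T e_j||^2 for each column j
        scores = np.where(norms > tol, gains / np.maximum(norms, tol), -np.inf)
        if selected:
            scores[selected] = -np.inf  # never pick the same column twice
        j = int(np.argmax(scores))
        if not np.isfinite(scores[j]):
            break                       # residual is numerically zero
        e = E[:, j] / np.sqrt(norms[j])  # unit vector along residual column j
        E -= np.outer(e, e @ E)          # project that direction out of E
        selected.append(j)
    return selected


def reconstruction_error(A, cols):
    """Frobenius error of the least-squares fit of A from columns `cols`."""
    C = A[:, cols]
    coeffs, *_ = np.linalg.lstsq(C, A, rcond=None)
    return np.linalg.norm(A - C @ coeffs)


if __name__ == "__main__":
    # Toy data (hypothetical): a rank-2 matrix with 5 columns.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))
    cols = greedy_css(A, 2)
    print("selected columns:", cols)
    print("reconstruction error:", reconstruction_error(A, cols))
```

Because this sketch forms the full Gram matrix of the residual at every step, its cost grows quadratically with the number of columns, which illustrates the scalability limitation of deterministic selection that the abstract describes.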

Project outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Ghodsi, Ali

Novel mass detection based on magnetic excitation in anti-resonance region
Automatic dimensionality selection from the scree plot via the use of profile likelihood
A conceptual study on the dynamics of a piezoelectric MEMS (Micro Electro Mechanical System) energy harvester
  • DOI:
    10.1016/j.energy.2015.12.014
  • Publication date:
    2016-02-01
  • Journal:
  • Impact factor:
    9
  • Authors:
    Azizi, Saber;Ghodsi, Ali;Ghazavi, Mohammad Reza
  • Corresponding author:
    Ghazavi, Mohammad Reza
Controlling the Morphology of PVDF Hollow Fiber Membranes by Promotion of Liquid-Liquid Phase Separation
  • DOI:
    10.1002/adem.201701169
  • Publication date:
    2018-07-01
  • Journal:
  • Impact factor:
    3.6
  • Authors:
    Ghodsi, Ali;Fashandi, Hossein;Mirzaei, Majid
  • Corresponding author:
    Mirzaei, Majid
Eventual Consistency Today: Limitations, Extensions, and Beyond
  • DOI:
    10.1145/2447976.2447992
  • Publication date:
    2013-05-01
  • Journal:
  • Impact factor:
    22.7
  • Authors:
    Bailis, Peter;Ghodsi, Ali
  • Corresponding author:
    Ghodsi, Ali

Other grants held by Ghodsi, Ali

Beyond Deep Associative learning
  • Grant number:
    RGPIN-2019-04824
  • Fiscal year:
    2022
  • Funding amount:
    $14,600
  • Project type:
    Discovery Grants Program - Individual
Beyond Deep Associative learning
  • Grant number:
    RGPIN-2019-04824
  • Fiscal year:
    2021
  • Funding amount:
    $14,600
  • Project type:
    Discovery Grants Program - Individual
Beyond Deep Associative learning
  • Grant number:
    RGPIN-2019-04824
  • Fiscal year:
    2020
  • Funding amount:
    $14,600
  • Project type:
    Discovery Grants Program - Individual
Beyond Deep Associative learning
  • Grant number:
    RGPIN-2019-04824
  • Fiscal year:
    2019
  • Funding amount:
    $14,600
  • Project type:
    Discovery Grants Program - Individual
Large-Scale Data Analytics: Methodologies and Applications
  • Grant number:
    RGPIN-2014-05721
  • Fiscal year:
    2018
  • Funding amount:
    $14,600
  • Project type:
    Discovery Grants Program - Individual
Large-Scale Data Analytics: Methodologies and Applications
  • Grant number:
    RGPIN-2014-05721
  • Fiscal year:
    2016
  • Funding amount:
    $14,600
  • Project type:
    Discovery Grants Program - Individual
Large-Scale Data Analytics: Methodologies and Applications
  • Grant number:
    RGPIN-2014-05721
  • Fiscal year:
    2015
  • Funding amount:
    $14,600
  • Project type:
    Discovery Grants Program - Individual
Large-Scale Data Analytics: Methodologies and Applications
  • Grant number:
    RGPIN-2014-05721
  • Fiscal year:
    2014
  • Funding amount:
    $14,600
  • Project type:
    Discovery Grants Program - Individual

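Similar grants from the National Natural Science Foundation of China

Scale-down mechanism and regulation of traditional solid-state fermentation processes based on heat transfer
  • Grant number:
    22108101
  • Approval year:
    2021
  • Funding amount:
    300,000 yuan
  • Project type:
    Young Scientists Fund
Study of transient flow and cavitation mechanisms in axial-flow blood pumps based on a multi-scale model
  • Grant number:
    31600794
  • Approval year:
    2016
  • Funding amount:
    220,000 yuan
  • Project type:
    Young Scientists Fund
Research on compact routing for scale-free networks
  • Grant number:
    60673168
  • Approval year:
    2006
  • Funding amount:
    250,000 yuan
  • Project type:
    General Program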

Similar overseas grants

Uncovering Sex-Specific Biological Mechanisms of Depression: Insights from Large-Scale Data Analysis
  • Grant number:
    MR/Y011112/1
  • Fiscal year:
    2024
  • Funding amount:
    $14,600
  • Project type:
    Fellowship
Novel Analytical and Computational Approaches for Fusion and Analysis of Multi-Level and Multi-Scale Networks Data
  • Grant number:
    2311297
  • Fiscal year:
    2023
  • Funding amount:
    $14,600
  • Project type:
    Standard Grant
GOALI: Frameworks: At-Scale Heterogeneous Data based Adaptive Development Platform for Machine-Learning Models for Material and Chemical Discovery
  • Grant number:
    2311632
  • Fiscal year:
    2023
  • Funding amount:
    $14,600
  • Project type:
    Standard Grant
Collaborative Research: OAC Core: Small: Anomaly Detection and Performance Optimization for End-to-End Data Transfers at Scale
  • Grant number:
    2412329
  • Fiscal year:
    2023
  • Funding amount:
    $14,600
  • Project type:
    Standard Grant
DMS/NIGMS 2: Deep learning for repository-scale analysis of tandem mass spectrometry proteomics data
  • Grant number:
    2245300
  • Fiscal year:
    2023
  • Funding amount:
    $14,600
  • Project type:
    Continuing Grant
Collaborative Research: RETTL: Story Studio: Coaching Data Storytelling at Scale
  • Grant number:
    2302795
  • Fiscal year:
    2023
  • Funding amount:
    $14,600
  • Project type:
    Standard Grant
Collaborative Research: RETTL: Story Studio: Coaching Data Storytelling at Scale
  • Grant number:
    2302794
  • Fiscal year:
    2023
  • Funding amount:
    $14,600
  • Project type:
    Standard Grant
Study of electrified post-sunset medium-scale traveling ionospheric disturbances at mid-latitudes: a multi-layer model and a multi-source data investigation
  • Grant number:
    23K19066
  • Fiscal year:
    2023
  • Funding amount:
    $14,600
  • Project type:
    Grant-in-Aid for Research Activity Start-up
Fluency from Flesh to Filament: Collation, Representation, and Analysis of Multi-Scale Neuroimaging data to Characterize and Diagnose Alzheimer's Disease
  • Grant number:
    10462257
  • Fiscal year:
    2023
  • Funding amount:
    $14,600
  • Project type:
Statistical methods for co-expression network analysis of population-scale scRNA-seq data
  • Grant number:
    10740240
  • Fiscal year:
    2023
  • Funding amount:
    $14,600
  • Project type: