Finding Groups in Big Data

Basic Information

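Grant number:
RGPIN-2016-04850
Principal investigator:
Sander, Jörg
Amount:
$33,500
Host institution:
University of Alberta
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2019
Funding country:
Canada
Period:
2019-01-01 to 2020-12-31
Project status:
Completed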

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=689277
Keywords:
Finding Groups Big Data

Project Abstract

The main objective of the proposed research program is to advance the theory and practice of “Big Data” cluster analysis. “Big Data” refers to today's ever larger and more complex data sets, which are typically collected on a large scale by automatic equipment (e.g., microarray chips, sensors, logging devices). These data sets have become abundant, and they hold the potential for the discovery of new insights, which can lead to new opportunities for improved, data-driven decision making. Unsupervised, exploratory methods for knowledge discovery play an important role in realizing this potential.

One of the major exploratory data analysis tasks is finding “natural” groups in data. Understanding the groups in Big Data allows a better organization and classification of Big Data, more efficient browsing and searching, focusing an analysis on specific groups, and discovering unknown relationships between groups.

Clustering is the most common unsupervised approach to finding groups in data. Traditional clustering methods, however, face challenges when applied to Big Data because of the typically large volume and high dimensionality of the data, and they are not designed to take advantage of properties such as the time dependence of some Big Data sets and relationships between different data sources.

My proposed research program is aimed at advancing the theory and practice of clustering methods, particularly density-based clustering (i.e., where clusters are considered dense regions in the data space, separated by regions of lower point density), applied to “Big Data”. Based on the theoretical insights we will gain, we will develop novel and improved algorithms for clustering Big Data that overcome limitations of current clustering methods and extend the applicability of clustering to a wider range of Big Data scenarios.

The research will focus on the following aspects: (a) fast methods to deal with large data volumes, (b) projected and subspace clustering that can address issues of high dimensionality such as data sparseness and “irrelevant” attributes, (c) semi-supervised clustering based on constraints, including constraints derived from related data sources, that can guide an algorithm to a solution which is consistent with these constraints, (d) combining projected/subspace clustering with semi-supervision, to find simultaneously the subspace(s) and clusters that are most consistent with given constraints, and (e) modelling the development of cluster structures over time to allow the discovery and tracking of relationships between different clusters.

Progress with these issues will benefit a wide and rapidly growing range of application areas in which Big Data is being collected, including industrial and business applications as well as medical, biological, and other scientific domains.
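
The density-based notion of a cluster mentioned in the abstract (dense regions separated by regions of lower point density) can be illustrated with a minimal sketch in the style of the classic DBSCAN algorithm. This is only an illustration and not the method proposed in this grant: the function names `region_query` and `dbscan`, the parameters `eps` and `min_pts`, and the toy data are assumptions made here, and the brute-force neighbor search is quadratic in the number of points, so it would not scale to Big Data volumes.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Assign each point a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not dense enough to start a cluster; may later become a border point.
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a dense point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                # j is itself in a dense region, so the cluster grows through it.
                queue.extend(j_neighbors)
    return labels


# Hypothetical toy data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

In this toy run the two groups of four points each form their own cluster, and the isolated point at (5, 5) is labeled as noise because its neighborhood is too sparse.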
