Finding Groups in Big Data

Basic information

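Grant number:
RGPIN-2016-04850
Principal investigator:
Sander, Jörg
Amount:
$33.5K
Host institution:
University of Alberta
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2016
Funding country:
Canada
Start and end dates:
2016-01-01 to 2017-12-31
Project status:
Completed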

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=615338
Keywords:
Finding Groups Big Data

Project abstract

The main objective of the proposed research program is to advance the theory and practice of “Big Data” cluster analysis. “Big Data” refers to today’s ever larger and more complex data sets, which are typically collected on a large scale by automatic equipment (e.g., microarray chips, sensors, logging devices). These data sets have become abundant, and they hold the potential for the discovery of new insights, which can lead to new opportunities for improved, data-driven decision making. Unsupervised, exploratory methods for knowledge discovery play an important role in realizing this potential. One of the major exploratory data analysis tasks is finding “natural” groups in data. Understanding the groups in Big Data allows better organization and classification of the data, more efficient browsing and searching, analyses focused on specific groups, and the discovery of unknown relationships between groups. Clustering is the most common unsupervised approach to finding groups in data. Traditional clustering methods, however, face challenges when applied to Big Data because of the typically large volume and high dimensionality of the data, and they are not designed to take advantage of properties such as the time dependence of some Big Data sets and relationships between different data sources. My proposed research program is aimed at advancing the theory and practice of clustering methods applied to “Big Data”, particularly density-based clustering (i.e., where clusters are considered dense regions in the data space, separated by regions of lower point density). Based on the theoretical insights we will gain, we will develop novel and improved algorithms for clustering Big Data that overcome limitations of current clustering methods and extend the applicability of clustering to a wider range of Big Data scenarios.
The research will focus on the following aspects: (a) fast methods to deal with large data volumes, (b) projected and subspace clustering that can address issues of high dimensionality such as data sparseness and “irrelevant” attributes, (c) semi-supervised clustering based on constraints, including constraints derived from related data sources, that can guide an algorithm to a solution consistent with these constraints, (d) combining projected/subspace clustering with semi-supervision, to find simultaneously the subspace(s) and clusters that are most consistent with given constraints, and (e) modelling the development of cluster structures over time to allow the discovery and tracking of relationships between different clusters. Progress on these issues will benefit a wide and fast-growing range of application areas in which Big Data is being collected, including industrial and business applications as well as medical, biological, and other scientific domains.
