Robust clustering of mixed-type data

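Basic information

Grant number:
2602507
Principal investigator:
Amount:
--
Host institution:
Imperial College London
Host institution country:
United Kingdom
Project category:
Studentship
Fiscal year:
2021
Funding country:
United Kingdom
Start and end dates:
2021 to (no data)
Project status:
Not yet concluded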

Source:
https://gtr.ukri.org/projects?ref=studentship-2602507
Keywords:
Robust clustering mixed type data

Project abstract

Given the vast amount of data now available, industry has a strong need for efficient 'segmentation' algorithms. A typical example comes from market research, where one of the main tasks is dividing a large group of a company's current or potential customers into smaller groups. The ultimate goal is to aggregate the subjects into 'segments', such that each segment consists of subjects who are likely to share the same needs or have common interests. This is achieved by identifying similarities among the subjects, which may be of a demographic, geographic, psychographic or behavioural nature. In statistical machine learning, segmentation is known as 'cluster analysis' or 'clustering'. However, clustering is not straightforward when dealing with data sets that include both numerical and text data (commonly referred to as 'mixed-type data'), or when anomalous points are present. A data point is said to be 'anomalous' or 'outlying' if it does not conform to a general pattern that may exist within the data set, or if it contains 'unusual' values that are 'abnormal' enough, compared with the majority of values in the rest of the data, to arouse suspicion. Although a significant number of methods for cluster analysis of mixed-type data exist in the literature, none of them is 'robust' to the presence of outlying data points. This is potentially a consequence of 'outliers' having a well-established definition for numerical data but not for non-numeric data. A more general definition of 'categorical outliers' ('categorical' referring to the fact that some variables may only take a fixed number of values, called 'categories') is therefore needed, so that we can better understand what it means to have 'outliers' or 'anomalies' in a mixed data set.
Although a simple approach could involve detecting outliers for the numerical and the categorical data separately, this is rather naive, since outliers might still exist based on the relationships among variables of different types. In fact, only a very small number of anomaly detection algorithms for mixed-type data exist in the literature, and these make use of the aforementioned naive approach, with no software implementation available either. Our project aims to develop novel methodology for identifying data points that are anomalous in a mixed data set, by employing anomaly detection techniques in an unsupervised manner (meaning that we do not have access to a 'ground truth' regarding which data points are the anomalies). This will involve using statistical tools to capture any dependencies or interactions between the variables that make up a mixed data set, in order to achieve a good understanding of the data set and, as a result, of which observations may be outlying. Combining the results of such a method with clustering algorithms for mixed-type data could enhance the performance of existing non-robust methods. Moreover, such a method could be extended to account for an additional aspect of robustness, concerning 'incomplete' observations within a data set. Data irregularities and missing observations are very common issues for practitioners in several industry sectors, such as the automotive, education, insurance, retail and telecommunications sectors, all of which use segmentation techniques in some form. We therefore want to provide them with a framework under which they can obtain results that are meaningful and easily interpretable to them, without being affected by 'misleading' or missing observations. This project falls within the EPSRC Statistics and applied probability research area.
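
The limitation of the naive approach can be illustrated with a small sketch. This is not the project's methodology, which the abstract does not specify. The toy records, the function names (`robust_z`, `marginal_flags`, `conditional_flags`), the median/MAD score and the cut-offs (3.5 and 5%) are all assumptions chosen for illustration. The sketch checks each variable separately, then checks the numerical variable within each category, which is one simple way to expose a dependency between variables of different types.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical toy records: (occupation, income in thousands).
records = (
    [("student", x) for x in (10, 11, 12, 11, 10, 12, 11)]
    + [("student", 95)]  # common category, unremarkable income, unusual pair
    + [("engineer", x) for x in (50, 52, 55, 53, 51, 54)]
    + [("executive", x) for x in (90, 95, 100, 92, 98, 96)]
)


def robust_z(values):
    """Median/MAD-based z-scores; 0.6745 rescales the MAD for normal data."""
    values = list(values)
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Simplification for this sketch: no spread, so nothing is scored.
        return [0.0 for _ in values]
    return [0.6745 * abs(v - med) / mad for v in values]


def marginal_flags(records, z_cut=3.5, min_freq=0.05):
    """Naive approach: test the numerical and categorical parts separately."""
    cats = [c for c, _ in records]
    z = robust_z(x for _, x in records)
    freq = Counter(cats)
    n = len(records)
    return [i for i in range(n) if z[i] > z_cut or freq[cats[i]] / n < min_freq]


def conditional_flags(records, z_cut=3.5):
    """Score the numerical value within each category instead."""
    groups = defaultdict(list)
    for i, (cat, _) in enumerate(records):
        groups[cat].append(i)
    flagged = []
    for idx in groups.values():
        z = robust_z(records[i][1] for i in idx)
        flagged += [i for i, zi in zip(idx, z) if zi > z_cut]
    return sorted(flagged)


marginal = marginal_flags(records)
conditional = conditional_flags(records)
print("flagged by separate checks:", marginal)       # []
print("flagged by per-category check:", conditional)  # [7]
```

In this toy example the record at index 7 (a student with an income of 95) passes both separate checks, because 'student' is a frequent category and 95 lies within the overall income range. It is flagged only once income is examined conditionally on occupation.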
