Robust clustering of mixed-type data
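Basic information
- Grant number: 2602507
- Principal investigator:
- Amount: --
- Host institution:
- Host institution country: United Kingdom
- Project category: Studentship
- Fiscal year: 2021
- Funding country: United Kingdom
- Period: 2021 to no data
- Project status: Ongoing
- Source:
- Keywords:
Project abstract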
Given the vast amount of data now available, there is a strong need for efficient 'segmentation' algorithms for use in industry. A typical example comes from market research, where one of the main tasks is dividing a large group of a company's current or potential customers into smaller groups. The ultimate goal is to aggregate the subjects into 'segments', such that each segment consists of subjects who are likely to share the same needs or have common interests. This is achieved by identifying similarities among the subjects, which may be demographic, geographic, psychographic or behavioural in nature. In statistical machine learning, segmentation is known as 'cluster analysis' or 'clustering'. However, clustering is not straightforward when a data set includes both numerical and text data (commonly referred to as 'mixed-type data'), or when it contains anomalous points. A data point is said to be 'anomalous' or 'outlying' if it does not conform to a general pattern that may exist within the data set, or if it contains 'unusual' values that are 'abnormal' enough, compared with the majority of the other data points, to arouse suspicion. Although a significant number of methods for cluster analysis of mixed-type data exist in the literature, none of them is 'robust' to the presence of outlying data points. This is potentially a consequence of there being a well-established definition of 'outliers' for numerical data but not for non-numerical data. A more general definition of 'categorical outliers' ('categorical' referring to the fact that some variables may only take a fixed number of values, called 'categories') is therefore needed, so that we can better understand what it means to have 'outliers' or 'anomalies' in a mixed data set.
Although a simple approach could involve detecting outliers for the numerical and the categorical data individually, this is rather naive, since outliers might still exist based on the relationships among variables of different types. In fact, only a very small number of anomaly detection algorithms for mixed-type data exist in the literature, and these rely on the naive approach just described; no software implementation is available either. Our project aims to develop novel methodology for identifying data points that are anomalous in a mixed data set, by employing anomaly detection techniques in an unsupervised manner (meaning that we do not have access to a 'ground truth' regarding which data points are the anomalies). This will involve using statistical tools to capture any dependencies or interactions between the variables that make up a mixed data set, in order to achieve a good understanding of the data set and, as a result, of which observations may be outlying. Combining the results of such a method with clustering algorithms for mixed-type data could enhance the performance of existing non-robust methods. Moreover, such a method could be extended to account for an additional aspect of robustness, namely 'incomplete' observations within a data set. Data irregularities and missing observations are very common issues for practitioners in several industry sectors, such as automotive, education, insurance, retail and telecommunications, all of which use segmentation techniques in some form. We therefore want to provide them with a framework under which they can obtain results that are meaningful and easily interpretable to them, without being affected by 'misleading' or missing observations. This project falls within the EPSRC Statistics and applied probability research area.
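The limitation of the naive approach can be made concrete with a small sketch. The data, the variable names, the cut-offs (`z_cut`, `min_freq`) and the within-category scoring rule are all illustrative assumptions, not the methodology the project sets out to develop. The last observation has a common category and a spend that lies inside the overall range, so per-variable checks pass it, yet its spend is unusual for its own category.

```python
from collections import Counter, defaultdict
from statistics import median


def robust_z_scores(values):
    """Robust z-scores based on the median and the scaled MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    return [0.0 if scale == 0 else (v - med) / scale for v in values]


def marginal_flags(rows, z_cut=3.5, min_freq=0.05):
    """Naive approach: examine each variable on its own."""
    freq = Counter(cat for cat, _ in rows)
    z = robust_z_scores([x for _, x in rows])
    return [i for i, (cat, _) in enumerate(rows)
            if abs(z[i]) > z_cut or freq[cat] / len(rows) < min_freq]


def conditional_flags(rows, z_cut=3.5):
    """Hypothetical alternative: score the numeric value within its category."""
    by_cat = defaultdict(list)
    for i, (cat, x) in enumerate(rows):
        by_cat[cat].append((i, x))
    flagged = []
    for members in by_cat.values():
        z = robust_z_scores([x for _, x in members])
        flagged += [i for (i, _), zi in zip(members, z) if abs(zi) > z_cut]
    return sorted(flagged)


# Toy customer records: (occupation, monthly spend). Invented for illustration.
rows = [("student", s) for s in (20, 25, 30, 35, 40, 22, 28, 33)]
rows += [("professional", s) for s in (70, 80, 90, 100, 75, 85, 95, 88)]
rows.append(("student", 90))  # common category, ordinary spend, odd pairing

print(marginal_flags(rows))     # [] : nothing looks unusual per variable
print(conditional_flags(rows))  # [16] : the odd pairing is flagged
```

Scoring the numerical variable within each category is only the simplest way to let one variable type inform the other, and it is shown here solely to illustrate why cross-type relationships matter.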
Project outputs
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other literature
Internet-administered, low-intensity cognitive behavioral therapy for parents of children treated for cancer: A feasibility trial (ENGAGE).
- DOI: 10.1002/cam4.5377
- Publication date: 2023-03
- Journal:
- Impact factor: 4
- Authors:
- Corresponding author:
Differences in child and adolescent exposure to unhealthy food and beverage advertising on television in a self-regulatory environment.
- DOI: 10.1186/s12889-023-15027-w
- Publication date: 2023-03-23
- Journal:
- Impact factor: 4.5
- Authors:
- Corresponding author:
The association between rheumatoid arthritis and reduced estimated cardiorespiratory fitness is mediated by physical symptoms and negative emotions: a cross-sectional study.
- DOI: 10.1007/s10067-023-06584-x
- Publication date: 2023-07
- Journal:
- Impact factor: 3.4
- Authors:
- Corresponding author:
ElasticBLAST: accelerating sequence search via cloud computing.
- DOI: 10.1186/s12859-023-05245-9
- Publication date: 2023-03-26
- Journal:
- Impact factor: 3
- Authors:
- Corresponding author:
Amplified EQCM-D detection of extracellular vesicles using 2D gold nanostructured arrays fabricated by block copolymer self-assembly.
- DOI: 10.1039/d2nh00424k
- Publication date: 2023-03-27
- Journal:
- Impact factor: 9.7
- Authors:
- Corresponding author:
Other grants
An implantable biosensor microsystem for real-time measurement of circulating biomarkers
- Grant number: 2901954
- Fiscal year: 2028
- Funding amount: --
- Project category: Studentship
Exploiting the polysaccharide breakdown capacity of the human gut microbiome to develop environmentally sustainable dishwashing solutions
- Grant number: 2896097
- Fiscal year: 2027
- Funding amount: --
- Project category: Studentship
A Robot that Swims Through Granular Materials
- Grant number: 2780268
- Fiscal year: 2027
- Funding amount: --
- Project category: Studentship
Likelihood and impact of severe space weather events on the resilience of nuclear power and safeguards monitoring.
- Grant number: 2908918
- Fiscal year: 2027
- Funding amount: --
- Project category: Studentship
Proton, alpha and gamma irradiation assisted stress corrosion cracking: understanding the fuel-stainless steel interface
- Grant number: 2908693
- Fiscal year: 2027
- Funding amount: --
- Project category: Studentship
Field Assisted Sintering of Nuclear Fuel Simulants
- Grant number: 2908917
- Fiscal year: 2027
- Funding amount: --
- Project category: Studentship
Assessment of new fatigue capable titanium alloys for aerospace applications
- Grant number: 2879438
- Fiscal year: 2027
- Funding amount: --
- Project category: Studentship
Developing a 3D printed skin model using a Dextran - Collagen hydrogel to analyse the cellular and epigenetic effects of interleukin-17 inhibitors in
- Grant number: 2890513
- Fiscal year: 2027
- Funding amount: --
- Project category: Studentship
Understanding the interplay between the gut microbiome, behavior and urbanisation in wild birds
- Grant number: 2876993
- Fiscal year: 2027
- Funding amount: --
- Project category: Studentship
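Similar NSFC grants
Applied basic research on novel heat-resistant alloy phases in aluminium alloys
- Grant number: 50801067
- Approval year: 2008
- Funding amount: CNY 200,000
- Project category: Young Scientists Fund
Research on clustering of high-dimensional sparse data
- Grant number: 70771007
- Approval year: 2007
- Funding amount: CNY 160,000
- Project category: General Program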
Similar overseas grants
Model-Based Clustering for Manifest Variables of Mixed Type
- Grant number: 424130-2012
- Fiscal year: 2017
- Funding amount: --
- Project category: Discovery Grants Program - Individual
Model-Based Clustering for Manifest Variables of Mixed Type
- Grant number: 424130-2012
- Fiscal year: 2016
- Funding amount: --
- Project category: Discovery Grants Program - Individual
QuBBD: Collaborative Research: Interactive Ensemble clustering for mixed data with application to mood disorders
- Grant number: 1557593
- Fiscal year: 2015
- Funding amount: --
- Project category: Standard Grant
QuBBD: Collaborative Research: Interactive Ensemble clustering for mixed data with application to mood disorders
- Grant number: 1557642
- Fiscal year: 2015
- Funding amount: --
- Project category: Standard Grant
QuBBD: Collaborative Proposal: Interactive Ensemble Clustering for Mixed Data with Application to Mood Disorders
- Grant number: 1557668
- Fiscal year: 2015
- Funding amount: --
- Project category: Standard Grant
QuBBD: Collaborative Research: Interactive Ensemble clustering for mixed data with application to mood disorders
- Grant number: 1557576
- Fiscal year: 2015
- Funding amount: --
- Project category: Standard Grant
QuBBD: Collaborative Research: Interactive Ensemble clustering for mixed data with application to mood disorders
- Grant number: 1557589
- Fiscal year: 2015
- Funding amount: --
- Project category: Standard Grant
Model-Based Clustering for Manifest Variables of Mixed Type
- Grant number: 424130-2012
- Fiscal year: 2015
- Funding amount: --
- Project category: Discovery Grants Program - Individual
Model-Based Clustering for Manifest Variables of Mixed Type
- Grant number: 424130-2012
- Fiscal year: 2014
- Funding amount: --
- Project category: Discovery Grants Program - Individual