Research on a Distributed Progressive Classification Mining Method for Big Data Based on Reuse of Stored Knowledge

Final Report

Project Overview

Basic Information

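Approval number:
61702229
Project category:
Young Scientists Fund (青年科学基金项目)
Funding amount:
CNY 220,000 (22.0万)
Principal investigator:
Shen Yan (申彦)
Host institution:
Jiangsu University (江苏大学)
Discipline:
F06. Artificial Intelligence
Completion year:
2020
Approval year:
2017
Project status:
Completed
Duration:
2018-01-01 to 2020-12-31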

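Project participants:
Zhu Yuquan (朱玉全); Geng Xia (耿霞); Liu Xiangwen (刘湘雯); Peng Xiaobing (彭晓冰); Min Xinjun (闵信军); Wang Bochen (王博宸); Miao Qi (缪琦)
Keywords:
classification algorithms; distributed learning; ensemble learning; big data; mining algorithms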

Project Abstract

Humans acquire knowledge through learning and keep improving by building on what they already know. Current data mining research emphasizes discovering knowledge by analyzing data, and makes little use of previously learned knowledge or domain knowledge. The problem is especially acute in big data mining, where large amounts of time and computing resources are spent re-analyzing the same data. Traditional incremental mining algorithms do not preserve old knowledge: once new knowledge is learned, the old knowledge is forgotten, so when an earlier or similar scenario recurs, the old knowledge cannot play its due role. Taking into account the staged, time-sensitive, and periodic nature of big data, this project studies a method that reuses stored knowledge to mine big data progressively, stage by stage, in a distributed environment, improving both the efficiency and the accuracy of classification mining from the perspective of knowledge reuse. The main research topics are: a method for distributed progressive classification mining of big data; a method for fusing, selecting, and reusing learned classification knowledge; a method for validating, correcting, and temporally ensembling learned classification knowledge; and a method for integrating domain knowledge. The goal is to establish a mechanism that can reuse both the empirical knowledge accumulated from data and the theoretical knowledge of the domain, forming a virtuous cycle of distributed progressive classification learning for big data.

Completion Abstract

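Classification mining of accumulated big data in non-stationary environments is one of the active problems in machine learning and pattern recognition. This project studies distributed progressive mining of accumulated big data, in which knowledge is stored step by step during mining and selectively reused, yielding a classification learning method with memory that improves the efficiency and accuracy of big data classification mining from the perspective of knowledge reuse.

Research content:

1) A method for distributed progressive classification mining of big data.

2) A method for fusing, selecting, and reusing learned classification knowledge.

3) A method for validating, correcting, and temporally ensembling learned classification knowledge.

4) A method for using domain knowledge during distributed progressive classification mining.

Main results:

1) A fast LearnNSE algorithm based on a sliding window. The algorithm computes each base classifier's voting weight from only its classification accuracy within a recent window. It improves learning efficiency while achieving the same classification accuracy as LearnNSE.

2) SBS-CLearning, a classification algorithm that uses a progressive learning mode. It first learns incrementally on top of the previous stage's base classifiers and then completes the final weighted ensemble, improving classification accuracy relative to LearnNSE.

3) PRLearnNSE, a parallel reverse classification algorithm. It changes the ensemble mechanism of the base classifiers, using old base classifiers to supplement the new base classifier and thereby forming a parallel ensemble mechanism. It greatly improves learning efficiency while achieving classification accuracy close to that of LearnNSE.

4) A temporal multi-classifier ensemble algorithm based on a forward supplement mechanism. It adjusts the ensemble mechanism of LearnNSE: an ensemble of the most recent base classifiers tracks changes in the data-generating environment, and old base classifiers that help the current classification are then selected for forward supplementary ensembling. The algorithm can reuse learned classification knowledge. Its classification accuracy is very close to, and in some scenarios better than, that of LearnNSE, and it also improves ensemble learning efficiency.

5) DSPM, a distributed temporal processing model. DSPM achieves accuracy very close to, and in many scenarios better than, that of LearnNSE while improving learning efficiency. It handles both big data generated over short periods and big data accumulated over long periods, and suits applications with strict real-time requirements for classification mining.

The classification algorithms developed in this project can use not only learned knowledge from training data but also theoretical knowledge from the domain, and they provide a useful reference for research on classification mining of accumulated big data in non-stationary environments.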
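The sliding-window weighting described in result 1) can be illustrated with a small sketch. It is not the project's algorithm: the class name `WindowedVoter`, the `window` and `eps` parameters, and the log-odds weight formula are all assumptions chosen for illustration. The sketch only shows the general idea of deriving a base classifier's voting weight from its error on the most recent batches instead of its whole history.

```python
import math
from collections import deque


class WindowedVoter:
    """Hypothetical helper: weights base classifiers by recent-window error."""

    def __init__(self, window=3, eps=1e-6):
        self.window = window  # number of recent batches that count (assumed)
        self.eps = eps        # keeps the log-odds weight finite
        self.errors = []      # one bounded deque of batch error rates per classifier

    def add_classifier(self):
        # deque(maxlen=...) silently drops the oldest entry when full,
        # which is what makes the window "slide".
        self.errors.append(deque(maxlen=self.window))
        return len(self.errors) - 1

    def record(self, clf_index, predictions, labels):
        # Store this classifier's error rate on the newest batch.
        wrong = sum(p != y for p, y in zip(predictions, labels))
        self.errors[clf_index].append(wrong / len(labels))

    def weight(self, clf_index):
        recent = self.errors[clf_index]
        if not recent:
            return 0.0
        err = sum(recent) / len(recent)
        err = min(max(err, self.eps), 1 - self.eps)
        if err >= 0.5:
            return 0.0  # no better than chance in the window: no vote
        # Assumed weight formula: log-odds of being correct in the window.
        return math.log((1 - err) / err)

    def vote(self, predictions):
        # predictions[i] is the label proposed by base classifier i.
        totals = {}
        for i, label in enumerate(predictions):
            totals[label] = totals.get(label, 0.0) + self.weight(i)
        return max(totals, key=totals.get)
```

Because each classifier keeps at most `window` error values, updating and reading a weight costs O(window) regardless of how many batches have been seen, which is the kind of saving a windowed scheme aims for. A classifier that performed badly long ago but well recently regains its full vote once the old batches leave the window.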