
Adaptive Online Mining of Big Data Streams


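Basic information

Grant number:
RGPIN-2019-06799
Principal investigator:
An, Aijun
Amount:
$35,000
Host institution:
York University
Host institution country:
Canada
Project category:
Discovery Grants Program - Individual
Fiscal year:
2021
Funding country:
Canada
Project period:
2021-01-01 to 2022-12-31
Project status:
Completed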

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=739113
Keywords:
Adaptive Online Mining Big Data

Project abstract

Big data streams are continuous flows of data that arrive in high volume and at high velocity. Such data streams have become ubiquitous as many sources, such as sensor networks, business transactions and surveillance cameras, produce data continuously and rapidly. There is a growing demand for real-time online analysis of and learning from big data streams, so that patterns and models can be learned from such data in a timely manner to support fast decision making in a dynamic environment, for example, fraud detection at a point of sale. However, although machine learning techniques have become very effective in many applications, such as computer vision and speech recognition, learning a complex model (such as a deep neural network) from big data can be very time-consuming, making it impractical to do online. The objective of this research program is to accelerate machine learning programs and make them work in an online fashion. We will develop novel techniques for parallelizing machine learning models to speed up the learning process.
We will address the following challenges in online learning from data streams. First, online algorithms are often constrained by space and time. Not all data can be stored, and algorithms often need to process the data in a single pass. However, many machine learning algorithms require a large number of passes over the data to find good solutions. How to adapt such learning algorithms to work online is an open challenge. Second, streaming data evolve over time with unknown dynamics, a phenomenon known as concept drift. Online learning from data streams should keep the model up to date and quickly adapt it to concept drift. Although methods have been proposed to train drift-adaptive online models, little has been done on adaptively learning complex nonlinear functions. Third, in streaming environments, data flow into the system at an unpredictable rate, and the available resources may change dynamically due to resource sharing with other computing jobs. The processing system must keep up with changes in both the data rate and the available resources. Resource-adaptive online learning is highly needed.
We will develop novel strategies for handling concept drift and resource constraints when learning complex functions online. Parallel and distributed solutions over multiple processors or machines will be developed to speed up learning. Strategies that trade off resource consumption against the accuracy of the learned model according to resource conditions will be investigated. Anytime learning algorithms will be designed to produce the best possible answer within real-time constraints. Distributed online mining of big and fast data streams is still far from mature. The proposed research will advance the field by proposing novel solutions to its open challenges and will have a wide range of applications, e.g., real-time fraud detection and image recognition in a dynamic environment.
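To make the first two challenges concrete, the following is a minimal illustrative sketch, not the project's method. It trains a logistic regression in a single pass (each example is seen once, then discarded) and reacts to concept drift with a deliberately naive heuristic: the model is reset when the error rate over a sliding window exceeds a threshold. All names (`OnlineLogisticRegression`, `run_stream`, `synthetic_stream`), the window size, the threshold, and the synthetic data are assumptions made for illustration only.

```python
import math
import random
from collections import deque


class OnlineLogisticRegression:
    """Logistic regression updated one example at a time (single pass)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        # One stochastic gradient step on the log-loss for this example.
        error = self.predict_proba(x) - y
        for i, xi in enumerate(x):
            self.weights[i] -= self.learning_rate * error * xi
        self.bias -= self.learning_rate * error


def synthetic_stream(n, drift_at, seed=0):
    """Hypothetical stream: label is 1 if x[0] > x[1]; labels flip at drift_at."""
    rng = random.Random(seed)
    for t in range(n):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        label = int(x[0] > x[1])
        if t >= drift_at:
            label = 1 - label  # abrupt concept drift
        yield x, label


def run_stream(stream, n_features, window=200, drift_threshold=0.35):
    """Test-then-train on each example; reset the model when the recent
    error rate exceeds drift_threshold (a naive drift heuristic)."""
    model = OnlineLogisticRegression(n_features)
    recent_errors = deque(maxlen=window)
    mistakes = total = resets = 0
    for x, y in stream:
        predicted = 1 if model.predict_proba(x) >= 0.5 else 0
        wrong = int(predicted != y)
        mistakes += wrong
        total += 1
        recent_errors.append(wrong)
        model.learn_one(x, y)  # the example is not stored afterwards
        window_full = len(recent_errors) == window
        if window_full and sum(recent_errors) / window > drift_threshold:
            model = OnlineLogisticRegression(n_features)
            recent_errors.clear()
            resets += 1
    error_rate = mistakes / total if total else 0.0
    return model, error_rate, resets


if __name__ == "__main__":
    stream = synthetic_stream(n=4000, drift_at=2000)
    model, error_rate, resets = run_stream(stream, n_features=2)
    print(f"error rate: {error_rate:.3f}, resets: {resets}")
```

Resetting the whole model is the bluntest possible response to drift and discards everything learned so far. It is used here only because it is easy to follow. The abstract's stated aim, adapting complex nonlinear models under changing resource budgets, is considerably harder than this linear example suggests.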
