
Online Mining of Big Data Streams Using Cloud Computing


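Basic Information

Grant number:
RGPIN-2014-06565
Principal investigator:
An, Aijun
Amount:
$39,300
Host institution:
York University
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2017
Funding country:
Canada
Start and end dates:
2017-01-01 to 2018-12-31
Project status:
Completed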

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=634718
Keywords:
Online Mining Big Data Streams

Project Abstract

In a world where data are growing at extraordinary rates, there is huge demand for fast and effective analysis of big data to discover useful information for making business decisions. This research program tackles the problem of discovering useful information from big data streams. Big data streams, characterized by high volume and high velocity, have become ubiquitous as many sources (such as social networks, sensor networks and financial markets) produce data continuously and rapidly. Effectively and efficiently discovering patterns in such massive and fast-evolving data will allow businesses to react quickly to their dynamically changing environment, for example to perform fraud detection at a point of sale, determine which ad to show, or detect spam in news comments, where trends change quickly over time. Many challenges exist in discovering useful information from big data streams. To handle very fast data, systems have to process the data as fast as they arrive. However, most existing data stream mining methods are sequential algorithms that run on a single machine and are limited by the memory and speed of that machine. To mine massive data, parallel and distributed computing over a cloud of computers has become a mainstream solution for achieving low latency and high scalability, and MapReduce has become a popular programming paradigm for easily writing applications that process massive data in parallel in a fault-tolerant manner. However, converting a stream mining algorithm into an online parallel MapReduce-style algorithm poses challenges. Most learning algorithms are highly sequential. Parallelizing such algorithms requires considerable effort and may require the design of new algorithms. In addition, in stream environments, data flow into the system at a rate over which we have no control. The processing system must keep up with the data rate or degrade gracefully. Resource-adaptive online learning with bounded approximation is therefore much needed, but it has not been adequately addressed in the MapReduce-style data processing model.

To address the above challenges, we will develop parallel versions of stream-mining algorithms using MapReduce-style distributed stream-processing platforms. We will build on our previous and ongoing research in data mining and parallelize the stream-mining algorithms that we have developed recently, which include, but are not limited to, classification rule learning, high utility pattern mining, and Monte Carlo-based learning algorithms. In addition, we will develop resource-adaptive techniques for learning from big data streams. Adaptive data structures and anytime learning algorithms will be developed that can produce the best possible answers under resource constraints and can use extra time and memory, if given, to increase the quality of the answers. Moreover, we will identify the pros and cons of developing parallel stream mining algorithms using state-of-the-art MapReduce-style stream processing platforms and provide feedback to the community on what is further needed in these platforms for them to better serve online learning of big data streams in the cloud.

Mining big and fast data streams using cloud computing is still in its infancy. The proposed research will advance the field by proposing novel solutions to its open challenges and will have a wide range of applications in various fields that produce massive data streams.
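
As a rough illustration of what "bounded approximation" under a memory limit and a MapReduce-style merge can look like, the sketch below uses the classic Misra-Gries frequent-items summary. This is a stand-in chosen for brevity, not one of the algorithms named in the abstract; the function names `misra_gries` and `merge` and the parameter `k` are hypothetical. With at most k - 1 counters, each reported count underestimates the true count by at most n / k for a stream of n items, and summaries built on separate shards can be combined while keeping that bound over the combined stream.

```python
from collections import Counter
from typing import Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Counter:
    """Summarize a stream with at most k - 1 counters (the 'map' side).

    Illustrative only: each stored count is an underestimate of the true
    count by at most n / k, where n is the number of items seen.
    """
    counters: Counter = Counter()
    for item in stream:
        if item in counters or len(counters) < k - 1:
            counters[item] += 1
        else:
            # No free counter: decrement every counter, drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


def merge(a: Counter, b: Counter, k: int) -> Counter:
    """Combine two shard summaries (the 'reduce' side), keeping <= k - 1 counters."""
    merged = a + b
    if len(merged) >= k:
        # Subtract the k-th largest count and keep only strictly positive results.
        kth = sorted(merged.values(), reverse=True)[k - 1]
        merged = Counter({key: c - kth for key, c in merged.items() if c > kth})
    return merged
```

A caller might summarize each shard of a stream independently with `misra_gries(shard, k)` and fold the results together with `merge`. Choosing a larger `k` spends more memory for a tighter error bound, which is one simple form of the resource/quality trade-off the abstract describes.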
