
Efficient query processing and optimizations for big data workloads

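Basic Information

Grant number: RGPIN-2015-04587
Principal investigator: Koudas, Nikolaos
Amount: approximately $43,700
Host institution: University of Toronto
Host institution country: Canada
Program: Discovery Grants Program - Individual
Fiscal year: 2018
Funding country: Canada
Period: 2018-01-01 to 2019-12-31
Project status: Completed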

Source: https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=655933
Keywords: Efficient query processing optimizations big

Project Abstract

Every aspect of computing has been experiencing exponential growth, from sensory data acquisition throughput to processor power, storage, and bandwidth. These exponential improvements are enabling the big data revolution. Big data applications consist of volumes of data that are constantly produced in a streaming fashion (e.g., sensor readings, logs, click-throughs). In addition, typical research and analysis workflows on big data are iterative: a model is built using some data parameters, then iteratively refined using the output of the previous modeling phase. Both primitives, streaming data generation and iterative analysis workflows, provide many opportunities for optimization.

The goal of this project is to explore these primitives and deliver fundamental algorithms and techniques to efficiently process and optimize big data workloads. The end goal is to incorporate such techniques into an end-to-end data processing architecture. Streaming data generation provides the opportunity to maintain models already computed on the data in an incremental fashion. As a first example, a statistical operator computed on a data set can be incrementally maintained as new data is appended to the data set. In addition, models already computed on the data can be combined incrementally (among themselves or with base data) to answer new modeling queries. Combining two models could perform far better than computing a new model from scratch.

In this project, we plan to introduce incremental computation as a first-class citizen in our system design. We will incrementally maintain models of interest as new data arrive; by materializing such models, analysis phases will be able to reuse results available from prior analysis. Suitable optimization frameworks will be developed to assess when and under what conditions such combinations and incremental maintenance of models are beneficial. It is evident that the performance of subsequent analysis tasks will benefit from model reuse and/or combination for a wide class of models, covering both exact and approximate computations. Second, we plan to build an end-to-end system encompassing our innovations. Our design will be centered on popular languages for statistical processing and data analysis to express modeling workloads (e.g., R) and on suitable systems infrastructure to implement and execute our framework.

The end product of our research will be a system that brings together all of the research conducted and delivers very fast big data analytics through familiar analytical query processing interfaces such as R. Such a system will help data scientists conduct advanced research in a fraction of the time otherwise required, by letting them seamlessly reuse and share results in an incremental fashion.
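The abstract's first example, a statistical operator that is maintained incrementally and combined with other precomputed results, can be illustrated with a running mean and variance. The sketch below is not taken from the project. The class name `RunningStats`, the helper `summarize`, and the sample values are hypothetical, and the choice of Welford's update and the pairwise merge formula of Chan et al. is an assumption made only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Count, mean, and sum of squared deviations (m2) of a numeric stream."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        """Fold one newly appended value into the summary (Welford's update)."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        """Combine two summaries without revisiting the underlying data."""
        if self.n == 0:
            return RunningStats(other.n, other.mean, other.m2)
        if other.n == 0:
            return RunningStats(self.n, self.mean, self.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self) -> float:
        """Sample variance; defined only for two or more values."""
        if self.n < 2:
            raise ValueError("variance needs at least two values")
        return self.m2 / (self.n - 1)


def summarize(values) -> RunningStats:
    """Build a summary by folding in the values one at a time."""
    stats = RunningStats()
    for v in values:
        stats.update(v)
    return stats


# Hypothetical data: a summary materialized earlier, plus a newly arrived batch.
earlier = summarize([2.0, 4.0, 4.0, 4.0])
arrived = summarize([5.0, 5.0, 7.0, 9.0])
combined = earlier.merge(arrived)  # same statistics as a full recomputation
```

Here `update` costs constant time per appended value and `merge` costs constant time per pair of summaries, which is the kind of saving over recomputation from scratch that the abstract points to. Whether a given model admits such an update or merge rule depends on the model.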
