CAREER: Query Compilation Techniques for Complex Analytics on Enterprise Clusters

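Basic Information

Award number:
1453171
Principal investigator:
Tim Kraska
Amount:
$550,000
Host institution:
Brown University
Host institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2015
Funding country:
United States
Start and end dates:
2015-06-01 to 2023-05-31
Project status:
Completed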

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1453171&HistoricalAwards=false
Keywords:
CAREER Query Compilation Techniques Complex

Project Abstract

Big data and the evolving field of data science are fundamentally shifting the meaning of analytics. Highly complex computations have come to define the typical workload, with jobs ranging from machine learning to large-scale visualization. However, there is a fundamental discrepancy between the analytical tools available to big Internet companies and those available to non-tech enterprises. Current analytics frameworks, like Spark or Hadoop, are designed to meet the needs of giant Internet companies; that is, they are built to process petabytes of data in cloud deployments consisting of thousands of cheap commodity machines. Yet non-tech companies like banks and retailers, or even the typical data scientist, seldom operate deployments of that size, instead preferring smaller clusters (enterprise clusters) with more reliable hardware. In fact, recent industry surveys reported that the median Hadoop cluster had fewer than 10 nodes, and that over 65% of users operate clusters smaller than 50 nodes. Targeting complex analytics workloads on smaller clusters, however, fundamentally changes the way we should design analytics tools. Most current systems focus on the major challenges associated with large cloud deployments, where network and disk I/O are the primary bottlenecks and failures are common, whereas the next generation of analytics frameworks should optimize specifically for the computation bottleneck. As part of this project, the PIs will systematically design a new open-source analytics engine, called Tupleware, built specifically for the infrastructure of non-tech companies. Tupleware will make complex analytics more accessible and push the boundaries of what computations are possible. Specifically, the PIs will design, implement, and evaluate various program synthesis (i.e., query compilation) techniques for complex analytics on enterprise clusters with fast interconnects and considerable available memory.
Existing query compilation techniques focus on SQL and are not designed for workloads where UDFs and iterations dominate the computation, nor do they target distributed setups; the PIs will address all of these issues in this proposal. Furthermore, the PIs aim to combine high-level query optimization with compiler technology to holistically optimize complex analytical workflows by considering statistics about the data (e.g., the selectivity of predicates) together with low-level statistics about the UDFs (e.g., the number of registers used). Finally, all results will be integrated into the Tupleware system and thus made accessible to a broader range of users. For further information, see the project web site at: http://tupleware.cs.brown.edu/

Project Outcomes

Journal articles (0)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Other publications by Tim Kraska

Building Database Applications in the Cloud

DOI:
10.3929/ethz-a-006007449
Year published:
2010
Journal:
Impact factor:
0
Authors:
Tim Kraska
Corresponding author:
Tim Kraska

Towards a Benchmark for the Cloud

DOI:
Year published:
2018
Journal:
Impact factor:
0
Authors:
Carsten Binnig; Donald Kossmann; Tim Kraska; Simon Losing
Corresponding author:
Simon Losing

Self-Organizing Data Containers

DOI:
Year published:
2022
Journal:
Conference on Innovative Data Systems Research
Impact factor:
0
Authors:
S. Madden; Jialin Ding; Tim Kraska; Sivaprasad Sudhir; David Cohen; T. Mattson; Nesime Tatbul
Corresponding author:
Nesime Tatbul