CAREER: Query Compilation Techniques for Complex Analytics on Enterprise Clusters
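Basic information
- Award number: 1453171
- Amount: $550,000
- Host institution country: United States
- Project category: Continuing Grant
- Fiscal year: 2015
- Funding country: United States
- Period: 2015-06-01 to 2023-05-31
- Project status: Completed
Project abstract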
Big data and the evolving field of Data Science are fundamentally shifting the meaning of analytics. Highly complex computations have come to define the typical workload, with jobs ranging from machine learning to large-scale visualization. However, there is a fundamental discrepancy between the analytical tools available to big Internet companies and those available to non-tech enterprises. Current analytics frameworks, like Spark or Hadoop, are designed to meet the needs of giant Internet companies; that is, they are built to process petabytes of data in cloud deployments consisting of thousands of cheap commodity machines. Yet non-tech companies like banks and retailers, or even the typical data scientist, seldom operate deployments of that size, instead preferring smaller clusters (also known as enterprise clusters) with more reliable hardware. In fact, recent industry surveys reported that the median Hadoop cluster had fewer than 10 nodes, and over 65% of users operate clusters smaller than 50 nodes. Targeting complex analytics workloads on smaller clusters, however, fundamentally changes the way we should design analytics tools. Most current systems focus on the major challenges associated with large cloud deployments, where network and disk I/O are the primary bottlenecks and failures are common, whereas the next generation of analytics frameworks should optimize specifically for the computation bottleneck. As part of this project, the PIs will systematically design a new open-source analytics engine, called Tupleware, built specifically for the infrastructure of non-tech companies. Tupleware will make complex analytics more accessible and push the boundaries of what computations are possible.
Specifically, the PIs will design, implement, and evaluate various program synthesis (i.e., query compilation) techniques for complex analytics on enterprise clusters with fast interconnects and considerable available memory.
Existing query compilation techniques focus on SQL and are not designed for workloads where UDFs and iterations dominate the computation, nor do they target distributed setups; the PIs will address all of these issues in this proposal. Furthermore, the PIs aim to combine high-level query optimization with compiler technology to holistically optimize complex analytical workflows by considering statistics about the data (e.g., the selectivity of predicates) together with low-level statistics about the UDFs (e.g., the number of registers used). Finally, all results will be integrated into the Tupleware system and thus made accessible to a broader range of users.
For further information, see the project web site at: http://tupleware.cs.brown.edu/
Project outcomes
- Journal articles (0)
- Monographs (0)
- Research awards (0)
- Conference papers (0)
- Patents (0)
Other publications by Tim Kraska
Building Database Applications in the Cloud
- DOI: 10.3929/ethz-a-006007449
- Year: 2010
- Impact factor: 0
- Authors: Tim Kraska
- Corresponding author: Tim Kraska
Towards a Benchmark for the Cloud
- Year: 2018
- Impact factor: 0
- Authors: Carsten Binnig; Donald Kossmann; Tim Kraska; Simon Losing
- Corresponding author: Simon Losing
Safe Visual Data Exploration
- Year: 2017
- Impact factor: 0
- Authors: Zheguang Zhao; Emanuel Zgraggen; L. Stefani; Carsten Binnig; E. Upfal; Tim Kraska
- Corresponding author: Tim Kraska
Self-Organizing Data Containers
- Year: 2022
- Impact factor: 0
- Authors: S. Madden; Jialin Ding; Tim Kraska; Sivaprasad Sudhir; David Cohen; T. Mattson; Nesime Tatbul
- Corresponding author: Nesime Tatbul
Making the Case for Query-by-Voice with EchoQuery
- DOI: 10.1145/2882903.2899394
- Year: 2016
- Impact factor: 0
- Authors: Gabriel Lyons; Vinh Q. Tran; Carsten Binnig; U. Çetintemel; Tim Kraska
- Corresponding author: Tim Kraska
Other grants awarded to Tim Kraska
III: Medium: Quantifying the Unknown Unknowns for Data Integration
- Award number: 2033792
- Fiscal year: 2020
- Amount: $550,000
- Project category: Continuing Grant
BD Spokes: SPOKE: NORTHEAST: Collaborative: A Licensing Model and Ecosystem for Data Sharing
- Award number: 1947440
- Fiscal year: 2019
- Amount: $550,000
- Project category: Standard Grant
III: Medium: Learning-based Synthesis of Data Processing Engines
- Award number: 1900933
- Fiscal year: 2019
- Amount: $550,000
- Project category: Continuing Grant
III: Medium: Quantifying the Unknown Unknowns for Data Integration
- Award number: 1562657
- Fiscal year: 2016
- Amount: $550,000
- Project category: Continuing Grant
BD Spokes: SPOKE: NORTHEAST: Collaborative: A Licensing Model and Ecosystem for Data Sharing
- Award number: 1636698
- Fiscal year: 2016
- Amount: $550,000
- Project category: Standard Grant
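Similar NSFC (National Natural Science Foundation of China) grants
Research on a General Lossless Compression Framework for Large-Scale Graph Query Problems
- Approval year: 2020
- Amount: 240,000 CNY
- Project category: Young Scientists Fund
New for the Sake of New: Consumers' Preference for Time-Label-Based Freshness and Its Underlying Mechanism
- Award number: 71902198
- Approval year: 2019
- Amount: 180,000 CNY
- Project category: Young Scientists Fund
Research on Visibility Query Problems in Three-Dimensional Obstacle Spaces
- Approval year: 2019
- Amount: 590,000 CNY
- Project category: General Program
Research on Big Data Query Problems Based on High-Quality Data Separation
- Award number: 61702220
- Approval year: 2017
- Amount: 250,000 CNY
- Project category: Young Scientists Fund
Research on Location-Dependent Skyline Queries in Wireless Data Broadcast Environments
- Award number: 61170174
- Approval year: 2011
- Amount: 500,000 CNY
- Project category: General Program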
Similar overseas grants
III: Small: Native Compilation, Query Processing, and Indexing for In-memory Graph Relational Data Systems
- Award number: 1910216
- Fiscal year: 2019
- Amount: $550,000
- Project category: Standard Grant
RealPDBs: Realistic Data Models and Query Compilation for Large-Scale Probabilistic Databases
- Award number: EP/R013667/1
- Fiscal year: 2017
- Amount: $550,000
- Project category: Research Grant
III: Small: Non-Invasive Real-Time Analytics in Database Systems using Holistic Query Compilation
- Award number: 1718582
- Fiscal year: 2017
- Amount: $550,000
- Project category: Continuing Grant
Query Compilation for the Heterogeneous Many Core Age
- Award number: 361497736
- Fiscal year: 2017
- Amount: $550,000
- Project category: Priority Programmes
III: Small: Query Compilation on Probabilistic Databases
- Award number: 1115188
- Fiscal year: 2011
- Amount: $550,000
- Project category: Standard Grant