CAREER: Query Compilation Techniques for Complex Analytics on Enterprise Clusters

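Basic information

  • Award number:
    1453171
  • Principal investigator:
  • Amount:
    $550,000
  • Host institution:
  • Host institution country:
    United States
  • Award type:
    Continuing Grant
  • Fiscal year:
    2015
  • Funding country:
    United States
  • Period:
    2015-06-01 to 2023-05-31
  • Status:
    Completed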

Project abstract

Big data and the evolving field of Data Science are fundamentally shifting the meaning of analytics. Highly complex computations have come to define the typical workload, with jobs ranging from machine learning to large-scale visualization. However, there is a fundamental discrepancy between the availability of analytical tools for big Internet companies and those for non-tech enterprises. Current analytics frameworks, like Spark or Hadoop, are designed to meet the needs of giant Internet companies; that is, they are built to process petabytes of data in cloud deployments consisting of thousands of cheap commodity machines. Yet non-tech companies like banks and retailers, or even the typical data scientist, seldom operate deployments of that size, instead preferring smaller clusters, also known as enterprise clusters, with more reliable hardware. In fact, recent industry surveys reported that the median Hadoop cluster had fewer than 10 nodes, and over 65% of users operate clusters smaller than 50 nodes. Targeting complex analytics workloads on smaller clusters, however, fundamentally changes the way we should design analytics tools. Most current systems focus on the major challenges associated with large cloud deployments, where network and disk I/O are the primary bottlenecks and failures are common, whereas the next generation of analytics frameworks should optimize specifically for the computation bottleneck. As part of this project, the PIs will systematically design a new open-source analytics engine, called Tupleware, built specifically for the infrastructure of non-tech companies. Tupleware will make complex analytics more accessible and push the boundaries of what computations are possible.
Specifically, the PIs will design, implement, and evaluate various program synthesis (i.e., query compilation) techniques for complex analytics on enterprise clusters with fast interconnects and considerable available memory.
Existing query compilation techniques focus on SQL and are not designed for workloads where UDFs and iterations dominate the computation, nor do they target distributed setups; the PIs will address all of these issues in this proposal. Furthermore, the PIs aim to combine high-level query optimization with compiler technology to holistically optimize complex analytical workflows by combining statistics about the data (e.g., the selectivity of predicates) with low-level statistics about the UDFs (e.g., the number of registers used). Finally, all results will be integrated into the Tupleware system and thus made accessible to a broader range of users.
For further information, see the project website at: http://tupleware.cs.brown.edu/

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Tim Kraska

Building Database Applications in the Cloud
  • DOI:
    10.3929/ethz-a-006007449
  • Year published:
    2010
  • Journal:
  • Impact factor:
    0
  • Authors:
    Tim Kraska
  • Corresponding author:
    Tim Kraska
Towards a Benchmark for the Cloud
  • DOI:
  • Year published:
    2018
  • Journal:
  • Impact factor:
    0
  • Authors:
    Carsten Binnig;Donald Kossmann;Tim Kraska;Simon Losing
  • Corresponding author:
    Simon Losing
Self-Organizing Data Containers
Safe Visual Data Exploration
  • DOI:
  • Year published:
    2017
  • Journal:
  • Impact factor:
    0
  • Authors:
    Zheguang Zhao;Emanuel Zgraggen;L. Stefani;Carsten Binnig;E. Upfal;Tim Kraska
  • Corresponding author:
    Tim Kraska
Supplementary Materials for Niseko: a Large-Scale Meta-Learning Dataset
  • DOI:
  • Year published:
    2019
  • Journal:
  • Impact factor:
    0
  • Authors:
    Zeyuan Shang;Emanuel Zgraggen;P. Eichmann;Tim Kraska
  • Corresponding author:
    Tim Kraska

Other grants by Tim Kraska

III: Medium: Quantifying the Unknown Unknowns for Data Integration
  • Award number:
    2033792
  • Fiscal year:
    2020
  • Award amount:
    $550,000
  • Award type:
    Continuing Grant
BD Spokes: SPOKE: NORTHEAST: Collaborative: A Licensing Model and Ecosystem for Data Sharing
  • Award number:
    1947440
  • Fiscal year:
    2019
  • Award amount:
    $550,000
  • Award type:
    Standard Grant
III: Medium: Learning-based Synthesis of Data Processing Engines
  • Award number:
    1900933
  • Fiscal year:
    2019
  • Award amount:
    $550,000
  • Award type:
    Continuing Grant
III: Medium: Quantifying the Unknown Unknowns for Data Integration
  • Award number:
    1562657
  • Fiscal year:
    2016
  • Award amount:
    $550,000
  • Award type:
    Continuing Grant
BD Spokes: SPOKE: NORTHEAST: Collaborative: A Licensing Model and Ecosystem for Data Sharing
  • Award number:
    1636698
  • Fiscal year:
    2016
  • Award amount:
    $550,000
  • Award type:
    Standard Grant

Similar international grants

III: Small: Query-By-Sketch: Simplifying Video Clip Retrieval Through A Visual Query Paradigm
  • Award number:
    2335881
  • Fiscal year:
    2024
  • Award amount:
    $550,000
  • Award type:
    Standard Grant
Beyond Query: Exploratory Subgraph Discovery and Search System
  • Award number:
    DP240101591
  • Fiscal year:
    2024
  • Award amount:
    $550,000
  • Award type:
    Discovery Projects
CRII: AF: Applications of Spectral Sensitivity to Query and Communication Complexity
  • Award number:
    2348489
  • Fiscal year:
    2024
  • Award amount:
    $550,000
  • Award type:
    Standard Grant
Quantum Error Correction in a dual-species Rydberg array (QuERy)
  • Award number:
    EP/X025055/1
  • Fiscal year:
    2023
  • Award amount:
    $550,000
  • Award type:
    Research Grant
Large Language Models for Query Optimisation: A New Paradigm in Database Systems
  • Award number:
    2726025
  • Fiscal year:
    2023
  • Award amount:
    $550,000
  • Award type:
    Studentship
III: Small: RUI: Designing Structure-Phenotype Query-Retrieval and Analysis Systems for Microscopy-Based Whole Organism Studies
  • Award number:
    2401096
  • Fiscal year:
    2023
  • Award amount:
    $550,000
  • Award type:
    Standard Grant
Query Evaluation
  • Award number:
    EP/V039318/1
  • Fiscal year:
    2023
  • Award amount:
    $550,000
  • Award type:
    Research Grant
Advanced Security and Privacy Techniques for Secure Big Data Query, Sharing and Processing
  • Award number:
    RGPIN-2022-03244
  • Fiscal year:
    2022
  • Award amount:
    $550,000
  • Award type:
    Discovery Grants Program - Individual
CIVIC-PG Track B: Understanding Native American Tribal Residents Needs through Better Data and Query Systems
  • Award number:
    2228275
  • Fiscal year:
    2022
  • Award amount:
    $550,000
  • Award type:
    Standard Grant
CSR: Medium: Approximate Membership Query Data Structures in Computational Biology and Storage
  • Award number:
    2317838
  • Fiscal year:
    2022
  • Award amount:
    $550,000
  • Award type:
    Continuing Grant