CAREER: Query Compilation Techniques for Complex Analytics on Enterprise Clusters
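Basic Information
- Award number: 1453171
- Principal investigator:
- Amount: $550,000
- Host institution:
- Host institution country: United States
- Award type: Continuing Grant
- Fiscal year: 2015
- Funding country: United States
- Project period: 2015-06-01 to 2023-05-31
- Project status: Completed
- Source:
- Keywords:
Project Abstract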
Big data and the evolving field of Data Science are fundamentally shifting the meaning of analytics. Highly complex computations have come to define the typical workload, with jobs ranging from machine learning to large-scale visualization. However, there is a fundamental discrepancy between the availability of analytical tools for big Internet companies and those for non-tech enterprises. Current analytics frameworks, like Spark or Hadoop, are designed to meet the needs of giant Internet companies; that is, they are built to process petabytes of data in cloud deployments consisting of thousands of cheap commodity machines. Yet non-tech companies like banks and retailers, or even the typical data scientist, seldom operate deployments of that size, instead preferring smaller clusters, also known as enterprise clusters, with more reliable hardware. In fact, recent industry surveys reported that the median Hadoop cluster had fewer than 10 nodes, and over 65% of users operate clusters smaller than 50 nodes. Targeting complex analytics workloads on smaller clusters, however, fundamentally changes the way we should design analytics tools. Most current systems focus on the major challenges associated with large cloud deployments, where network and disk I/O are the primary bottleneck and failures are common, whereas the next generation of analytics frameworks should optimize specifically for the computation bottleneck. As part of this project, the PIs will systematically design a new open-source analytics engine, called Tupleware, built specifically for the infrastructure of non-tech companies. Tupleware will make complex analytics more accessible and push the boundaries of what computations are possible.
Specifically, the PIs will design, implement, and evaluate various program synthesis techniques, i.e., query compilation techniques, for complex analytics on enterprise clusters with fast interconnects and considerable available memory. Existing query compilation techniques focus on SQL and are not designed for workloads where UDFs and iterations dominate the computation, nor do they target distributed setups; the PIs will address all of these issues in this proposal. Furthermore, the PIs aim to combine high-level query optimization with compiler technology to holistically optimize complex analytical workflows by considering statistics about the data (e.g., the selectivity of predicates) together with low-level statistics about the UDFs (e.g., the number of used registers). Finally, all the results will be integrated into the Tupleware system and thus made accessible to a broader range of users.
For further information see the project web site at: http://tupleware.cs.brown.edu/
Project Outputs
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Tim Kraska
Building Database Applications in the Cloud
- DOI: 10.3929/ethz-a-006007449
- Year published: 2010
- Journal:
- Impact factor: 0
- Authors: Tim Kraska
- Corresponding author: Tim Kraska
Towards a Benchmark for the Cloud
- DOI:
- Year published: 2018
- Journal:
- Impact factor: 0
- Authors: Carsten Binnig; Donald Kossmann; Tim Kraska; Simon Losing
- Corresponding author: Simon Losing
Self-Organizing Data Containers
- DOI:
- Year published: 2022
- Journal:
- Impact factor: 0
- Authors: S. Madden; Jialin Ding; Tim Kraska; Sivaprasad Sudhir; David Cohen; T. Mattson; Nesime Tatbul
- Corresponding author: Nesime Tatbul
Safe Visual Data Exploration
- DOI:
- Year published: 2017
- Journal:
- Impact factor: 0
- Authors: Zheguang Zhao; Emanuel Zgraggen; L. Stefani; Carsten Binnig; E. Upfal; Tim Kraska
- Corresponding author: Tim Kraska
Supplementary Materials for Niseko: a Large-Scale Meta-Learning Dataset
- DOI:
- Year published: 2019
- Journal:
- Impact factor: 0
- Authors: Zeyuan Shang; Emanuel Zgraggen; P. Eichmann; Tim Kraska
- Corresponding author: Tim Kraska
Other grants by Tim Kraska
III: Medium: Quantifying the Unknown Unknowns for Data Integration
- Award number: 2033792
- Fiscal year: 2020
- Award amount: $550,000
- Award type: Continuing Grant
BD Spokes: SPOKE: NORTHEAST: Collaborative: A Licensing Model and Ecosystem for Data Sharing
- Award number: 1947440
- Fiscal year: 2019
- Award amount: $550,000
- Award type: Standard Grant
III: Medium: Learning-based Synthesis of Data Processing Engines
- Award number: 1900933
- Fiscal year: 2019
- Award amount: $550,000
- Award type: Continuing Grant
III: Medium: Quantifying the Unknown Unknowns for Data Integration
- Award number: 1562657
- Fiscal year: 2016
- Award amount: $550,000
- Award type: Continuing Grant
BD Spokes: SPOKE: NORTHEAST: Collaborative: A Licensing Model and Ecosystem for Data Sharing
- Award number: 1636698
- Fiscal year: 2016
- Award amount: $550,000
- Award type: Standard Grant
Similar overseas grants
III: Small: Query-By-Sketch: Simplifying Video Clip Retrieval Through A Visual Query Paradigm
- Award number: 2335881
- Fiscal year: 2024
- Award amount: $550,000
- Award type: Standard Grant
Beyond Query: Exploratory Subgraph Discovery and Search System
- Award number: DP240101591
- Fiscal year: 2024
- Award amount: $550,000
- Award type: Discovery Projects
CRII: AF: Applications of Spectral Sensitivity to Query and Communication Complexity
- Award number: 2348489
- Fiscal year: 2024
- Award amount: $550,000
- Award type: Standard Grant
Large Language Models for Query Optimisation: A New Paradigm in Database Systems
- Award number: 2726025
- Fiscal year: 2023
- Award amount: $550,000
- Award type: Studentship
Quantum Error Correction in a dual-species Rydberg array (QuERy)
- Award number: EP/X025055/1
- Fiscal year: 2023
- Award amount: $550,000
- Award type: Research Grant
III: Small: RUI: Designing Structure-Phenotype Query-Retrieval and Analysis Systems for Microscopy-Based Whole Organism Studies
- Award number: 2401096
- Fiscal year: 2023
- Award amount: $550,000
- Award type: Standard Grant
Advanced Security and Privacy Techniques for Secure Big Data Query, Sharing and Processing
- Award number: RGPIN-2022-03244
- Fiscal year: 2022
- Award amount: $550,000
- Award type: Discovery Grants Program - Individual
AF: Small: Polynomials, Communication, and Query Complexity
- Award number: 2220232
- Fiscal year: 2022
- Award amount: $550,000
- Award type: Standard Grant
CIVIC-PG Track B: Understanding Native American Tribal Residents Needs through Better Data and Query Systems
- Award number: 2228275
- Fiscal year: 2022
- Award amount: $550,000
- Award type: Standard Grant
