Transparent Compression for General-Purpose Programming Languages
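Basic information
- Grant number: EP/W001012/1
- Principal investigator:
- Amount: $373,800
- Host institution:
- Host institution country: United Kingdom
- Project category: Research Grant
- Fiscal year: 2022
- Funding country: United Kingdom
- Duration: 2022 to (no data)
- Project status: Ongoing
- Source:
- Keywords:
Project abstract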
Lossless compression is a key optimisation technique when storing, transmitting or processing datasets of any size and kind. Compression allows managing datasets larger than the primary storage medium and reduces the bandwidth needed for data access in applications ranging from physics simulations to machine learning models to sensor data management. Some compression schemes allow computation to be performed directly on compressed data, often reducing computational complexity compared to uncompressed data; such schemes are commonly called "lightweight". For example, summing values in a Run-Length-Encoded (RLE) array requires effort on the order of the size of the compressed input, which is usually significantly smaller than the uncompressed input. Similarly, relational selections and joins can be prefiltered on the dictionary of a dictionary-compressed relation or on the minimum value of a frame-of-reference encoded column.
Implementing algorithms to work directly on compressed data is challenging: the computation has to be tightly integrated with the (de-)compression code, and inefficiencies in the implementation can easily outweigh the benefits in data transfer or processing performance. Consequently, most lightweight compression schemes are bespoke, i.e., developed and tuned for specific, well-understood domains such as relational databases, image processing or linear algebra. While the state of the art is to implement them manually, virtually all lightweight compression schemes can be expressed as sequences of primitive transformations such as Run-Length Encoding (RLE), dictionary compression or Huffman coding. Examples of such "compression pipelines" are PFOR-delta (in the Vector DBMS), Vertipaq (in Microsoft SQL Server) and the Compressed-Sparse-Row matrix representation (in many linear algebra packages).
However, there are three fundamental problems with the state of the art. First, there is no principled way to assemble these pipelines. Second, these schemes are tied to a specific data/processing model (relational algebra, linear algebra, etc.). Third, and most importantly, the implementation effort is high because every application needs to implement compression from scratch. Unsurprisingly, many applications that could benefit from compression shy away from that implementation effort: for domain scientists writing code in languages like Python or R in particular, low implementation effort takes precedence over efficiency.
Our vision is to make the benefits of performance-oriented compression available to applications beyond the few mentioned above. For that purpose, we will develop an algebraic framework for the representation and optimisation of bespoke compression schemes in general-purpose programming languages. Instead of "weaving" hundreds of lines of compression-related code into an application's logic, developers will express compression schemes as annotations on collections. The backend transparently transforms code that operates on the annotated collection to take advantage of the compression strategy. This allows even non-experts to implement bespoke compression schemes. Simplifying the interface even further, we will implement a fully automated approach that determines the most appropriate compression scheme for a program, dataset and hardware platform using cost-based optimisation, rather than requiring it to be specified explicitly.
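The two lightweight techniques named in the abstract, summing over an RLE array and prefiltering a selection on a dictionary, can be illustrated with a minimal Python sketch. This is not the project's framework or API: the function names (`rle_encode`, `rle_sum`, `dict_encode`, `dict_select_equal`) and the sample data are hypothetical and chosen only for illustration.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]


def rle_sum(runs):
    """Sum the original values by visiting each run once.

    The cost is proportional to the number of runs (the compressed size),
    not to the number of original elements.
    """
    return sum(v * n for v, n in runs)


def dict_encode(values):
    """Dictionary-encode a sequence: return (sorted dictionary, codes)."""
    dictionary = sorted(set(values))
    code_of = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [code_of[v] for v in values]


def dict_select_equal(dictionary, codes, needle):
    """Return positions whose value equals `needle`.

    The dictionary acts as a prefilter: if `needle` is not in it, the
    (much longer) code array is never scanned.
    """
    if needle not in dictionary:
        return []
    code = dictionary.index(needle)
    return [i for i, c in enumerate(codes) if c == code]


# Hypothetical sample data, for illustration only.
data = [7, 7, 7, 7, 2, 2, 9, 9, 9]

runs = rle_encode(data)            # [(7, 4), (2, 2), (9, 3)]
total = rle_sum(runs)              # 59, computed from 3 runs instead of 9 values

dictionary, codes = dict_encode(data)
hits = dict_select_equal(dictionary, codes, 2)    # [4, 5]
misses = dict_select_equal(dictionary, codes, 5)  # [] without scanning codes
```

Even in this toy form, the compression logic is interleaved with the computation (`rle_sum` must know the run layout), which is the hand-written "weaving" the abstract proposes to replace with annotations on collections.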
Project outcomes
Journal articles (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
BOSS - An Architecture for Database Kernel Composition
- DOI: 10.14778/3636218.3636239
- Published: 2024
- Journal:
- Impact factor: 2.5
- Authors: Mohr-Daurat H
- Corresponding author: Mohr-Daurat H
Other publications by Holger Pirk
Waste Not, Want Not! Managing relational data in asymmetric memories
- DOI:
- Published: 2015-05
- Journal:
- Impact factor: 0
- Authors: Holger Pirk
- Corresponding author: Holger Pirk
Similar overseas grants
EAGER: IMPRESS-U: Exploratory Research on Generative Compression for Compressive Lidar
- Grant number: 2404740
- Fiscal year: 2024
- Funding amount: $373,800
- Project category: Standard Grant
An Adsorption-Compression Cold Thermal Energy Storage System (ACCESS)
- Grant number: EP/W027593/2
- Fiscal year: 2024
- Funding amount: $373,800
- Project category: Research Grant
CAREER: Coding Subspaces: Error Correction, Compression and Applications
- Grant number: 2415440
- Fiscal year: 2024
- Funding amount: $373,800
- Project category: Continuing Grant
RII Track-4: NSF: Scalable MPI with Adaptive Compression for GPU-based Computing Systems
- Grant number: 2327266
- Fiscal year: 2024
- Funding amount: $373,800
- Project category: Standard Grant
Near Lossless Dense Light Field Compression Using Generalized Neural Radiance Field
- Grant number: 24K20797
- Fiscal year: 2024
- Funding amount: $373,800
- Project category: Grant-in-Aid for Early-Career Scientists
Research hot cell with active air compression and cool down system
- Grant number: 518282844
- Fiscal year: 2023
- Funding amount: $373,800
- Project category: Major Research Instrumentation
Prediction of Sticking Potential for Continuous Direct Compression
- Grant number: 2891745
- Fiscal year: 2023
- Funding amount: $373,800
- Project category: Studentship
CSR: Small: CONCERT: Designing Scalable Communication Runtimes with On-the-fly Compression for HPC and AI Applications on Heterogeneous Architectures
- Grant number: 2312927
- Fiscal year: 2023
- Funding amount: $373,800
- Project category: Standard Grant
Collaborative Research: OAC Core: Topology-Aware Data Compression for Scientific Analysis and Visualization
- Grant number: 2313124
- Fiscal year: 2023
- Funding amount: $373,800
- Project category: Standard Grant
Collaborative Research: Frameworks: FZ: A fine-tunable cyberinfrastructure framework to streamline specialized lossy compression development
- Grant number: 2311878
- Fiscal year: 2023
- Funding amount: $373,800
- Project category: Standard Grant