Transparent Compression for General-Purpose Programming Languages

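Basic Information

Grant number:
EP/W001012/1
Principal investigator:
Holger Pirk
Amount:
$373,800
Host institution:
Imperial College London
Host institution country:
United Kingdom
Project category:
Research Grant
Fiscal year:
2022
Funding country:
United Kingdom
Start and end dates:
2022 to (no data)
Project status:
Not yet concluded

Source:
https://gtr.ukri.org/projects?ref=EP%2FW001012%2F1
Keywords: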
Transparent Compression General Purpose Programming

Project Abstract

Lossless compression is a key optimisation technique when storing, transmitting or processing datasets of any size and kind. Compression allows managing datasets larger than the primary storage medium and reduces the bandwidth needed for data access in applications ranging from physics simulations to machine learning models to sensor data management. Some compression schemes allow computation to be performed directly on compressed data, often reducing computational complexity compared to uncompressed data; such schemes are commonly called "lightweight". For example, summing values in a Run-Length-Encoded (RLE) array requires effort in the order of the size of the compressed input, i.e., usually significantly less than the uncompressed input. Similarly, relational selections and joins can be prefiltered on the dictionary of a dictionary-compressed relation or the minimum value in a frame-of-reference encoded column.

Implementing algorithms to work directly on compressed data is challenging: the computation has to be tightly integrated with the (de-)compression code, and inefficiencies in the implementation can easily outweigh benefits in data transfer or processing performance. Consequently, most lightweight compression schemes are bespoke, i.e., developed and tuned for specific, well-understood domains such as relational databases, image processing or linear algebra. While the state of the art is to implement them manually, virtually all lightweight compression schemes can be expressed as sequences of primitive transformations such as Run-Length-Encoding (RLE), dictionary compression or Huffman coding. Examples of such "compression pipelines" are PFOR-delta (in the Vector DBMS), Vertipaq (in Microsoft SQL Server) or the Compressed-Sparse-Row matrix representation (in many linear algebra packages).

However, there are three fundamental problems with the state of the art. First, there is no principled way to assemble these pipelines. Second, these schemes are tied to a specific data/processing model (relational algebra, linear algebra, etc.). Third, and most importantly, the implementation effort is high, as every application needs to implement compression from scratch. Unsurprisingly, many applications that could benefit from compression shy away from that implementation effort: in particular for domain scientists writing code in languages like Python or R, low implementation effort takes precedence over efficiency.

Our vision is to make the benefits of performance-oriented compression available to applications beyond the mentioned few. For that purpose, we will develop an algebraic framework for the representation and optimisation of bespoke compression schemes in general-purpose programming languages. Instead of "weaving" hundreds of lines of compression-related code into an application's logic, developers will express compression schemes as annotations on collections. The backend transparently transforms code that operates on the annotated collection to take advantage of the compression strategy. This allows even non-experts to implement bespoke compression schemes. Simplifying the interface even further, we will implement a fully automated approach that determines the most appropriate compression scheme for a program, dataset and hardware platform using cost-based optimisation, rather than requiring it to be specified explicitly.
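
As an illustration of the "lightweight" property mentioned in the abstract, the following minimal Python sketch shows a sum computed directly on a run-length-encoded array and a selection prefiltered on the dictionary of a dictionary-encoded column. It is not taken from the project: the function names (rle_encode, rle_sum, dict_encode, dict_select) and the sample data are hypothetical and chosen only for this example.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]


def rle_sum(runs):
    """Sum the original values with one multiplication per run.

    The work is proportional to the number of runs (the compressed size),
    not to the number of original elements.
    """
    return sum(v * n for v, n in runs)


def dict_encode(values):
    """Dictionary-encode a sequence.

    Returns the sorted list of distinct values and, for each element,
    the index (code) of its value in that list.
    """
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def dict_select(dictionary, codes, predicate):
    """Return the positions whose value satisfies the predicate.

    The predicate is evaluated once per distinct value (on the dictionary);
    the column itself is then filtered by comparing small integer codes.
    """
    matching = {i for i, v in enumerate(dictionary) if predicate(v)}
    return [pos for pos, code in enumerate(codes) if code in matching]


if __name__ == "__main__":
    data = [7, 7, 7, 7, 3, 3, 9, 9, 9]
    runs = rle_encode(data)
    print(runs)           # [(7, 4), (3, 2), (9, 3)]
    print(rle_sum(runs))  # 61, the same as sum(data)

    cities = ["London", "Paris", "London", "Berlin", "Paris", "London"]
    dictionary, codes = dict_encode(cities)
    print(dict_select(dictionary, codes, lambda c: c.startswith("L")))  # [0, 2, 5]
```

In this sketch the sum touches three runs instead of nine elements, and the string predicate runs three times (once per distinct city) instead of six. The framework described in the abstract aims to derive such compressed-domain code automatically from annotations; here it is written by hand purely for illustration.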
