
Pattern matching algorithms for massive datasets


Basic information

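Grant number:
EP/F02682X/1
Principal investigator:
R Clifford
Amount:
$354,900
Host institution:
University of Bristol
Host institution country:
United Kingdom
Project category:
Research Grant
Fiscal year:
2008
Funding country:
United Kingdom
Start and end dates:
2008 to (no data)
Project status:
Completed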

Source:
https://gtr.ukri.org/projects?ref=EP%2FF02682X%2F1
Keywords:
Pattern matching algorithms massive datasets

Project abstract

This project aims to provide the tools necessary for pattern matching in massive datasets in the 21st century. Pattern matching problems are pervasive and it is therefore hard to overstate their importance. The hugely successful new field of high-throughput computational genetics, which is the lifeblood of pharmaceutical industries, is founded on the ability to perform approximate string matching accurately and quickly. Perhaps more mundanely but no less significant economically, linear-time exact matching algorithms are now taken for granted as basic tools in every text editor and word processor used today.

Despite the success that pattern matching algorithms continue to enjoy, new problems in urgent need of a solution arise continually. These revolve around data processing applications where the datasets are massive, subject to error or ambiguity, and where processing is required online or in real time. For example, the problem of exact matching has well-known optimal solutions both for online search and when the data to be queried can be indexed beforehand. However, unlike exact matching, the problem of finding the fastest algorithms for approximate matching has still not been resolved under almost any measure of similarity. Where the data is of a non-standard form, for example consisting of numerical rather than symbolic information, even less is currently known about how to search or index the information efficiently.

Another vital difference between the old and new settings is not simply in the quantity of data available but also the ways in which it has to be processed. The public genome sequencing projects, for example, have produced hundreds of gigabytes of sequence and related metadata. However, these datasets are relatively straightforward to handle compared to the processing of information passing through Internet routers and over telephone wires every day or stored in the World Wide Web. In this situation it is not sufficient simply that an algorithm runs fast. Ideally it should also require considerably less space than the input, update at least as quickly as the new data are arriving and still be able to perform complex queries on the whole dataset.

This project will directly address these two separate but interrelated challenges. Real-time and online matching algorithms will be developed to handle situations where vast amounts of data are streaming past at very high rates. The project will also consider new forms of approximation and present fast algorithmic solutions that will allow datasets that result from modern applications and industries to be searched for approximate matches without the need to rely on heuristics. Finally, as part of the work on improved methods for approximate matching, this project will develop faster and smaller indexes that will for the first time allow approximate matching on truly massive datasets to become feasible in practice.
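
To make the contrast between exact and approximate matching concrete, the following is a minimal illustrative sketch in Python. It is not code from the project, and all function names are hypothetical. The first matcher uses the classic Knuth-Morris-Pratt technique to find exact occurrences online, reading one character at a time and keeping state proportional to the pattern length rather than the input. The second is a deliberately naive quadratic-time baseline for approximate matching, assuming Hamming distance (at most k mismatches) as the similarity measure; the abstract does not commit to any particular measure.

```python
from typing import Iterable, Iterator, List


def failure_table(pattern: str) -> List[int]:
    """KMP failure function: fail[i] is the length of the longest proper
    prefix of pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def stream_exact_matches(pattern: str, stream: Iterable[str]) -> Iterator[int]:
    """Yield the start index of every exact occurrence of pattern, reading
    the stream one character at a time with O(len(pattern)) stored state."""
    if not pattern:
        return
    fail = failure_table(pattern)
    k = 0  # number of pattern characters currently matched
    for pos, ch in enumerate(stream):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            yield pos - len(pattern) + 1
            k = fail[k - 1]  # keep going so overlapping matches are found


def k_mismatch_matches(pattern: str, text: str, k: int) -> List[int]:
    """Naive O(n * m) baseline: start indices where pattern aligns with
    text with at most k mismatching characters (Hamming distance)."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(pattern, text[start:start + m]):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment already exceeds the budget
        if mismatches <= k:
            hits.append(start)
    return hits


if __name__ == "__main__":
    print(list(stream_exact_matches("aba", "ababa")))   # [0, 2]
    print(k_mismatch_matches("ACGT", "ACGAACTT", 1))    # [0, 4]
```

The exact matcher does a constant amount of amortised work per incoming character, whereas the naive approximate matcher needs the whole text in memory and time proportional to the product of the text and pattern lengths. That gap is the kind of problem the abstract describes as unresolved for approximate matching.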
