Pattern matching algorithms for massive datasets
海量数据集的模式匹配算法
基本信息
- 批准号:EP/F02682X/1
- 负责人:
- 金额:$ 35.49万
- 依托单位:
- 依托单位国家:英国
- 项目类别:Research Grant
- 财政年份:2008
- 资助国家:英国
- 起止时间:2008 至 无数据
- 项目状态:已结题
- 来源:
- 关键词:
项目摘要
This project aims to provide the tools necessary for pattern matchingin massive datasets in the 21st century. Pattern matching problemsare pervasive and it is therefore hard to overstate theirimportance. The hugely successful new field of high throughputcomputational genetics, which is the lifeblood of pharmaceuticalindustries, is founded on the ability to perform approximate stringmatching accurately and quickly. Perhaps more mundanely but no lesssignificant economically, linear time exact matching algorithms arenow taken for granted as basic tools in every text editor and wordprocessor used today.Despite the success that pattern matching algorithms continue toenjoy, new problems in urgent need of a solution arise continually.These revolve around data processing applications where the datasetsare massive, subject to error or ambiguity and where processing isrequired online or in real-time. For example, the problem of exactmatching has well known optimal solutions both for online search andwhen the data to be queried can be indexed beforehand. However,unlike exact matching, the problem of finding the fastest algorithmsfor approximate matching has still not been resolved under almost anymeasure of similarity. Where the data is of a non-standard form, forexample consisting of numerical rather than symbolic information, evenless is currently known about how to search or index the informationefficiently.Another vital difference between the old and new settings is notsimply in the quantity of data available but also the ways in which ithas to be processed. The public genome sequencing projects, forexample, have produced 100s of gigabytes of sequence and related metadata. However these datasets are relatively straightforward to handlecompared to the processing of information passing through Internetrouters and over telephone wires every day or stored in the World WideWeb. In this situation it is not sufficient simply that an algorithmsruns fast. Ideally it should also require considerably less space thanthe input, update at least as quickly as the new data are arriving andstill be able to perform complex queries on the whole dataset.This project will directly address these two separate but interrelatedchallenges. Real-time and online matching algorithms will be developedto handle situations where vast amounts of data are streaming past atvery high rates. The project will also consider new forms ofapproximation and present fast algorithmic solutions that will allowdatasets that result from modern applications and industries to besearched for approximate matches without the need to rely onheuristics. Finally as part of the work on improved methods forapproximate matching, this project will develop faster and smallerindexes that will for the first time allow approximate matching ontruly massive datasets to become feasible in practice.
该项目旨在为世纪海量数据集的模式匹配提供必要的工具。模式匹配问题是普遍存在的,因此很难夸大其重要性。计算遗传学是制药业的生命线,它是一个非常成功的高通量新领域,它建立在准确快速地进行近似字符串匹配的能力之上。也许更普通,但同样重要的经济,线性时间精确匹配算法现在被认为是理所当然的基本工具,在每一个文本编辑器和文字处理器今天使用。尽管成功的模式匹配算法继续享受,新的问题,迫切需要一个解决方案不断出现。这些围绕数据处理应用程序,其中的数据库是巨大的,容易出现错误或模糊,并且需要在线或实时处理。例如,精确匹配问题对于在线搜索和当要查询的数据可以被预先索引时都有众所周知的最优解。然而,与精确匹配不同,找到近似匹配的最快算法的问题在几乎任何相似性度量下都还没有得到解决。在数据是非标准形式的情况下,例如由数字而不是符号信息组成的情况下,目前甚至不知道如何有效地搜索或索引这些信息。新旧环境之间的另一个重要区别不仅在于可用数据的数量,还在于处理数据的方式。例如,公共基因组测序计划已经产生了数百千兆字节的序列和相关的元数据。然而,这些数据集是相对简单的处理,通过因特网和电话线每天传递或存储在万维网上的信息的处理.在这种情况下,仅仅一个算法运行快是不够的.理想情况下,它还应该需要比输入少得多的空间,更新至少与新数据到达一样快,并且仍然能够对整个数据集执行复杂的查询。这个项目将直接解决这两个独立但相互关联的挑战。实时和在线匹配算法将被开发出来,以处理大量数据以非常高的速率流过的情况。该项目还将考虑新的近似形式,并提出快速算法解决方案,允许从现代应用和行业中产生的数据集在不需要依赖于统计学的情况下寻找近似匹配。最后,作为改进近似匹配方法工作的一部分,该项目将开发更快、更小的索引,这将首次使真正大规模数据集的近似匹配在实践中变得可行。
项目成果
期刊论文数量(5)
专著数量(0)
科研奖励数量(0)
会议论文数量(0)
专利数量(0)
Automata, Languages and Programming
自动机、语言和编程
- DOI:10.1007/978-3-540-70583-3_9
- 发表时间:2008
- 期刊:
- 影响因子:0
- 作者:Berger M
- 通讯作者:Berger M
From Coding Theory to Efficient Pattern Matching
从编码理论到高效模式匹配
- DOI:
- 发表时间:2009
- 期刊:
- 影响因子:0
- 作者:Raphael Clifford
- 通讯作者:Raphael Clifford
{{
item.title }}
{{ item.translation_title }}
- DOI:
{{ item.doi }} - 发表时间:
{{ item.publish_year }} - 期刊:
- 影响因子:{{ item.factor }}
- 作者:
{{ item.authors }} - 通讯作者:
{{ item.author }}
数据更新时间:{{ journalArticles.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ monograph.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ sciAawards.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ conferencePapers.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ patent.updateTime }}
R Clifford其他文献
Dental nurse registration with GDC
牙科护士在英国牙科委员会注册
- DOI:
10.1038/sj.bdj.4810237 - 发表时间:
2003-06-14 - 期刊:
- 影响因子:2.300
- 作者:
R Clifford - 通讯作者:
R Clifford
R Clifford的其他文献
{{
item.title }}
{{ item.translation_title }}
- DOI:
{{ item.doi }} - 发表时间:
{{ item.publish_year }} - 期刊:
- 影响因子:{{ item.factor }}
- 作者:
{{ item.authors }} - 通讯作者:
{{ item.author }}
{{ truncateString('R Clifford', 18)}}的其他基金
Dynamic pattern matching: Faster Algorithms and New Bounds
动态模式匹配:更快的算法和新界限
- 批准号:
EP/J011940/1 - 财政年份:2012
- 资助金额:
$ 35.49万 - 项目类别:
Research Grant
相似国自然基金
超高速正则表达式匹配技术研究
- 批准号:61073184
- 批准年份:2010
- 资助金额:12.0 万元
- 项目类别:面上项目
相似海外基金
Comprehensive Evaluation of Algorithms for Indeterminate Pattern-Matching
不确定模式匹配算法的综合评价
- 批准号:
569128-2022 - 财政年份:2022
- 资助金额:
$ 35.49万 - 项目类别:
Postgraduate Scholarships - Doctoral
Development of optimal time-space algorithms on pattern matching problems
模式匹配问题的最优时空算法的发展
- 批准号:
19K20208 - 财政年份:2019
- 资助金额:
$ 35.49万 - 项目类别:
Grant-in-Aid for Early-Career Scientists
Pattern matching algorithms for streaming data
流数据的模式匹配算法
- 批准号:
EP/H028056/2 - 财政年份:2013
- 资助金额:
$ 35.49万 - 项目类别:
Fellowship
Dynamic pattern matching: Faster Algorithms and New Bounds
动态模式匹配:更快的算法和新界限
- 批准号:
EP/J011940/1 - 财政年份:2012
- 资助金额:
$ 35.49万 - 项目类别:
Research Grant
Fast parameterized pattern matching algorithms based on data compression
基于数据压缩的快速参数化模式匹配算法
- 批准号:
23700022 - 财政年份:2011
- 资助金额:
$ 35.49万 - 项目类别:
Grant-in-Aid for Young Scientists (B)
Pattern matching algorithms for streaming data
流数据的模式匹配算法
- 批准号:
EP/H028056/1 - 财政年份:2011
- 资助金额:
$ 35.49万 - 项目类别:
Fellowship
Design of fast tree pattern matching algorithms using bit-parallelism on strings
利用字符串位并行性的快速树模式匹配算法的设计
- 批准号:
21500010 - 财政年份:2009
- 资助金额:
$ 35.49万 - 项目类别:
Grant-in-Aid for Scientific Research (C)
Development of efficient machine discovery system based on data compression and pattern matching
基于数据压缩和模式匹配的高效机器发现系统的开发
- 批准号:
15300049 - 财政年份:2003
- 资助金额:
$ 35.49万 - 项目类别:
Grant-in-Aid for Scientific Research (B)
Development of Intelligent full text retrieval system based on data compression and fast string pattern matching algorithms
基于数据压缩和快速字符串模式匹配算法的智能全文检索系统开发
- 批准号:
13558029 - 财政年份:2001
- 资助金额:
$ 35.49万 - 项目类别:
Grant-in-Aid for Scientific Research (B)
Development of Intelligent Full-text Search System using Efficient Pattern Matching Algorithms on Compressed Data
利用压缩数据的高效模式匹配算法开发智能全文搜索系统
- 批准号:
10558047 - 财政年份:1998
- 资助金额:
$ 35.49万 - 项目类别:
Grant-in-Aid for Scientific Research (B).