
Development of Efficient Data Mining Systems for Large Semi-Structured Text Data


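Basic Information

Grant number:
11558040
Principal investigator:
ARIMURA Hiroki
Amount:
$62,700
Host institution:
Kyushu University
Host institution country:
Japan
Project category:
Grant-in-Aid for Scientific Research (B)
Fiscal year:
1999
Funding country:
Japan
Project period:
1999 to 2001
Project status:
Completed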

Project Summary

The goal of this research project is to devise an efficient semi-automatic tool that supports human discovery from large unstructured and semi-structured text data. To achieve this goal, we pursued the following three directions.

1. The central process of text mining is pattern discovery. We employed the framework of optimized pattern discovery and developed efficient and robust text mining algorithms that find simple combinatorial patterns in large unstructured texts. To implement these algorithms, we developed a text index structure, based on suffix arrays, that is suitable for text mining. Based on these technologies, we implemented a prototype system and ran computer experiments on Web data.

2. Another important technology for text is efficient pattern matching. As a theoretical foundation, we proposed a unified framework, called Collage system, for describing various dictionary-based compression methods. Using this framework, we developed both Knuth-Morris-Pratt type and Boyer-Moore type pattern matching algorithms. We also applied the framework to the Byte-Pair-Encoding compression method and to Sequitur; the former yields the fastest compressed pattern matching algorithm.

3. The final process of text mining is information extraction. From a theoretical point of view, we first formalized the information extraction problem for semi-structured data and then gave a theoretical analysis of the power and the limitations of such tasks. We then developed efficient information extraction algorithms for various types of extraction rules, including tree wrappers and hedge patterns, and evaluated them through experiments on real-life semi-structured data from the Internet.
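
To give a concrete sense of the suffix-array index mentioned in direction 1, here is a minimal Python sketch. It is not the project's implementation, which the summary does not describe in detail. The function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive sort-based construction is an assumption made for brevity; a production index would use a faster construction algorithm.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in lexicographic order.

    Naive construction for illustration only: sorting whole suffixes is
    far slower than the specialised algorithms a real index would use.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return all start positions of pattern in text, in increasing order.

    Suffixes that begin with the pattern form one contiguous block in the
    suffix array, so two binary searches locate the block boundaries.
    """
    m = len(pattern)

    def boundary(include_equal):
        # First index whose length-m prefix is >= pattern (include_equal=False)
        # or > pattern (include_equal=True).
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            start = suffix_array[mid]
            prefix = text[start:start + m]
            if prefix < pattern or (include_equal and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    first = boundary(False)
    last = boundary(True)
    return sorted(suffix_array[first:last])


if __name__ == "__main__":
    sample = "abracadabra"
    sa = build_suffix_array(sample)
    print(find_occurrences(sample, sa, "abra"))
```

Because every occurrence of a pattern corresponds to a suffix that starts with it, the size of the matching block also gives the pattern's frequency directly, which is the kind of count a pattern discovery algorithm needs to evaluate candidate patterns.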
