Study of Class-based Language Model and its Application to Japanese Morphological Analysis

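Basic Information

Grant number:
10680383
Principal investigator:
KITA Kenji
Amount:
$15,400
Host institution:
The University of Tokushima
Host institution country:
Japan
Project category:
Grant-in-Aid for Scientific Research (C)
Fiscal year:
1998
Funding country:
Japan
Project period:
1998 to 1999
Project status:
Completed

Source:
https://kaken.nii.ac.jp/grant/KAKENHI-PROJECT-10680383/
Keywords: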
natural language processing; Japanese language processing; morphological analysis; word segmentation; probabilistic language model; PPM* model; character class; clustering; PPM model

Project Abstract

Morphological analysis is the most fundamental process in Japanese language processing. Word segmentation is a central problem in Japanese morphological analysis because the writing system does not mark word boundaries.

In this research project, we first studied a word segmentation model based on a character-based n-gram model, which serves as our baseline method. Next, we applied the PPM* compression algorithm to the word segmentation problem. PPM (Prediction by Partial Matching) is a lossless compression algorithm based on finite-context probabilistic modeling, and PPM* is a variant of PPM that places no a priori bound on context length.

We then studied a word segmentation method based on a character class model. The character class model is more robust than a character-based model because it has fewer parameters. Character clustering is evaluated by the entropy measured on a corpus different from the one used for model estimation, and the search is based on a greedy algorithm. As a result, the clustering method yields an optimum character classification without requiring the number of classes to be specified in advance. In experiments on the ADD (ATR Dialogue Database) corpus, the proposed Japanese word segmenter using the character class model achieved higher accuracy than the character-based model. In particular, the proposed method using a variable-length n-gram class model achieved 96.38% recall and 96.23% precision on open text.
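The clustering idea in the abstract (greedy search, scored by entropy on a corpus separate from the one used for estimation, with no preset number of classes) can be illustrated with a small Python sketch. This is not the project's implementation. The class bigram factorization, the add-one smoothing, the pairwise-merge search, and the names `heldout_entropy` and `greedy_cluster` are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def heldout_entropy(train, heldout, cls):
    """Per-character cross-entropy (bits) of `heldout` under a class bigram
    model estimated on `train`. Add-one smoothing is an assumption here."""
    n_classes = len(set(cls.values()))
    members = Counter(cls.values())   # number of characters in each class
    trans = Counter()                 # (previous class, class) counts
    prev_tot = Counter()              # counts of each class as a predecessor
    emit = Counter()                  # character counts
    cls_tot = Counter()               # class counts
    for text in train:
        for a, b in zip(text, text[1:]):
            trans[cls[a], cls[b]] += 1
            prev_tot[cls[a]] += 1
        for ch in text:
            emit[ch] += 1
            cls_tot[cls[ch]] += 1
    bits, n = 0.0, 0
    for text in heldout:
        for a, b in zip(text, text[1:]):
            ca, cb = cls[a], cls[b]
            # P(b | a) = P(class(b) | class(a)) * P(b | class(b))
            p_class = (trans[ca, cb] + 1) / (prev_tot[ca] + n_classes)
            p_char = (emit[b] + 1) / (cls_tot[cb] + members[cb])
            bits -= math.log2(p_class * p_char)
            n += 1
    return bits / n


def greedy_cluster(train, heldout):
    """Start with one class per character and repeatedly apply the single
    merge that lowers held-out entropy the most. Stop when no merge helps,
    so the number of classes is never given in advance."""
    chars = sorted(set("".join(train + heldout)))
    cls = {ch: i for i, ch in enumerate(chars)}
    best = heldout_entropy(train, heldout, cls)
    while True:
        best_merge = None
        for x, y in combinations(sorted(set(cls.values())), 2):
            # Hypothetically merge class y into class x and re-score.
            trial = {ch: (x if c == y else c) for ch, c in cls.items()}
            h = heldout_entropy(train, heldout, trial)
            if h < best:
                best, best_merge = h, trial
        if best_merge is None:
            return cls, best
        cls = best_merge
```

Calling `greedy_cluster(train, heldout)` with two lists of strings returns a character-to-class mapping and its held-out entropy. The sketch re-estimates the model for every candidate merge, which is only practical for toy data.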

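The abstract reports recall and precision without stating how they were counted. A common convention for word segmentation, assumed here, is to count a predicted word as correct only when both of its boundaries match a reference word. The following Python sketch uses hypothetical helper names `word_spans` and `segmentation_scores` to show that calculation.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for w in words:
        end = start + len(w)
        spans.add((start, end))
        start = end
    return spans


def segmentation_scores(reference, predicted):
    """Return (recall, precision) at the word level.

    A predicted word counts as correct only if the reference contains a
    word with exactly the same start and end positions (an assumption
    about the metric, not a detail given in the abstract).
    """
    if "".join(reference) != "".join(predicted):
        raise ValueError("both segmentations must cover the same text")
    ref = word_spans(reference)
    pred = word_spans(predicted)
    correct = len(ref & pred)
    return correct / len(ref), correct / len(pred)
```

For example, with the reference `["今日", "は", "晴れ", "です"]` and the prediction `["今日", "は", "晴", "れ", "です"]`, three predicted words match, giving a recall of 3/4 and a precision of 3/5.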