A study on optimization of units for statistical language models

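Basic information

Grant number: 14580403
Principal investigator: YAMAMOTO Mikio
Amount: $25,600
Host institution: University of Tsukuba
Host institution country: Japan
Project category: Grant-in-Aid for Scientific Research (C)
Fiscal year: 2002
Funding country: Japan
Period: 2002 to 2004
Project status: Completed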

Project abstract

In this project, we reconsidered two kinds of "units" that are basic properties of statistical language models. The first is the "token", or dictionary entry, which is the minimal unit of a sentence. Ordinary statistical language models use words or characters as tokens, but for some applications, such as machine translation, longer tokens such as phrases are known to improve system performance. We focused on automatic phrase extraction using a statistical criterion to build dictionaries for machine translation. We proposed a new criterion, minimal mutual information, and showed that the method outperforms previous phrase extraction methods.

The second kind of unit is the "target" that the model assesses. Ordinary statistical language models evaluate sentences as targets, but many language applications have to output text made up of multiple sentences. We proposed a model that evaluates a whole text by using a Dirichlet mixture as the distribution over the parameters of a multinomial distribution; the resulting compound distribution is a Pólya mixture. Our model achieved lower perplexity than other text models such as latent Dirichlet allocation (LDA). Experiments with a speech recognizer on read documents showed that the model effectively corrects many misrecognized words by using information from the whole text.
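
The compound distribution named in the abstract can be illustrated with a small sketch. Integrating a multinomial over a Dirichlet prior gives the Pólya (Dirichlet-multinomial) distribution, and a weighted sum of such components gives a Pólya mixture. The code below is only a minimal illustration under stated assumptions, not the project's implementation: the function names, the toy parameters, and the choice to score a document as a word sequence (so no multinomial coefficient appears) are all hypothetical.

```python
import math


def log_polya(counts, alpha):
    """Log-probability of a word sequence with the given per-word counts
    under a single Polya (Dirichlet-multinomial) component.

    Assumption: the document is scored as an ordered sequence, so the
    multinomial coefficient is omitted.
    """
    n = sum(counts)
    alpha_sum = sum(alpha)
    logp = math.lgamma(alpha_sum) - math.lgamma(alpha_sum + n)
    for y, a in zip(counts, alpha):
        logp += math.lgamma(a + y) - math.lgamma(a)
    return logp


def log_polya_mixture(counts, weights, alphas):
    """Log-probability under a mixture of Polya components.

    weights: mixture weights summing to 1 (hypothetical values).
    alphas: one Dirichlet parameter vector per component.
    """
    terms = [math.log(w) + log_polya(counts, a)
             for w, a in zip(weights, alphas)]
    # Log-sum-exp for numerical stability.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


def perplexity(counts, weights, alphas):
    """Per-word perplexity of one document under the mixture."""
    n = sum(counts)
    return math.exp(-log_polya_mixture(counts, weights, alphas) / n)


if __name__ == "__main__":
    # Toy two-word vocabulary with two hypothetical components.
    weights = [0.5, 0.5]
    alphas = [[2.0, 0.5], [0.5, 2.0]]
    print(perplexity([3, 1], weights, alphas))
```

Because each component updates its effective word probabilities as counts accumulate, a word that has already appeared in the text becomes more likely to appear again, which is one way a whole-text model can carry information across sentence boundaries.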
