
Improvement of topic-based language models using Dirichlet mixtures and their applications


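Basic information

Grant number: 17500105
Principal investigator: YAMAMOTO Mikio
Amount: about $23,700
Host institution: University of Tsukuba
Host institution country: Japan
Project category: Grant-in-Aid for Scientific Research (C)
Fiscal year: 2005
Funding country: Japan
Project period: 2005 to 2006
Project status: Completed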

Project summary

To improve statistical language models, we enhanced the predictive power of n-gram models, which are the typical language models, using topic or context information. We proposed new estimation methods for Dirichlet mixtures and evaluated the model in two applications: speech recognition and statistical machine translation.

1. We developed a robust estimation method for Dirichlet mixture language models using hierarchical Bayesian models. To approximate the integrals that appear in Bayesian inference, we used the reversing-EM and variational approximation. In experiments on various text data, we showed that the estimation method achieves the lowest perplexity.

2. Our model was integrated into speech recognition systems and evaluated by recognition rate. Two integration methods were developed: (1) modifying the probabilities of trigram models using unigram rescaling, and (2) optimizing at the document level using the document likelihood computed by our model. Comparing Latent Dirichlet Allocation (LDA) with our model, we showed that the speech recognition rate of the system with our model is higher than that of the system with LDA.

3. We proposed cross-language Dirichlet mixture models, which were integrated into phrase-based statistical machine translation systems. Using this model, the system can select contextually or topically correct Japanese words from the candidates when translating an English input document. Experiments on newspaper article translation showed that topic models were effective in lowering perplexity.
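The summary names two mechanisms without giving formulas: a Dirichlet mixture used as a topic-sensitive unigram model, and unigram rescaling of trigram probabilities. The Python sketch below illustrates one common textbook form of each; it is not the project's implementation. The function names, the toy parameters, and the exponent `beta` are all hypothetical, and the parameters are simply given here rather than estimated with the reversing-EM or variational methods mentioned above.

```python
import math

def log_polya(counts, alpha):
    """Log Dirichlet-multinomial (Polya) likelihood of word counts under one
    mixture component, omitting the multinomial coefficient (it is the same
    for every component and cancels in the posterior)."""
    a0, n = sum(alpha), sum(counts)
    out = math.lgamma(a0) - math.lgamma(a0 + n)
    for c, a in zip(counts, alpha):
        out += math.lgamma(a + c) - math.lgamma(a)
    return out

def dm_predictive(counts, weights, alphas):
    """Predictive unigram distribution P(w | history counts) under a
    Dirichlet mixture with mixing weights `weights` and per-component
    Dirichlet parameters `alphas` (assumed given, not estimated here)."""
    logs = [math.log(lam) + log_polya(counts, a)
            for lam, a in zip(weights, alphas)]
    top = max(logs)                                # stabilise the exponentials
    post = [math.exp(v - top) for v in logs]
    z = sum(post)
    post = [p / z for p in post]                   # posterior over components
    n = sum(counts)
    return [
        sum(p * (a[w] + counts[w]) / (sum(a) + n)
            for p, a in zip(post, alphas))
        for w in range(len(counts))
    ]

def unigram_rescale(trigram, topic, unigram, beta=0.5):
    """Unigram rescaling: P(w | h, doc) is proportional to
    P_trigram(w | h) * (P_topic(w | doc) / P_unigram(w)) ** beta,
    renormalised over the vocabulary. `beta` is an assumed tuning knob."""
    scores = [t * (d / u) ** beta for t, d, u in zip(trigram, topic, unigram)]
    z = sum(scores)
    return [s / z for s in scores]

# Toy usage with a 3-word vocabulary and 2 mixture components (made-up numbers).
weights = [0.5, 0.5]
alphas = [[5.0, 1.0, 1.0], [1.0, 5.0, 1.0]]
history = [3, 0, 0]                                # word 0 seen three times so far
topic = dm_predictive(history, weights, alphas)
background = dm_predictive([0, 0, 0], weights, alphas)
adapted = unigram_rescale([0.2, 0.5, 0.3], topic, background)
```

In this toy setting the history shifts the posterior toward the first component, so the adapted distribution gives word 0 more probability than the unadapted trigram does. Using the empty-history prediction as the background unigram is a simplification made only for the example.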
