Overcoming Data Sparsity in Machine Translation

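Basic information

Grant number:
RGPIN-2017-05875
Principal investigator:
Kondrak, Grzegorz
Amount:
$16,800
Host institution:
University of Alberta
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2018
Funding country:
Canada
Duration:
2018-01-01 to 2019-12-31
Project status:
Completed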

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=655843
Keywords:
Overcoming Data Sparsity Machine Translation

Project abstract

Canada is a multicultural society. A large percentage of Canadian residents report a mother tongue that is distinct from either English or French. In addition, Canada is home to a rich variety of indigenous languages, some of which have also been granted official status. Everyone has the right to get all official federal government services, publications and documents in both English and French. Important information for new Canadians is often provided in multiple languages and scripts. Increasing the availability of texts in aboriginal languages increases their prestige, and thus helps preserve them.

As a consequence, there exists an acute need for accurate and rapid translations, not only between English and French, but also into other languages. Human translation is slow and expensive, and requires highly skilled experts. Computer translation programs, known as machine translation, have the potential to fill the gap. Unfortunately, the current technology is far from perfect. The quality of translations involving smaller languages is often poor, and even between major languages, it is sometimes inadequate for technical applications.

Two of the reasons for the low quality of machine translation are the scarcity of bilingual texts for low-resourced languages, and the prevalence of infrequent words, such as certain verb inflections in French. The dominant statistical machine translation approach, which is used in web programs such as Google Translate, struggles to properly translate words that occur only rarely in bilingual texts.

The objective of this proposal is to improve the quality of machine translation by improving the handling of infrequent words. The principal research directions are the incorporation of state-of-the-art morphological techniques into the translation process, the development of lexicon induction methods, and the translation of out-of-vocabulary words based on cutting-edge algorithms for cognate identification, name transliteration, and decipherment.

In the current global economy, the enormous demand for fast and freely available translations can only be satisfied by machine translation programs. The solutions that I outline in my proposal will not only improve the quality of machine translation, but also influence the research on other aspects of natural language processing, thus accelerating the progress towards the goal of making computers understand human language.
