NSF-BSF: Collaborative Research: RI: Small: Multilingual Language Generation via Understanding of Code Switching

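Basic information

Award number:
2203097
Principal investigator:
Yulia Tsvetkov
Amount:
$345,600
Host institution:
University of Washington
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2021
Funding country:
United States
Award period:
2021-10-01 to 2024-12-31
Project status:
Completed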

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=2203097&HistoricalAwards=false
Keywords:
NSF BSF Collaborative Research RI

Project abstract

Human language technology has recently matured to the extent that computational systems can generally interact with users in ways that are natural to humans, not just to machines. However, most people in the world today are multilingual, and current approaches to language technology do not reflect the reality that multilingual communication is ubiquitous; that is, current technology can interact naturally with monolingual speakers, but not with multilingual ones. Computational systems should be able to generate language that sounds equally natural to these users, and this includes being able to accommodate nonnative speakers. This project first creates a large-scale, broad coverage dataset, reflecting conversations between humans and an automatic system that is sophisticated enough to generate fluent multilingual (i.e. 'code-switched') utterances, but is simple enough for controlled experiments. The dataset is far larger than ones that are currently available, and is based on a much more detailed understanding of language-switching strategies. Second, this dataset is used to develop new methods to incorporate code-switching into contemporary deep-learning language generation, including dialogue systems, question answering, assistive technologies, summarization and machine translation. This innovation should benefit a dramatic number of multilingual computer users, including less privileged users who are currently required to interact with machines in a language they do not speak fluently. Successful completion of the research program will pave the way for the development of natural language technologies that are more accommodating to such users, building bridges over the digital divide.

The overarching goal of this project is to develop multilingual and contextualized language generation technologies that are more controllable and more adaptable to multilingual users. The project achieves this goal by completing the following objectives. (1) It develops psycholinguistically-grounded, scalable approaches to collecting corpora for studying how multilingual speakers adapt to each other's linguistic choices in text conversations. These methodologies are employed to collect large-scale, rich datasets of multilingual human-machine conversations. These datasets, as well as additional corpora of human code-switched interactions, should shed new light on the theoretical understanding of cross-lingual usage patterns, allowing for better understanding of how people employ code-switching in written language. (2) It uses the linguistic insights obtained through this endeavor to define classifiers that predict code-switching. (3) Novel approaches are developed for efficient, large-vocabulary neural language generation that incorporate these classifiers, allowing generation systems to introduce code-switching in a way that sounds natural to multilingual users. Consequently, this project should dramatically advance our understanding of code-switching, especially in the relatively unexplored territory of written dialogue. In addition, its contributions benefit a broad range of applications that rely on language generation, including dialogue systems, question answering, assistive technologies, summarization and machine translation.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project outputs

Journal papers (14)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey

DOI:
10.48550/arxiv.2210.07700
Publication date:
2022-10
Journal:
Impact factor:
0
Authors:
Sachin Kumar;Vidhisha Balachandran;Lucille Njoo;Antonios Anastasopoulos;Yulia Tsvetkov
Corresponding authors:
Sachin Kumar;Vidhisha Balachandran;Lucille Njoo;Antonios Anastasopoulos;Yulia Tsvetkov

LEXPLAIN: Improving Model Explanations via Lexicon Supervision

DOI:
10.18653/v1/2023.starsem-1.19
Publication date:
2023
Journal:
Impact factor:
0
Authors:
Orevaoghene Ahia;Hila Gonen;Vidhisha Balachandran;Yulia Tsvetkov;Noah A. Smith
Corresponding authors:
Orevaoghene Ahia;Hila Gonen;Vidhisha Balachandran;Yulia Tsvetkov;Noah A. Smith

Machine Translation into Low-resource Language Varieties