RI: Small: Collaborative Research: Automatic Creation of New Speech Sound Inventories
Basic Information
- Award number: 1910319
- Principal investigator:
- Amount: $259,800
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2019
- Funding country: United States
- Project period: 2019-07-01 to 2023-06-30
- Project status: Completed
- Source:
- Keywords:
Project Summary
Speech technology is supposed to be available to everyone, but in reality it is not. There are 7,000 languages spoken in the world, but speech technology (speech-to-text recognition and text-to-speech synthesis) works in only a few hundred of them. This project will address that problem by automatically figuring out the set of phonemes for each new language, that is, the set of speech sounds that define differences between words (for example, "peek" versus "peck": long-E and short-E are distinct phonemes in English). Phonemes are the link between speaking and writing. A neural net that converts speech into text using some kind of phoneme inventory, and then back again, can be said to have used the correct phoneme inventory if its resynthesized speech always has the same meaning as the speech it started with. This approach can even be tested in languages that have no standard written form, because the text does not have to be real text: it could be chat alphabet (the kind of pseudo-Roman alphabet that speakers of Arabic and Hindi sometimes use on Twitter), or it could even be a picture showing, as an image, what the user was describing. This research will make it possible for people to talk to their artificial intelligence systems (smart speakers, smart phones, smart cars, etc.) in their native languages, and will advance science by providing big-data tools that scientists can use to study languages that do not have a (standard) writing system.

End-to-end neural network methods can be used to develop speech-to-text-to-speech (S2T2S) and other spoken language processing applications with little additional software infrastructure and little background knowledge. In fact, toolkits provide recipes so that a researcher with no prior speech experience can train an end-to-end neural system after only a few hours of data preparation. End-to-end systems are only practical, however, for languages with thousands of hours of transcribed data.
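The round-trip evaluation idea above (speech to phonemes and back, judged by whether meaning survives) can be illustrated with a minimal Python sketch. The functions `encode`, `decode`, and `meaning` are hypothetical stand-ins, not components of the project's actual system:

```python
# Toy illustration of the round-trip test: a phoneme inventory is judged
# "correct" if speech -> phonemes -> speech preserves meaning. All of
# encode, decode, and meaning below are hypothetical stand-ins.

def round_trip_score(utterances, encode, decode, meaning):
    """Fraction of utterances whose resynthesis keeps the same meaning.

    encode:  speech -> phoneme sequence (speech-to-text)
    decode:  phoneme sequence -> speech (text-to-speech)
    meaning: speech -> a semantic label (e.g. an image or intent id)
    """
    preserved = sum(1 for x in utterances
                    if meaning(decode(encode(x))) == meaning(x))
    return preserved / len(utterances)

# Toy demo: "speech" is a string, phonemes are its characters, and
# meaning is the string itself, so the round trip is lossless.
score = round_trip_score(["peek", "peck"], list, "".join, lambda s: s)  # 1.0
```

In the project's setting, `meaning` need not be text at all: as the abstract notes, it could be a chat-alphabet transcription or an image of what the speaker described.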
For under-resourced languages (languages with very little transcribed speech), cross-language adaptation is necessary; for unwritten languages (those lacking any standard, well-known orthographic convention), it is necessary to define a spoken language task that does not require writing before cross-language adaptation can even be attempted. Preliminary evidence suggests that both types of cross-language adaptation are performed more accurately if the system has available, or creates, a phoneme inventory for the under-resourced language, and leverages that inventory to facilitate adaptation. The aim of this project is to automatically infer the acoustic phoneme inventory of under-resourced and unwritten languages in order to maximize the speech technology quality of an end-to-end neural system adapted into those languages. The research team has demonstrated that it is possible to visualize sub-categorical distinctions between sounds as a neural net adapts to a new phoneme category; proposed experiments 1 and 2 leverage visualizations of this type, along with other methods of phoneme inventory validation, to improve cross-language adaptation. Experiments 3 and 4 go one step further by adapting to languages without orthography; before a speech technology system can be trained and used in a language without orthography, it must first learn a useful phoneme inventory.
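One way a phoneme inventory can mediate cross-language adaptation is by mapping each phoneme of the under-resourced target language to its closest counterpart in a well-resourced source language. A toy sketch, using shared articulatory features as the similarity measure; the inventories and feature sets below are illustrative assumptions, not real language data:

```python
# Toy sketch: seed cross-language adaptation by mapping each phoneme of
# an under-resourced language onto the source-language phoneme with
# which it shares the most articulatory features. The feature sets are
# illustrative, not a real inventory.

def closest_phoneme(target_feats, source_inventory):
    """Return the source phoneme sharing the most features with target."""
    return max(source_inventory,
               key=lambda p: len(target_feats & source_inventory[p]))

source = {
    'p': {'labial', 'stop', 'voiceless'},
    't': {'alveolar', 'stop', 'voiceless'},
    'd': {'alveolar', 'stop', 'voiced'},
}

# A hypothetical retroflex voiced stop maps to the voiced stop 'd',
# which shares two features ({'stop', 'voiced'}) with it.
mapped = closest_phoneme({'retroflex', 'stop', 'voiced'}, source)  # 'd'
```

Such a mapping lets an adapted system reuse source-language acoustic models for the target phonemes it best matches, rather than starting from scratch.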
Innovations unique to this project include: (1) the use of articulatory feature transcription as a multi-task training criterion for an end-to-end neural system that seeks to learn the phoneme set of a new language; (2) the use of visualization error rate as a training criterion in multi-task learning -- this criterion is based on a recently developed method for visualizing the adaptation of phoneme categories in a neural network; (3) the application of cross-language adaptation to improve the error rates of image2speech applications in a language without orthography; (4) the use of non-standard orthography (chat alphabet) to transcribe speech in an unwritten language; and (5) the use of non-native transcription (mismatched crowdsourcing) to jump-start the speech2chat training task. The methods proposed here will facilitate the scientific study of language, for example by helping phoneticians document the phoneme inventories of undocumented languages, thereby expediting the study of currently undocumented endangered languages before they disappear. Conversely, in minority languages with active but shrinking native-speaker populations, the planned methods will help develop end-to-end neural training methods with which native speakers can easily develop new speech applications. All planned software will be packaged as recipes for the Speech Recognition Virtual Kitchen, permitting high school students and undergraduates with no speech expertise to develop systems for their own languages, and encouraging their interest in speech.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
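Innovation (1), articulatory feature transcription as an auxiliary multi-task criterion, can be sketched as follows. The feature table and the loss weighting are illustrative assumptions, not the project's actual configuration:

```python
# Sketch of articulatory features as an auxiliary multi-task target:
# the main task predicts phonemes, and an auxiliary task predicts the
# articulatory features those phonemes imply. ARTIC and alpha=0.3 are
# illustrative assumptions.

ARTIC = {  # toy mapping from phoneme to articulatory features
    'p': {'labial', 'stop', 'voiceless'},
    'b': {'labial', 'stop', 'voiced'},
    'i': {'front', 'high', 'vowel'},
}

def feature_targets(phonemes):
    """Derive articulatory-feature targets from a phoneme transcription."""
    return [ARTIC[p] for p in phonemes]

def multitask_loss(phoneme_loss, feature_loss, alpha=0.3):
    """Weighted multi-task objective: main phoneme task plus the
    auxiliary articulatory-feature task."""
    return (1 - alpha) * phoneme_loss + alpha * feature_loss
```

Because articulatory features are shared across languages even when phoneme inventories differ, the auxiliary task gives the network a language-independent signal to anchor cross-language adaptation.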
Project Outcomes
Journal articles (14)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
A DNN-HMM-DNN Hybrid Model for Discovering Word-Like Units from Spoken Captions and Image Regions
- DOI: 10.21437/interspeech.2020-1148
- Publication date: 2020
- Journal:
- Impact factor: 0
- Authors: Wang, Liming; Hasegawa-Johnson, Mark
- Corresponding author: Hasegawa-Johnson, Mark
That Sounds Familiar: An Analysis of Phonetic Representations Transfer Across Languages
- DOI: 10.21437/interspeech.2020-2513
- Publication date: 2020
- Journal:
- Impact factor: 0
- Authors: Żelasko, Piotr; Moro-Velázquez, Laureano; Hasegawa-Johnson, Mark; Scharenborg, Odette; Dehak, Najim
- Corresponding author: Dehak, Najim
Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval
- DOI: 10.1109/icassp39728.2021.9414418
- Publication date: 2021-06
- Journal:
- Impact factor: 0
- Authors: Liming Wang; Xinsheng Wang; M. Hasegawa-Johnson; O. Scharenborg; N. Dehak
- Corresponding author: Liming Wang; Xinsheng Wang; M. Hasegawa-Johnson; O. Scharenborg; N. Dehak
Cross-lingual articulatory feature information transfer for speech recognition using recurrent progressive neural networks
- DOI: 10.21437/interspeech.2022-11202
- Publication date: 2022
- Journal:
- Impact factor: 0
- Authors: Morshed, Mahir; Hasegawa-Johnson, Mark
- Corresponding author: Hasegawa-Johnson, Mark
Training Spoken Language Understanding Systems with Non-Parallel Speech and Text
- DOI: 10.1109/icassp40776.2020.9054664
- Publication date: 2020-05
- Journal:
- Impact factor: 0
- Authors: Leda Sari; Samuel Thomas; M. Hasegawa-Johnson
- Corresponding author: Leda Sari; Samuel Thomas; M. Hasegawa-Johnson
Other grants by Mark Hasegawa-Johnson
FAI: A New Paradigm for the Evaluation and Training of Inclusive Automatic Speech Recognition
- Award number: 2147350
- Fiscal year: 2022
- Funding amount: $259,800
- Project type: Standard Grant
EAGER: Matching Non-Native Transcribers to the Distinctive Features of the Language Transcribed
- Award number: 1550145
- Fiscal year: 2015
- Funding amount: $259,800
- Project type: Standard Grant
FODAVA-Partner: Visualizing Audio for Anomaly Detection
- Award number: 0807329
- Fiscal year: 2008
- Funding amount: $259,800
- Project type: Continuing Grant
RI Medium: Audio Diarization - Towards Comprehensive Description of Audio Events
- Award number: 0803219
- Fiscal year: 2008
- Funding amount: $259,800
- Project type: Standard Grant
Audiovisual Distinctive-Feature-Based Recognition of Dysarthric Speech
- Award number: 0534106
- Fiscal year: 2005
- Funding amount: $259,800
- Project type: Continuing Grant
Prosodic, Intonational, and Voice Quality Correlates of Disfluency
- Award number: 0414117
- Fiscal year: 2004
- Funding amount: $259,800
- Project type: Continuing Grant
CAREER: Landmark-Based Speech Recognition in Music and Speech Backgrounds
- Award number: 0132900
- Fiscal year: 2002
- Funding amount: $259,800
- Project type: Continuing Grant
Similar NSFC grants
Forensic application of circadian small RNAs in inferring the time of bloodstain formation
- Award number:
- Award year: 2024
- Funding amount: ¥0
- Project type: Provincial or municipal project
Mechanism by which tRNA-derived small RNAs upregulate the YBX1/CCL5 pathway in bortezomib-induced chronic pain
- Award number: n/a
- Award year: 2022
- Funding amount: ¥100,000
- Project type: Provincial or municipal project
Response and molecular mechanism of small RNA regulation of type I-F CRISPR-Cas adaptive immunity
- Award number: 32000033
- Award year: 2020
- Funding amount: ¥240,000
- Project type: Young Scientists Fund
Mechanism by which small RNAs regulate the biocontrol function of Bacillus amyloliquefaciens FZB42
- Award number: 31972324
- Award year: 2019
- Funding amount: ¥580,000
- Project type: General Program
Mechanism by which Streptococcus mutans small RNAs link LuxS quorum sensing to biofilm formation
- Award number: 81900988
- Award year: 2019
- Funding amount: ¥210,000
- Project type: Young Scientists Fund
Small RNA sequencing analysis of the molecular mechanism of pigeon milk secretion
- Award number: 31802058
- Award year: 2018
- Funding amount: ¥260,000
- Project type: Young Scientists Fund
Function and mechanism of key gut-bacterial small RNAs in the onset and progression of Crohn's disease
- Award number: 31870821
- Award year: 2018
- Funding amount: ¥560,000
- Project type: General Program
Pathogenic mechanism of rice grassy stunt virus mediated by small RNA-directed DNA methylation
- Award number: 31772128
- Award year: 2017
- Funding amount: ¥600,000
- Project type: General Program
Small RNA-seq study of the immunoregulatory mechanism of acupuncture treatment for Hashimoto's thyroiditis
- Award number: 81704176
- Award year: 2017
- Funding amount: ¥200,000
- Project type: Young Scientists Fund
Regulation of small RNA biosynthesis by rice OsSGS3 and OsHEN1 and its modulation of disease resistance
- Award number: 91640114
- Award year: 2016
- Funding amount: ¥850,000
- Project type: Major Research Plan
Similar overseas grants
Collaborative Research: RI: Small: Foundations of Few-Round Active Learning
- Award number: 2313131
- Fiscal year: 2023
- Funding amount: $259,800
- Project type: Standard Grant
Collaborative Research: RI: Small: Motion Fields Understanding for Enhanced Long-Range Imaging
- Award number: 2232298
- Fiscal year: 2023
- Funding amount: $259,800
- Project type: Standard Grant
Collaborative Research: RI: Small: Deep Constrained Learning for Power Systems
- Award number: 2345528
- Fiscal year: 2023
- Funding amount: $259,800
- Project type: Standard Grant
Collaborative Research: RI: Small: End-to-end Learning of Fair and Explainable Schedules for Court Systems
- Award number: 2232055
- Fiscal year: 2023
- Funding amount: $259,800
- Project type: Standard Grant
Collaborative Research: RI: Small: End-to-end Learning of Fair and Explainable Schedules for Court Systems
- Award number: 2232054
- Fiscal year: 2023
- Funding amount: $259,800
- Project type: Standard Grant
Collaborative Research: RI: Small: Motion Fields Understanding for Enhanced Long-Range Imaging
- Award number: 2232300
- Fiscal year: 2023
- Funding amount: $259,800
- Project type: Standard Grant
Collaborative Research: RI: Small: Motion Fields Understanding for Enhanced Long-Range Imaging
- Award number: 2232299
- Fiscal year: 2023
- Funding amount: $259,800
- Project type: Standard Grant
Collaborative Research: RI: Small: End-to-end Learning of Fair and Explainable Schedules for Court Systems
- Award number: 2334936
- Fiscal year: 2023
- Funding amount: $259,800
- Project type: Standard Grant
Collaborative Research: RI: Small: Foundations of Few-Round Active Learning
- Award number: 2313130
- Fiscal year: 2023
- Funding amount: $259,800
- Project type: Standard Grant
RI: Small: Collaborative Research: Evolutionary Approach to Optimal Morphology and Control of Transformable Soft Robots
- Award number: 2325491
- Fiscal year: 2023
- Funding amount: $259,800
- Project type: Standard Grant