More efficient and accurate automatic speech recognition

Basic information

Grant number:
RGPIN-2018-05226
Principal investigator:
OShaughnessy, Douglas
Amount:
$20,400
Host institution:
Institut national de la recherche scientifique
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2020
Funding country:
Canada
Start and end dates:
2020-01-01 to 2021-12-31
Project status:
Completed

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=713553
Keywords:
More efficient accurate automatic speech

Project abstract

Current Automatic Speech Recognition (ASR) uses stochastic methods that exclude much of what is known about human speech production and perception. For 30 years, ASR has used Hidden Markov Models (HMMs) and now Deep Neural Networks (DNNs). Both are engineering approaches emphasizing recognition accuracy, but tolerating ever-increasing cost (computer memory and processing). Neural networks (NNs) have existed for decades, but applications were mostly limited to 3-level multilayer perceptrons, with limited capacity to handle the wide range of variability in speech (sources, channels, speakers, contexts, environments). Despite much increased use of ASR (e.g., Siri, Alexa), performance is still not near human levels, especially for noisy conditions (e.g., many cases where prior model training is limited). Continuing recent DNN ASR research is unlikely to approach acceptable accuracy in many cases unless major changes are made to the methodology. Early ASR methodology in the 1970s used mostly expert-system (ES) approaches, exploiting ideas of how vocal-tract resonances (called formants) related to the phonemes intended by speakers, and focused on the spectral peaks of speech, as this is how the ear filters speech inside the cochlea. In the early 1980s, HMMs took over the ASR field as they were much better at handling variability than simple “if-then” algorithms. Nonetheless, if one could track significant aspects of resonances reliably in poor acoustic conditions (that human listeners handle well), then useful ASR decisions could be made at far lower cost than with recent end-to-end DNN approaches. It is here proposed to combine structural and stochastic information in ASR, exploiting well-known (but, in ASR, little used) knowledge of how humans communicate through speech. Another major deficiency of ASR is its lack of use of intonation, despite all evidence that intonation facilitates human speech communication.
Human intonation in speech production (which is clearly exploited by human listeners) has long time ranges, making such information very difficult to track in the current systems that rely on either raw speech (at 8000 samples/s or higher) or 10-ms frames of spectral data. The relative success of modern HMM and DNN approaches shows that one can succeed (to a certain level of performance) without using intonation, as many practical speech inputs to ASR are simple and short phrases (and in good-quality environments). Nonetheless, proper use of intonation in ASR would surely raise recognition accuracy, just as including language models (LMs) into ASR in the 1980s did. We will improve the robustness of ASR to common acoustic degradations, increase its efficiency, and exploit intonation. The long-term objective: accurate and efficient ASR, approaching that of human listeners. Short-term objectives: 1) a better spectral measure than filter-bank energies, 2) faster and better adaptation, 3) integration of aspects of intonation.
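
To give a rough sense of the time-scale mismatch described above, the following minimal sketch counts how many raw samples and 10-ms frames a single intonation contour would span. The 8000 samples/s rate and the 10-ms frame come from the abstract; the 2-second contour length and all function names are assumptions chosen purely for illustration, not figures from the project.

```python
from typing import Tuple

# Values quoted in the abstract.
SAMPLE_RATE_HZ = 8000   # raw-speech sampling rate
FRAME_MS = 10           # duration of one spectral frame

# Hypothetical value, assumed only for this illustration.
CONTOUR_SECONDS = 2.0   # duration of one intonation contour


def samples_per_frame(sample_rate_hz: int, frame_ms: int) -> int:
    """Number of raw samples covered by one frame."""
    return sample_rate_hz * frame_ms // 1000


def units_spanned(duration_s: float, sample_rate_hz: int, frame_ms: int) -> Tuple[int, int]:
    """Return (raw samples, frames) covered by a span of the given duration."""
    n_samples = round(duration_s * sample_rate_hz)
    n_frames = round(duration_s * 1000 / frame_ms)
    return n_samples, n_frames


if __name__ == "__main__":
    print("samples per frame:", samples_per_frame(SAMPLE_RATE_HZ, FRAME_MS))
    n_samples, n_frames = units_spanned(CONTOUR_SECONDS, SAMPLE_RATE_HZ, FRAME_MS)
    print("contour spans", n_samples, "samples or", n_frames, "frames")
```

Under these assumed numbers, a system working frame by frame would have to relate information across a couple of hundred consecutive frames to follow one contour, which is one way to read the abstract's remark that intonation is hard to track in current systems.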
