Emotion detection from voice recordings

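Basic information

Grant number:
RGPIN-2016-06628
Principal investigator:
Cardinal, Patrick
Amount:
$16,000
Host institution:
École de technologie supérieure
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2022
Funding country:
Canada
Start and end dates:
2022-01-01 to 2023-12-31
Project status:
Completed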

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=754228
Keywords:
Emotion detection voice recordings

Project abstract

The speech signal carries several levels of information. From a voice recording, we can extract the words spoken, the speaker's identity, the spoken language, or even the speaker's emotional state.

In the last five years, interest in emotion recognition based on different human modalities, such as speech and heart rate, has grown considerably. A robust emotion detection system could be very useful in several areas, such as medicine and telecommunications. In the medical field, emotion detection can be considered an important tool for diagnosing and following patients suffering from depression. Identifying a person's emotional state from their voice opens new perspectives for the development of an automated dialogue system, one capable of communicating with patients at home daily, or even several times a day, to produce a report for the physician.

Although much research has been done, the performance of current emotion recognition systems is still not adequate for these real-life applications. The majority of emotion detection systems have been designed by focussing on speech modeling rather than on feature extraction. Usually, feature vectors combine classical cepstral features (such as MFCC) with some prosodic characteristics (such as speech intonation). The modest results reported in the literature suggest that we should focus on improving the quality of the features used for training emotion identification systems instead of on the modeling aspect. For this reason, our research will focus on extracting more accurate features from the speech signal.

Three main axes will be established to reach this objective. The first axis consists of developing a deep neural network (DNN) architecture capable of learning the classification function directly from the raw speech signal. The second axis consists of developing another DNN architecture capable of learning a normalization function, so that different databases made of speech signals recorded under different conditions can be mixed. Finally, in the third axis, we will concentrate on high-level features such as phoneme durations and other evidence extracted from the words said by the speaker.

Improvements made during this research will be very practical for building reliable healthcare applications. Indeed, such applications would help health professionals improve the quality of treatments in Canada. Moreover, our findings could be applied to several other fields of speech technology. The approach presented would facilitate the development of command and control applications without the use of a complex, time- and power-consuming speech recognition engine. This would help start-ups and small businesses in Canada add voice-based control to their applications without needing to hire a speech recognition specialist.
