
Head Fusion: A Method to Improve Accuracy and Robustness of Speech Emotion Recognition
Abstract: Speech emotion recognition (SER) refers to the use of machines to recognize the emotions of a speaker from speech. SER is an important part of human-computer interaction (HCI), but many problems remain in SER research, e.g., the lack of high-quality data, insufficient model accuracy, and little research under noisy environments. In this paper, we propose a method called Head Fusion, based on the multi-head attention mechanism, to improve the accuracy of SER. We implement an attention-based convolutional neural network (ACNN) model and conduct experiments on the interactive emotional dyadic motion capture (IEMOCAP) data set. The accuracy is improved to 76.18% (weighted accuracy, WA) and 76.36% (unweighted accuracy, UA). To the best of our knowledge, compared with the state-of-the-art result on this data set (76.4% WA and 70.1% UA), we achieve a UA improvement of about 6% absolute while maintaining a similar WA. Furthermore, we conduct empirical experiments by injecting the speech data with 50 types of common noises. We inject the noises by altering the noise intensity, time-shifting the noises, and mixing different noise types, to identify their varied impacts on SER accuracy and to verify the robustness of our model. This work will also help researchers and engineers augment their training data with speech data containing the appropriate types of noises, thereby alleviating the problem of insufficient high-quality data in SER research.
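The abstract names the building blocks (multi-head attention over an attention-based CNN) but not the internals of Head Fusion itself. As a hedged illustration only, the following toy multi-head self-attention over CNN-style feature maps sketches the general mechanism; all weights, shapes, and the concatenation-based combination of heads are assumptions for this sketch, not the paper's actual Head Fusion design:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads, rng):
    """Toy multi-head self-attention over a (time, features) map.
    Weights are random placeholders; a trained model learns them."""
    t, d = x.shape
    assert d % num_heads == 0
    dh = d // num_heads
    heads = []
    for _ in range(num_heads):
        wq, wk, wv = (rng.standard_normal((d, dh)) / np.sqrt(d) for _ in range(3))
        q, k, v = x @ wq, x @ wk, x @ wv
        a = softmax(q @ k.T / np.sqrt(dh), axis=-1)  # (t, t) attention weights
        heads.append(a @ v)                          # (t, dh) per-head output
    # Combine the per-head outputs into one feature map (here: concatenation,
    # an assumption; the paper's Head Fusion fuses heads differently in detail)
    return np.concatenate(heads, axis=-1)            # (t, d)

rng = np.random.default_rng(0)
feat = rng.standard_normal((5, 8))   # e.g. 5 time frames, 8 CNN channels
out = multi_head_self_attention(feat, num_heads=2, rng=rng)
print(out.shape)  # (5, 8)
```

Each head attends over the whole time axis independently, which is what lets a multi-head model pick up several emotion-relevant regions of an utterance at once.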
Authors: XU Ming-ke (徐鸣珂) and ZHANG Fan (张帆), School of Computer Science and Technology, Nanjing Tech University, Nanjing 211816, China; IBM Watson Group, Littleton, Massachusetts 01460, USA
Source: Computer Science (《计算机科学》, CSCD, Peking University Core Journal), 2022, Issue 7, pp. 132-141 (10 pages)
Keywords: Speech emotion recognition; Attention mechanism; Convolutional neural network; Noisy speech; Speech recognition
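The robustness experiments described in the abstract vary noise intensity, time-shift the noise, and mix noise types. A minimal sketch of SNR-controlled noise injection in that spirit (the function name, parameters, and tiling behavior are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def augment_with_noise(speech, noise, snr_db, shift):
    """Mix a noise clip into speech at a target signal-to-noise ratio,
    after circularly time-shifting the noise by `shift` samples."""
    noise = np.roll(noise, shift)                 # time-shift the noise
    if len(noise) < len(speech):                  # tile short noise clips
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that p_s / (scale^2 * p_n) matches the target SNR
    scale = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(1)
clean = np.sin(np.linspace(0, 40 * np.pi, 16000))  # stand-in 1 s waveform
noise = rng.standard_normal(8000)
noisy = augment_with_noise(clean, noise, snr_db=10, shift=1234)
print(noisy.shape)  # (16000,)
```

Sweeping `snr_db` changes the noise intensity, varying `shift` changes the noise alignment, and summing several scaled noise clips before mixing approximates the mixed-noise-type condition.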