
Head Fusion: A Method to Improve Accuracy and Robustness of Speech Emotion Recognition
Abstract: Speech emotion recognition (SER) refers to the use of machines to recognize the emotions of a speaker from speech. SER is an important part of human-computer interaction (HCI), but many problems remain in SER research, e.g., the lack of high-quality data, insufficient model accuracy, and little research under noisy environments. In this paper, we propose a method called Head Fusion, based on the multi-head attention mechanism, to improve the accuracy of SER. We implement an attention-based convolutional neural network (ACNN) model and conduct experiments on the interactive emotional dyadic motion capture (IEMOCAP) data set. The accuracy is improved to 76.18% (weighted accuracy, WA) and 76.36% (unweighted accuracy, UA). To the best of our knowledge, compared with the state-of-the-art result on this data set (76.4% WA and 70.1% UA), we achieve a UA improvement of about 6% absolute while maintaining a similar WA. Furthermore, we conduct empirical experiments by injecting the speech data with 50 types of common noises. We inject the noises by altering the noise intensity, time-shifting the noises, and mixing different noise types, to identify their varied impacts on SER accuracy and to verify the robustness of our model. This work will also help researchers and engineers augment their training data with speech data containing the appropriate types of noises, thereby alleviating the problem of insufficient high-quality data in SER research.
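The abstract names the building blocks (multi-head attention over an attention-based CNN) but not the internals of Head Fusion itself. As a hedged illustration only, the following toy multi-head self-attention over CNN-style feature maps sketches the general mechanism; all weights, shapes, and the concatenation-based combination of heads are assumptions for this sketch, not the paper's actual Head Fusion design:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads, rng):
    """Toy multi-head self-attention over a (time, features) map.
    Weights are random placeholders; a trained model learns them."""
    t, d = x.shape
    assert d % num_heads == 0
    dh = d // num_heads
    heads = []
    for _ in range(num_heads):
        wq, wk, wv = (rng.standard_normal((d, dh)) / np.sqrt(d) for _ in range(3))
        q, k, v = x @ wq, x @ wk, x @ wv
        a = softmax(q @ k.T / np.sqrt(dh), axis=-1)  # (t, t) attention weights
        heads.append(a @ v)                          # (t, dh) per-head output
    # Combine the per-head outputs into one feature map (here: concatenation,
    # an assumption; the paper's Head Fusion fuses heads differently in detail)
    return np.concatenate(heads, axis=-1)            # (t, d)

rng = np.random.default_rng(0)
feat = rng.standard_normal((5, 8))   # e.g. 5 time frames, 8 CNN channels
out = multi_head_self_attention(feat, num_heads=2, rng=rng)
print(out.shape)  # (5, 8)
```

Each head attends over the whole time axis independently, which is what lets a multi-head model pick up several emotion-relevant regions of an utterance at once.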
Authors: XU Ming-ke (徐鸣珂) and ZHANG Fan (张帆), School of Computer Science and Technology, Nanjing Tech University, Nanjing 211816, China; IBM Watson Group, Littleton, Massachusetts 01460, USA
Source: Computer Science (《计算机科学》, CSCD, Peking University Core Journal), 2022, Issue 7, pp. 132-141 (10 pages)
Keywords: Speech emotion recognition; Attention mechanism; Convolutional neural network; Noisy speech; Speech recognition
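The robustness experiments described in the abstract vary noise intensity, time-shift the noise, and mix noise types. A minimal sketch of SNR-controlled noise injection in that spirit (the function name, parameters, and tiling behavior are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def augment_with_noise(speech, noise, snr_db, shift):
    """Mix a noise clip into speech at a target signal-to-noise ratio,
    after circularly time-shifting the noise by `shift` samples."""
    noise = np.roll(noise, shift)                 # time-shift the noise
    if len(noise) < len(speech):                  # tile short noise clips
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that p_s / (scale^2 * p_n) matches the target SNR
    scale = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(1)
clean = np.sin(np.linspace(0, 40 * np.pi, 16000))  # stand-in 1 s waveform
noise = rng.standard_normal(8000)
noisy = augment_with_noise(clean, noise, snr_db=10, shift=1234)
print(noisy.shape)  # (16000,)
```

Sweeping `snr_db` changes the noise intensity, varying `shift` changes the noise alignment, and summing several scaled noise clips before mixing approximates the mixed-noise-type condition.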