Abstract
Speech Emotion Recognition (SER) is an important and challenging task in human-computer interaction systems. To address the issues of single-feature representation and weak feature interaction in current SER systems, a Multi-Input Interactive Attention Network (MIAN) is proposed. The network consists of two sub-networks: a specific-feature coordinate residual attention network and a shared-feature multi-head attention network. The former uses Res2Net and coordinate attention modules to learn specific features extracted from raw speech and to generate multi-scale feature representations, enhancing the model's ability to represent emotion-related information. The latter fuses the features obtained by the forward network into shared features, which are fed through a Bidirectional Long Short-Term Memory (BiLSTM) network into a multi-head attention module; this allows the model to attend simultaneously to relevant information in different feature subspaces, strengthening the interaction among features and capturing highly discriminative features. The collaboration of the two sub-networks increases the diversity of features and improves their interaction capability. During training, a dual-loss function is applied for joint supervision, making samples of the same class more compact and samples of different classes more separated. Experimental results show that MIAN achieves weighted average accuracies of 91.43% on the EMO-DB corpus and 76.33% on the IEMOCAP corpus, exhibiting better classification performance than other state-of-the-art models.
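The shared-feature branch described above applies multi-head self-attention to a sequence of BiLSTM outputs so that each head attends to a different feature subspace. A minimal numpy sketch of that attention step follows; the random projection matrices, the sequence length of 120 frames, and the feature width of 256 are illustrative assumptions standing in for the paper's learned parameters, not values taken from MIAN.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads, rng):
    """Scaled dot-product multi-head self-attention over a feature
    sequence x of shape (seq_len, d_model), e.g. BiLSTM outputs.
    Random projections stand in for learned weights (assumption).
    Returns an array of the same shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_k = d_model // num_heads
    Wq, Wk, Wv, Wo = [rng.standard_normal((d_model, d_model)) * d_model ** -0.5
                      for _ in range(4)]
    # Project, then split the model dimension into per-head subspaces:
    # (seq_len, d_model) -> (num_heads, seq_len, d_k).
    q = (x @ Wq).reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    # Each head attends over the full sequence within its own subspace.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_k), axis=-1)
    # Concatenate heads back to (seq_len, d_model) and mix with Wo.
    out = (attn @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo

rng = np.random.default_rng(0)
features = rng.standard_normal((120, 256))  # 120 frames of shared features
out = multi_head_self_attention(features, num_heads=8, rng=rng)
print(out.shape)  # (120, 256)
```

Splitting `d_model` into `num_heads` subspaces before attention is what lets different heads focus on different portions of the shared feature, which is the interaction effect the abstract attributes to this module.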
Authors
GAO Pengqi; HUANG Heming; FAN Yonghong (College of Computer, Qinghai Normal University, Xining, Qinghai 810008, China; The State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining, Qinghai 810008, China)
Source
Journal of Computer Applications (《计算机应用》)
Indexed in CSCD and the Peking University Core Journal list
2024, No. 8, pp. 2400-2406 (7 pages)
Funding
National Natural Science Foundation of China (620660039)
Natural Science Foundation of Qinghai Province (2022-ZJ-925)
Programme of Introducing Talents of Discipline to Universities ("111 Project") (D20035)
Keywords
Speech Emotion Recognition(SER)
coordinate attention mechanism
multi-head attention mechanism
specific feature learning
shared feature learning