Abstract
Speaker recognition performance is easily degraded by emotional factors. To address this problem, this paper proposes a joint learning method that combines segment-level and frame-level features. A long short-term memory (LSTM) network is trained on the speaker recognition task, and its sequence output is extracted as a segment-level emotional speaker feature, which retains the original information of each speech frame while strengthening the expression of emotional information. A fully connected network then learns the speaker information of every frame within the segment-level feature, which enhances the speaker representation ability of the frame-level feature. Finally, the segment-level and frame-level features are concatenated to form the final speaker embedding, further improving its representational ability. Experiments on the Mandarin emotional speech corpus (MASC) verify the effectiveness of the proposed method and also examine how the number of speech frames contained in a segment-level feature and different emotional states affect emotional speaker recognition.
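The pipeline described in the abstract can be illustrated with a rough NumPy sketch: an LSTM produces one output per frame (taken here as the segment-level feature), a fully connected layer is applied to each of those frames (the frame-level feature), and the two are concatenated. This is not the authors' implementation. The layer sizes, random untrained weights, ReLU activation, per-frame concatenation along the feature axis, and all function names (`init_params`, `lstm_sequence_output`, `frame_level_features`, `joint_embedding`) are assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(feat_dim, hidden, fc_dim, seed=0):
    """Random, untrained weights (hypothetical sizes, for illustration only)."""
    rng = np.random.default_rng(seed)
    return {
        "W": 0.1 * rng.standard_normal((4 * hidden, feat_dim)),  # input weights
        "U": 0.1 * rng.standard_normal((4 * hidden, hidden)),    # recurrent weights
        "b": np.zeros(4 * hidden),
        "W_fc": 0.1 * rng.standard_normal((hidden, fc_dim)),     # fully connected layer
        "b_fc": np.zeros(fc_dim),
    }


def lstm_sequence_output(frames, W, U, b):
    """Run a single-layer LSTM over frames (T, D) and return every hidden state (T, H)."""
    hidden = U.shape[1]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    outputs = []
    for x in frames:
        z = W @ x + U @ h + b            # stacked gate pre-activations, shape (4H,)
        i, f, g, o = np.split(z, 4)      # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)


def frame_level_features(segment, W_fc, b_fc):
    """Apply the same fully connected layer (with ReLU, an assumption) to each frame."""
    return np.maximum(segment @ W_fc + b_fc, 0.0)


def joint_embedding(frames, params):
    """Concatenate segment-level and frame-level features along the feature axis."""
    segment = lstm_sequence_output(frames, params["W"], params["U"], params["b"])
    frame = frame_level_features(segment, params["W_fc"], params["b_fc"])
    return np.concatenate([segment, frame], axis=1)


# Usage: a segment of 20 frames with 13 acoustic coefficients each (made-up sizes).
params = init_params(feat_dim=13, hidden=8, fc_dim=4)
frames = np.random.default_rng(1).standard_normal((20, 13))
embedding = joint_embedding(frames, params)
print(embedding.shape)  # (20, 12)
```

In a trained system the weights would come from optimizing a speaker classification objective, and the number of frames per segment (20 here) is the quantity the abstract says the experiments explore.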
Authors
LIU Jinlin (刘金琳)
LI Dongdong (李冬冬)
WANG Zhe (王喆)
CAI Lizhi (蔡立志)
School of Information Science and Engineering, East China University of Technology, Shanghai 200237, China; Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD
Peking University Core Journals (北大核心)
2023, No. 1, pp. 149-155 (7 pages)
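Funding
National Natural Science Foundation of China (61806078)
National Science and Technology Major Project for Major New Drug Development (2019ZX09210004)
"Shuguang Program" of the Shanghai Education Development Foundation and the Shanghai Municipal Education Commission (61725301)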
Keywords
emotional speaker recognition
long short-term memory
deep neural network