Abstract
To fully exploit information from different representation subspaces, a multi-head attention-based long short-term memory (LSTM) model is proposed in this study for speech emotion recognition (SER). The proposed model uses frame-level features and takes the temporal information of emotional speech as the input to the LSTM layer. A multi-head time-dimension attention (MHTA) layer is employed to linearly project the output of the LSTM layer into different subspaces, producing reduced-dimension context vectors. To provide relatively vital information from other dimensions, the output of MHTA, the output of feature-dimension attention, and the last time-step output of the LSTM are combined to form multiple context vectors as the input of the fully connected layer. To improve the performance of these multiple vectors, feature-dimension attention is applied to the all-time-step output of the first LSTM layer. The proposed model was evaluated on the eNTERFACE and GEMEP corpora. The results indicate that the proposed model outperforms LSTM by 14.6% and 10.5% on eNTERFACE and GEMEP, respectively, proving the effectiveness of the proposed model in SER tasks.
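The multi-head time-dimension attention described above can be sketched in NumPy: each head projects the LSTM hidden states into its own subspace, scores every time step, and pools the sequence into one context vector per head. The additive scoring function, the toy shapes, and the simple concatenation with the last time-step output are illustrative assumptions, not the authors' exact formulation (feature-dimension attention is omitted for brevity).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_time_attention(H, n_heads, params):
    """Attention over the time dimension, one subspace per head.

    H: (T, d) sequence of LSTM hidden states.
    params[i] = (W, v): W (d, d_h) projects into the head's subspace,
    v (d_h,) scores each projected time step (additive attention,
    an assumed scoring choice).
    Returns the concatenated per-head context vectors, shape (n_heads * d_h,).
    """
    contexts = []
    for W, v in params:
        P = H @ W                    # project into head subspace, (T, d_h)
        alpha = softmax(np.tanh(P) @ v)  # attention weights over time, (T,)
        contexts.append(alpha @ P)   # weighted sum over time, (d_h,)
    return np.concatenate(contexts)

# Toy run: T = 20 frames, d = 64 LSTM units, 4 heads of dimension 16.
T, d, n_heads = 20, 64, 4
d_h = d // n_heads
H = rng.standard_normal((T, d))
params = [(rng.standard_normal((d, d_h)) * 0.1,
           rng.standard_normal(d_h) * 0.1)
          for _ in range(n_heads)]

mhta_out = multi_head_time_attention(H, n_heads, params)
# Fuse with the last time-step output, as the abstract describes,
# to form the input of the fully connected layer.
fused = np.concatenate([mhta_out, H[-1]])
print(mhta_out.shape, fused.shape)
```

Because every head works in a 16-dimensional subspace, concatenating the four context vectors restores the 64-dimensional width before fusion with the last time-step output.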
To address the insufficient use of information from different representation subspaces in speech emotion recognition, a two-layer long short-term memory (LSTM) model with multi-head attention is proposed to fully mine effective emotional information. The model takes frame-level features carrying temporal emotional information as input and uses the LSTM module to learn temporal features. A feature-attention module and a multi-head time-dimension attention module are designed, and the layer-wise outputs of the LSTM module, the output of the feature-attention module, and the output of the multi-head time-dimension attention module are fused. The results show that, compared with the traditional LSTM model, the proposed method improves recognition accuracy by 14.6% and 10.5% on the eNTERFACE and GEMEP datasets, respectively, demonstrating its effectiveness in speech emotion recognition tasks.
Funding
The National Natural Science Foundation of China (Nos. 61571106, 61633013, 61673108, 81871444).