Abstract: In order to increase the accuracy of emotion recognition in voice and video, a hybrid of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) is used to encode and integrate the two information sources. For the audio signal, several frequency bands as well as energy functions are extracted as low-level features using a sophisticated audio technique, and these are then encoded with a one-dimensional (1D) convolutional neural network to abstract high-level features. Finally, these are fed into a recurrent neural network to capture dynamic tone changes along the temporal dimension. Correspondingly, a two-dimensional (2D) convolutional neural network and a similar RNN are used to capture dynamic facial appearance changes over temporal sequences. The method was evaluated on the Chinese Natural Audio-Visual Emotion Database from the Chinese Conference on Pattern Recognition (CCPR) 2016. Experimental results demonstrate that the average classification precision of the proposed method is 41.15%, an increase of 16.62% over the baseline algorithm provided by CCPR 2016, showing that the proposed method identifies emotional information more accurately.
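The abstract describes a 1D-CNN-plus-RNN pipeline for the audio branch. Below is a minimal PyTorch sketch of that idea, not the paper's actual implementation: the number of low-level features per frame (40), the layer sizes, and the six emotion classes are illustrative assumptions, and the video branch would mirror this structure with 2D convolutions over face frames.

```python
# Sketch of the audio branch: a 1D CNN over per-frame low-level features
# (frequency bands and energy functions), followed by an RNN that models
# dynamic tone changes along the temporal dimension.
import torch
import torch.nn as nn

class AudioEmotionNet(nn.Module):
    def __init__(self, n_features=40, n_classes=6):  # assumed sizes
        super().__init__()
        # 1D convolutions abstract high-level features from the low-level descriptors.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # The RNN (a GRU here) summarizes the temporal dynamics of the clip.
        self.rnn = nn.GRU(input_size=128, hidden_size=128, batch_first=True)
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, x):
        # x: (batch, n_features, time)
        h = self.cnn(x)            # (batch, 128, time // 4)
        h = h.transpose(1, 2)      # (batch, time // 4, 128) for the RNN
        _, last = self.rnn(h)      # final hidden state summarizes the sequence
        return self.classifier(last.squeeze(0))

# Example: a batch of 8 clips, 40 features per frame, 200 frames each.
logits = AudioEmotionNet()(torch.randn(8, 40, 200))
print(logits.shape)  # torch.Size([8, 6])
```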