Funding: Supported by the German National BMBF IKT2020-Grant (16SV7213) (EmotAsS), the European Union's Horizon 2020 Research and Innovation Programme (688835) (DE-ENIGMA), and the China Scholarship Council (CSC).
Abstract: Spectrogram representations of acoustic scenes have achieved competitive performance for acoustic scene classification. Yet the spectrogram alone does not capture a substantial amount of time-frequency information. In this study, we present an approach for exploring the benefits of deep scalogram representations, extracted in segments from an audio stream. The approach first transforms the segmented acoustic scenes into bump and Morse scalograms, as well as spectrograms; second, the spectrograms or scalograms are fed into pre-trained convolutional neural networks; third, the features extracted from a subsequent fully connected layer are fed into (bidirectional) gated recurrent neural networks, which are followed by a single highway layer and a softmax layer; finally, predictions from these three systems are fused by a margin sampling value strategy. We then evaluate the proposed approach using the acoustic scene classification data set of the 2017 IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). On the evaluation set, an accuracy of 64.0% is obtained from bidirectional gated recurrent neural networks when fusing the spectrogram and the bump scalogram, which is an improvement on the 61.0% baseline result provided by the DCASE 2017 organisers. This result shows that the extracted bump scalograms are capable of improving the classification accuracy when fused with a spectrogram-based system.
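The fusion step can be illustrated with a small sketch. It assumes that the margin sampling value of a system is the gap between its highest and second-highest class posterior, and that the fused decision is taken from the system with the largest gap; the abstract does not spell out the rule, so this is an assumption. The function names and the toy probabilities are hypothetical and do not come from the paper.

```python
def margin_sampling_value(probs):
    """Gap between the highest and second-highest class probability (assumed definition)."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


def fuse_by_margin(system_probs):
    """Return the class index predicted by the system with the largest margin."""
    most_confident = max(system_probs, key=margin_sampling_value)
    return most_confident.index(max(most_confident))


# Toy posteriors over three scene classes for one audio segment (made-up numbers).
spectrogram_probs = [0.40, 0.35, 0.25]  # small margin: the system is unsure
bump_probs = [0.10, 0.80, 0.10]         # large margin: the system is confident

fused_class = fuse_by_margin([spectrogram_probs, bump_probs])
print(fused_class)
```

In this toy case the bump scalogram system has the larger margin, so its prediction is the one that is kept.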
Funding: The National Natural Science Foundation of China (62071330); the National Science Fund for Distinguished Young Scholars (61425017); the Key Program of the National Natural Science Foundation (61831022); the Key Program of the Natural Science Foundation of Tianjin (18JCZDJC36300); the Open Projects Program of the National Laboratory of Pattern Recognition; the Senior Visiting Scholar Program of Tianjin Normal University; and the Innovative Medicines Initiative 2 Joint Undertaking (115902), which receives support from the European Union's Horizon 2020 research and innovation program and EFPIA.
Abstract: Background: A crucial element of human-machine interaction, the automatic detection of emotional states from human speech has long been regarded as a challenging task for machine learning models. One vital challenge in speech emotion recognition (SER) is learning robust and discriminative representations from speech. Although machine learning methods have been widely applied in SER research, the inadequate amount of available annotated data has become a bottleneck impeding the extended application of such techniques (e.g., deep neural networks). To address this issue, we present a deep learning method that combines knowledge transfer and self-attention for SER tasks. Herein, we apply the log-Mel spectrogram with deltas and delta-deltas as inputs. Moreover, given that emotions are time dependent, we apply temporal convolutional neural networks to model the variations in emotions. We further introduce an attention transfer mechanism, which is based on a self-attention algorithm, to learn long-term dependencies. The self-attention transfer network (SATN) in our proposed approach takes advantage of attention transfer to learn attention from speech recognition and then transfers this knowledge into SER. An evaluation on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset demonstrates the effectiveness of the proposed model.
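As a rough illustration of the input representation, the sketch below appends deltas and delta-deltas to a toy log-Mel matrix. It assumes the common regression-based delta formula over a window of two frames on each side, with edge frames repeated, and it concatenates the three parts per frame; the abstract does not specify the delta formula, the window width, or how the three parts are arranged, so all of these are assumptions. The function names and the toy input are hypothetical.

```python
def deltas(frames, width=2):
    """Regression-based deltas over +/- `width` frames; edge frames are repeated."""
    num_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    result = []
    for t in range(num_frames):
        row = []
        for band in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, num_frames - 1)][band]
                behind = frames[max(t - n, 0)][band]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        result.append(row)
    return result


def stack_with_deltas(log_mel):
    """Concatenate static, delta, and delta-delta values for every frame."""
    first_order = deltas(log_mel)
    second_order = deltas(first_order)
    return [s + d1 + d2 for s, d1, d2 in zip(log_mel, first_order, second_order)]


# Toy "log-Mel" input: 6 frames x 2 bands, each band a linear ramp over time.
log_mel = [[float(t), 2.0 * t] for t in range(6)]
features = stack_with_deltas(log_mel)
print(len(features), len(features[0]))
```

For the interior frames of this toy input the deltas equal the slope of each ramp (1.0 and 2.0), which is a quick sanity check on the formula.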
Funding: Supported by the European Community's Seventh Framework Programme (No. 338164) (ERC Starting Grant iHEARu).
Abstract: In this contribution, we present iHEARu-PLAY, an online, multi-player platform for crowdsourced database collection and labelling, including the voice analysis application (VoiLA), a free web-based speech classification tool designed to educate iHEARu-PLAY users about state-of-the-art speech analysis paradigms. In addition, via this associated speech analysis web interface, VoiLA encourages users to take an active role in improving the service by providing labelled speech data. The platform allows users to record and upload voice samples directly from their browser; the samples are then analysed in a state-of-the-art classification pipeline. A set of pre-trained models is employed, targeting a range of speaker states and traits such as gender, valence, arousal, dominance, and 24 different discrete emotions. The analysis results are visualised so that they are easily interpretable by laymen, giving users unique insights into how their voice sounds. We assess the effectiveness of iHEARu-PLAY and its integrated VoiLA feature via a series of user evaluations, which indicate that it is fun and easy to use and that it provides accurate and informative results.
Funding: The European Union's Horizon 2020 Programme under Grant Agreement (826506, sustAGE).
Abstract: Background: Although frustration is a common emotional reaction while playing games, an excessive level of frustration can negatively impact a user's experience, discouraging them from further game interactions. The automatic detection of frustration can enable the development of adaptive systems that tailor a game to a user's specific needs through real-time difficulty adjustment, thereby optimizing the player's experience and guaranteeing game success. To this end, we present a speech-based approach for the automatic detection of frustration during game interactions, a specific task that remains underexplored in research. Method: The experiments were performed on the Multimodal Game Frustration Database (MGFD), an audiovisual dataset, collected within the Wizard-of-Oz framework, that is specially tailored to investigate verbal and facial expressions of frustration during game interactions. We explored the performance of a variety of acoustic feature sets, including Mel-Spectrograms, Mel Frequency Cepstral Coefficients (MFCCs), and the low-dimensional knowledge-based acoustic feature set eGeMAPS. Because convolutional neural networks (CNNs) have brought continual improvements in speech recognition tasks, in the present work we consider typical CNNs, including ResNet, VGG, and AlexNet, unlike the MGFD baseline, which is based on the Long Short-Term Memory (LSTM) architecture and a Support Vector Machine (SVM) classifier. Furthermore, given the unresolved debate on the suitability of shallow and deep networks, we also examine the performance of two of the latest deep CNNs: WideResNet and EfficientNet. Results: Our best result, achieved with WideResNet and Mel-Spectrogram features, increases the system performance from 58.8% unweighted average recall (UAR) to 93.1% UAR for speech-based automatic frustration recognition.
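For readers unfamiliar with the metric reported above, UAR is the mean of the per-class recalls, so every class counts equally regardless of how many samples it has. The following is a minimal sketch with made-up labels; the class names and predictions are hypothetical and are not taken from MGFD.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class contributes equally."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == label]
        hits = sum(1 for i in indices if y_pred[i] == label)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)


# Imbalanced toy data: 2 frustrated samples, 8 non-frustrated samples.
y_true = ["frustrated"] * 2 + ["not_frustrated"] * 8
y_pred = ["frustrated", "not_frustrated"] + ["not_frustrated"] * 8

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(uar, accuracy)
```

Here the accuracy is 0.9 while the UAR is only 0.75, because half of the minority class is missed; this is why UAR is often preferred for imbalanced emotion data.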