期刊文献+
共找到28篇文章
< 1 2 >
每页显示 20 50 100
EMOTIONAL SPEECH RECOGNITION BASED ON SVM WITH GMM SUPERVECTOR 被引量:1
1
作者 Chen Yanxiang Xie Jian 《Journal of Electronics(China)》 2012年第3期339-344,共6页
Emotion recognition from speech is an important field of research in human computer interaction. In this letter the framework of Support Vector Machines (SVM) with Gaussian Mixture Model (GMM) supervector is introduce... Emotion recognition from speech is an important field of research in human computer interaction. In this letter the framework of Support Vector Machines (SVM) with Gaussian Mixture Model (GMM) supervector is introduced for emotional speech recognition. Because of the importance of variance in reflecting the distribution of speech, the normalized mean vectors potential to exploit the information from the variance are adopted to form the GMM supervector. Comparative experiments from five aspects are conducted to study their corresponding effect to system performance. The experiment results, which indicate that the influence of number of mixtures is strong as well as influence of duration is weak, provide basis for the train set selection of Universal Background Model (UBM). 展开更多
关键词 emotional speech recognition Support Vector Machines (SVM) Gaussian Mixture Model (GMM) supervector Universal Background Model (USB)
下载PDF
Multi-Objective Equilibrium Optimizer for Feature Selection in High-Dimensional English Speech Emotion Recognition
2
作者 Liya Yue Pei Hu +1 位作者 Shu-Chuan Chu Jeng-Shyang Pan 《Computers, Materials & Continua》 SCIE EI 2024年第2期1957-1975,共19页
Speech emotion recognition(SER)uses acoustic analysis to find features for emotion recognition and examines variations in voice that are caused by emotions.The number of features acquired with acoustic analysis is ext... Speech emotion recognition(SER)uses acoustic analysis to find features for emotion recognition and examines variations in voice that are caused by emotions.The number of features acquired with acoustic analysis is extremely high,so we introduce a hybrid filter-wrapper feature selection algorithm based on an improved equilibrium optimizer for constructing an emotion recognition system.The proposed algorithm implements multi-objective emotion recognition with the minimum number of selected features and maximum accuracy.First,we use the information gain and Fisher Score to sort the features extracted from signals.Then,we employ a multi-objective ranking method to evaluate these features and assign different importance to them.Features with high rankings have a large probability of being selected.Finally,we propose a repair strategy to address the problem of duplicate solutions in multi-objective feature selection,which can improve the diversity of solutions and avoid falling into local traps.Using random forest and K-nearest neighbor classifiers,four English speech emotion datasets are employed to test the proposed algorithm(MBEO)as well as other multi-objective emotion identification techniques.The results illustrate that it performs well in inverted generational distance,hypervolume,Pareto solutions,and execution time,and MBEO is appropriate for high-dimensional English SER. 展开更多
关键词 speech emotion recognition filter-wrapper HIGH-DIMENSIONAL feature selection equilibrium optimizer MULTI-OBJECTIVE
下载PDF
Exploring Sequential Feature Selection in Deep Bi-LSTM Models for Speech Emotion Recognition
3
作者 Fatma Harby Mansor Alohali +1 位作者 Adel Thaljaoui Amira Samy Talaat 《Computers, Materials & Continua》 SCIE EI 2024年第2期2689-2719,共31页
Machine Learning(ML)algorithms play a pivotal role in Speech Emotion Recognition(SER),although they encounter a formidable obstacle in accurately discerning a speaker’s emotional state.The examination of the emotiona... Machine Learning(ML)algorithms play a pivotal role in Speech Emotion Recognition(SER),although they encounter a formidable obstacle in accurately discerning a speaker’s emotional state.The examination of the emotional states of speakers holds significant importance in a range of real-time applications,including but not limited to virtual reality,human-robot interaction,emergency centers,and human behavior assessment.Accurately identifying emotions in the SER process relies on extracting relevant information from audio inputs.Previous studies on SER have predominantly utilized short-time characteristics such as Mel Frequency Cepstral Coefficients(MFCCs)due to their ability to capture the periodic nature of audio signals effectively.Although these traits may improve their ability to perceive and interpret emotional depictions appropriately,MFCCS has some limitations.So this study aims to tackle the aforementioned issue by systematically picking multiple audio cues,enhancing the classifier model’s efficacy in accurately discerning human emotions.The utilized dataset is taken from the EMO-DB database,preprocessing input speech is done using a 2D Convolution Neural Network(CNN)involves applying convolutional operations to spectrograms as they afford a visual representation of the way the audio signal frequency content changes over time.The next step is the spectrogram data normalization which is crucial for Neural Network(NN)training as it aids in faster convergence.Then the five auditory features MFCCs,Chroma,Mel-Spectrogram,Contrast,and Tonnetz are extracted from the spectrogram sequentially.The attitude of feature selection is to retain only dominant features by excluding the irrelevant ones.In this paper,the Sequential Forward Selection(SFS)and Sequential Backward Selection(SBS)techniques were employed for multiple audio cues features selection.Finally,the feature sets composed from the hybrid feature extraction methods are fed into the deep Bidirectional Long Short Term Memory(Bi-LSTM)network to discern emotions.Since the deep Bi-LSTM can hierarchically learn complex features and increases model capacity by achieving more robust temporal modeling,it is more effective than a shallow Bi-LSTM in capturing the intricate tones of emotional content existent in speech signals.The effectiveness and resilience of the proposed SER model were evaluated by experiments,comparing it to state-of-the-art SER techniques.The results indicated that the model achieved accuracy rates of 90.92%,93%,and 92%over the Ryerson Audio-Visual Database of Emotional Speech and Song(RAVDESS),Berlin Database of Emotional Speech(EMO-DB),and The Interactive Emotional Dyadic Motion Capture(IEMOCAP)datasets,respectively.These findings signify a prominent enhancement in the ability to emotional depictions identification in speech,showcasing the potential of the proposed model in advancing the SER field. 展开更多
关键词 Artificial intelligence application multi features sequential selection speech emotion recognition deep Bi-LSTM
下载PDF
Multilayer Neural Network Based Speech Emotion Recognition for Smart Assistance 被引量:2
4
作者 Sandeep Kumar MohdAnul Haq +4 位作者 Arpit Jain C.Andy Jason Nageswara Rao Moparthi Nitin Mittal Zamil S.Alzamil 《Computers, Materials & Continua》 SCIE EI 2023年第1期1523-1540,共18页
Day by day,biometric-based systems play a vital role in our daily lives.This paper proposed an intelligent assistant intended to identify emotions via voice message.A biometric system has been developed to detect huma... Day by day,biometric-based systems play a vital role in our daily lives.This paper proposed an intelligent assistant intended to identify emotions via voice message.A biometric system has been developed to detect human emotions based on voice recognition and control a few electronic peripherals for alert actions.This proposed smart assistant aims to provide a support to the people through buzzer and light emitting diodes(LED)alert signals and it also keep track of the places like households,hospitals and remote areas,etc.The proposed approach is able to detect seven emotions:worry,surprise,neutral,sadness,happiness,hate and love.The key elements for the implementation of speech emotion recognition are voice processing,and once the emotion is recognized,the machine interface automatically detects the actions by buzzer and LED.The proposed system is trained and tested on various benchmark datasets,i.e.,Ryerson Audio-Visual Database of Emotional Speech and Song(RAVDESS)database,Acoustic-Phonetic Continuous Speech Corpus(TIMIT)database,Emotional Speech database(Emo-DB)database and evaluated based on various parameters,i.e.,accuracy,error rate,and time.While comparing with existing technologies,the proposed algorithm gave a better error rate and less time.Error rate and time is decreased by 19.79%,5.13 s.for the RAVDEES dataset,15.77%,0.01 s for the Emo-DB dataset and 14.88%,3.62 for the TIMIT database.The proposed model shows better accuracy of 81.02%for the RAVDEES dataset,84.23%for the TIMIT dataset and 85.12%for the Emo-DB dataset compared to Gaussian Mixture Modeling(GMM)and Support Vector Machine(SVM)Model. 展开更多
关键词 speech emotion recognition classifier implementation feature extraction and selection smart assistance
下载PDF
A Multi-Level Circulant Cross-Modal Transformer for Multimodal Speech Emotion Recognition 被引量:1
5
作者 Peizhu Gong Jin Liu +3 位作者 Zhongdai Wu Bing Han YKenWang Huihua He 《Computers, Materials & Continua》 SCIE EI 2023年第2期4203-4220,共18页
Speech emotion recognition,as an important component of humancomputer interaction technology,has received increasing attention.Recent studies have treated emotion recognition of speech signals as a multimodal task,due... Speech emotion recognition,as an important component of humancomputer interaction technology,has received increasing attention.Recent studies have treated emotion recognition of speech signals as a multimodal task,due to its inclusion of the semantic features of two different modalities,i.e.,audio and text.However,existing methods often fail in effectively represent features and capture correlations.This paper presents a multi-level circulant cross-modal Transformer(MLCCT)formultimodal speech emotion recognition.The proposed model can be divided into three steps,feature extraction,interaction and fusion.Self-supervised embedding models are introduced for feature extraction,which give a more powerful representation of the original data than those using spectrograms or audio features such as Mel-frequency cepstral coefficients(MFCCs)and low-level descriptors(LLDs).In particular,MLCCT contains two types of feature interaction processes,where a bidirectional Long Short-term Memory(Bi-LSTM)with circulant interaction mechanism is proposed for low-level features,while a two-stream residual cross-modal Transformer block is appliedwhen high-level features are involved.Finally,we choose self-attention blocks for fusion and a fully connected layer to make predictions.To evaluate the performance of our proposed model,comprehensive experiments are conducted on three widely used benchmark datasets including IEMOCAP,MELD and CMU-MOSEI.The competitive results verify the effectiveness of our approach. 展开更多
关键词 speech emotion recognition self-supervised embedding model cross-modal transformer self-attention
下载PDF
Improved Speech Emotion Recognition Focusing on High-Level Data Representations and Swift Feature Extraction Calculation
6
作者 Akmalbek Abdusalomov Alpamis Kutlimuratov +1 位作者 Rashid Nasimov Taeg Keun Whangbo 《Computers, Materials & Continua》 SCIE EI 2023年第12期2915-2933,共19页
The performance of a speech emotion recognition(SER)system is heavily influenced by the efficacy of its feature extraction techniques.The study was designed to advance the field of SER by optimizing feature extraction... The performance of a speech emotion recognition(SER)system is heavily influenced by the efficacy of its feature extraction techniques.The study was designed to advance the field of SER by optimizing feature extraction tech-niques,specifically through the incorporation of high-resolution Mel-spectrograms and the expedited calculation of Mel Frequency Cepstral Coefficients(MFCC).This initiative aimed to refine the system’s accuracy by identifying and mitigating the shortcomings commonly found in current approaches.Ultimately,the primary objective was to elevate both the intricacy and effectiveness of our SER model,with a focus on augmenting its proficiency in the accurate identification of emotions in spoken language.The research employed a dual-strategy approach for feature extraction.Firstly,a rapid computation technique for MFCC was implemented and integrated with a Bi-LSTM layer to optimize the encoding of MFCC features.Secondly,a pretrained ResNet model was utilized in conjunction with feature Stats pooling and dense layers for the effective encoding of Mel-spectrogram attributes.These two sets of features underwent separate processing before being combined in a Convolutional Neural Network(CNN)outfitted with a dense layer,with the aim of enhancing their representational richness.The model was rigorously evaluated using two prominent databases:CMU-MOSEI and RAVDESS.Notable findings include an accuracy rate of 93.2%on the CMU-MOSEI database and 95.3%on the RAVDESS database.Such exceptional performance underscores the efficacy of this innovative approach,which not only meets but also exceeds the accuracy benchmarks established by traditional models in the field of speech emotion recognition. 展开更多
关键词 Feature extraction MFCC ResNet speech emotion recognition
下载PDF
Using Speaker-Specific Emotion Representations in Wav2vec 2.0-Based Modules for Speech Emotion Recognition
7
作者 Somin Park Mpabulungi Mark +1 位作者 Bogyung Park Hyunki Hong 《Computers, Materials & Continua》 SCIE EI 2023年第10期1009-1030,共22页
Speech emotion recognition is essential for frictionless human-machine interaction,where machines respond to human instructions with context-aware actions.The properties of individuals’voices vary with culture,langua... Speech emotion recognition is essential for frictionless human-machine interaction,where machines respond to human instructions with context-aware actions.The properties of individuals’voices vary with culture,language,gender,and personality.These variations in speaker-specific properties may hamper the performance of standard representations in downstream tasks such as speech emotion recognition(SER).This study demonstrates the significance of speaker-specific speech characteristics and how considering them can be leveraged to improve the performance of SER models.In the proposed approach,two wav2vec-based modules(a speaker-identification network and an emotion classification network)are trained with the Arcface loss.The speaker-identification network has a single attention block to encode an input audio waveform into a speaker-specific representation.The emotion classification network uses a wav2vec 2.0-backbone as well as four attention blocks to encode the same input audio waveform into an emotion representation.These two representations are then fused into a single vector representation containing emotion and speaker-specific information.Experimental results showed that the use of speaker-specific characteristics improves SER performance.Additionally,combining these with an angular marginal loss such as the Arcface loss improves intra-class compactness while increasing inter-class separability,as demonstrated by the plots of t-distributed stochastic neighbor embeddings(t-SNE).The proposed approach outperforms previous methods using similar training strategies,with a weighted accuracy(WA)of 72.14%and unweighted accuracy(UA)of 72.97%on the Interactive Emotional Dynamic Motion Capture(IEMOCAP)dataset.This demonstrates its effectiveness and potential to enhance human-machine interaction through more accurate emotion recognition in speech. 展开更多
关键词 Attention block IEMOCAP dataset speaker-specific representation speech emotion recognition wav2vec 2.0
下载PDF
Performance Analysis of a Chunk-Based Speech Emotion Recognition Model Using RNN
8
作者 Hyun-Sam Shin Jun-Ki Hong 《Intelligent Automation & Soft Computing》 SCIE 2023年第4期235-248,共14页
Recently,artificial-intelligence-based automatic customer response sys-tem has been widely used instead of customer service representatives.Therefore,it is important for automatic customer service to promptly recognize... Recently,artificial-intelligence-based automatic customer response sys-tem has been widely used instead of customer service representatives.Therefore,it is important for automatic customer service to promptly recognize emotions in a customer’s voice to provide the appropriate service accordingly.Therefore,we analyzed the performance of the emotion recognition(ER)accuracy as a function of the simulation time using the proposed chunk-based speech ER(CSER)model.The proposed CSER model divides voice signals into 3-s long chunks to effi-ciently recognize characteristically inherent emotions in the customer’s voice.We evaluated the performance of the ER of voice signal chunks by applying four RNN techniques—long short-term memory(LSTM),bidirectional-LSTM,gated recurrent units(GRU),and bidirectional-GRU—to the proposed CSER model individually to assess its ER accuracy and time efficiency.The results reveal that GRU shows the best time efficiency in recognizing emotions from speech signals in terms of accuracy as a function of simulation time. 展开更多
关键词 RNN speech emotion recognition attention mechanism time efficiency
下载PDF
Design of Hierarchical Classifier to Improve Speech Emotion Recognition
9
作者 P.Vasuki 《Computer Systems Science & Engineering》 SCIE EI 2023年第1期19-33,共15页
Automatic Speech Emotion Recognition(SER)is used to recognize emotion from speech automatically.Speech Emotion recognition is working well in a laboratory environment but real-time emotion recognition has been influen... Automatic Speech Emotion Recognition(SER)is used to recognize emotion from speech automatically.Speech Emotion recognition is working well in a laboratory environment but real-time emotion recognition has been influenced by the variations in gender,age,the cultural and acoustical background of the speaker.The acoustical resemblance between emotional expressions further increases the complexity of recognition.Many recent research works are concentrated to address these effects individually.Instead of addressing every influencing attribute individually,we would like to design a system,which reduces the effect that arises on any factor.We propose a two-level Hierarchical classifier named Interpreter of responses(IR).Thefirst level of IR has been realized using Support Vector Machine(SVM)and Gaussian Mixer Model(GMM)classifiers.In the second level of IR,a discriminative SVM classifier has been trained and tested with meta information offirst-level classifiers along with the input acoustical feature vector which is used in primary classifiers.To train the system with a corpus of versatile nature,an integrated emotion corpus has been composed using emotion samples of 5 speech corpora,namely;EMO-DB,IITKGP-SESC,SAVEE Corpus,Spanish emotion corpus,CMU's Woogle corpus.The hierarchical classifier has been trained and tested using MFCC and Low-Level Descriptors(LLD).The empirical analysis shows that the proposed classifier outperforms the traditional classifiers.The proposed ensemble design is very generic and can be adapted even when the number and nature of features change.Thefirst-level classifiers GMM or SVM may be replaced with any other learning algorithm. 展开更多
关键词 speech emotion recognition hierarchical classifier design ENSEMBLE emotion speech corpora
下载PDF
Enhancing Human-Machine Interaction:Real-Time Emotion Recognition through Speech Analysis
10
作者 Dominik Esteves de Andrade Rüdiger Buchkremer 《Journal of Computer Science Research》 2023年第3期22-45,共24页
Humans,as intricate beings driven by a multitude of emotions,possess a remarkable ability to decipher and respond to socio-affective cues.However,many individuals and machines struggle to interpret such nuanced signal... Humans,as intricate beings driven by a multitude of emotions,possess a remarkable ability to decipher and respond to socio-affective cues.However,many individuals and machines struggle to interpret such nuanced signals,including variations in tone of voice.This paper explores the potential of intelligent technologies to bridge this gap and improve the quality of conversations.In particular,the authors propose a real-time processing method that captures and evaluates emotions in speech,utilizing a terminal device like the Raspberry Pi computer.Furthermore,the authors provide an overview of the current research landscape surrounding speech emotional recognition and delve into our methodology,which involves analyzing audio files from renowned emotional speech databases.To aid incomprehension,the authors present visualizations of these audio files in situ,employing dB-scaled Mel spectrograms generated through TensorFlow and Matplotlib.The authors use a support vector machine kernel and a Convolutional Neural Network with transfer learning to classify emotions.Notably,the classification accuracies achieved are 70% and 77%,respectively,demonstrating the efficacy of our approach when executed on an edge device rather than relying on a server.The system can evaluate pure emotion in speech and provide corresponding visualizations to depict the speaker’s emotional state in less than one second on a Raspberry Pi.These findings pave the way for more effective and emotionally intelligent human-machine interactions in various domains. 展开更多
关键词 speech emotion recognition Edge computing Real-time computing Raspberry Pi
下载PDF
A novel speech emotion recognition algorithm based on combination of emotion data field and ant colony search strategy 被引量:3
11
作者 查诚 陶华伟 +3 位作者 张昕然 周琳 赵力 杨平 《Journal of Southeast University(English Edition)》 EI CAS 2016年第2期158-163,共6页
In order to effectively conduct emotion recognition from spontaneous, non-prototypical and unsegmented speech so as to create a more natural human-machine interaction; a novel speech emotion recognition algorithm base... In order to effectively conduct emotion recognition from spontaneous, non-prototypical and unsegmented speech so as to create a more natural human-machine interaction; a novel speech emotion recognition algorithm based on the combination of the emotional data field (EDF) and the ant colony search (ACS) strategy, called the EDF-ACS algorithm, is proposed. More specifically, the inter- relationship among the turn-based acoustic feature vectors of different labels are established by using the potential function in the EDF. To perform the spontaneous speech emotion recognition, the artificial colony is used to mimic the turn- based acoustic feature vectors. Then, the canonical ACS strategy is used to investigate the movement direction of each artificial ant in the EDF, which is regarded as the emotional label of the corresponding turn-based acoustic feature vector. The proposed EDF-ACS algorithm is evaluated on the continueous audio)'visual emotion challenge (AVEC) 2012 dataset, which contains the spontaneous, non-prototypical and unsegmented speech emotion data. The experimental results show that the proposed EDF-ACS algorithm outperforms the existing state-of-the-art algorithm in turn-based speech emotion recognition. 展开更多
关键词 speech emotion recognition emotional data field ant colony search human-machine interaction
下载PDF
Auditory attention model based on Chirplet for cross-corpus speech emotion recognition 被引量:1
12
作者 张昕然 宋鹏 +2 位作者 查诚 陶华伟 赵力 《Journal of Southeast University(English Edition)》 EI CAS 2016年第4期402-407,共6页
To solve the problem of mismatching features in an experimental database, which is a key technique in the field of cross-corpus speech emotion recognition, an auditory attention model based on Chirplet is proposed for... To solve the problem of mismatching features in an experimental database, which is a key technique in the field of cross-corpus speech emotion recognition, an auditory attention model based on Chirplet is proposed for feature extraction.First, in order to extract the spectra features, the auditory attention model is employed for variational emotion features detection. Then, the selective attention mechanism model is proposed to extract the salient gist features which showtheir relation to the expected performance in cross-corpus testing.Furthermore, the Chirplet time-frequency atoms are introduced to the model. By forming a complete atom database, the Chirplet can improve the spectrum feature extraction including the amount of information. Samples from multiple databases have the characteristics of multiple components. Hereby, the Chirplet expands the scale of the feature vector in the timefrequency domain. Experimental results show that, compared to the traditional feature model, the proposed feature extraction approach with the prototypical classifier has significant improvement in cross-corpus speech recognition. In addition, the proposed method has better robustness to the inconsistent sources of the training set and the testing set. 展开更多
关键词 speech emotion recognition selective attention mechanism spectrogram feature cross-corpus
下载PDF
Speech emotion recognition via discriminant-cascading dimensionality reduction 被引量:1
13
作者 王如刚 徐新洲 +3 位作者 黄程韦 吴尘 张昕然 赵力 《Journal of Southeast University(English Edition)》 EI CAS 2016年第2期151-157,共7页
In order to accurately identify speech emotion information, the discriminant-cascading effect in dimensionality reduction of speech emotion recognition is investigated. Based on the existing locality preserving projec... In order to accurately identify speech emotion information, the discriminant-cascading effect in dimensionality reduction of speech emotion recognition is investigated. Based on the existing locality preserving projections and graph embedding framework, a novel discriminant-cascading dimensionality reduction method is proposed, which is named discriminant-cascading locality preserving projections (DCLPP). The proposed method specifically utilizes supervised embedding graphs and it keeps the original space for the inner products of samples to maintain enough information for speech emotion recognition. Then, the kernel DCLPP (KDCLPP) is also proposed to extend the mapping form. Validated by the experiments on the corpus of EMO-DB and eNTERFACE'05, the proposed method can clearly outperform the existing common dimensionality reduction methods, such as principal component analysis (PCA), linear discriminant analysis (LDA), locality preserving projections (LPP), local discriminant embedding (LDE), graph-based Fisher analysis (GbFA) and so on, with different categories of classifiers. 展开更多
关键词 speech emotion recognition discriminant-cascading locality preserving projections DISCRIMINANTANALYSIS dimensionality reduction
下载PDF
Novel feature fusion method for speech emotion recognition based on multiple kernel learning
14
作者 金赟 宋鹏 +1 位作者 郑文明 赵力 《Journal of Southeast University(English Edition)》 EI CAS 2013年第2期129-133,共5页
In order to improve the performance of speech emotion recognition, a novel feature fusion method is proposed. Based on the global features, the local information of different kinds of features is utilized. Both the gl... In order to improve the performance of speech emotion recognition, a novel feature fusion method is proposed. Based on the global features, the local information of different kinds of features is utilized. Both the global and the local features are combined together. Moreover, the multiple kernel learning method is adopted. The global features and each kind of local feature are respectively associated with a kernel, and all these kernels are added together with different weights to obtain a mixed kernel for nonlinear mapping. In the reproducing kernel Hilbert space, different kinds of emotional features can be easily classified. In the experiments, the popular Berlin dataset is used, and the optimal parameters of the global and the local kernels are determined by cross-validation. After computing using multiple kernel learning, the weights of all the kernels are obtained, which shows that the formant and intensity features play a key role in speech emotion recognition. The classification results show that the recognition rate is 78. 74% by using the global kernel, and it is 81.10% by using the proposed method, which demonstrates the effectiveness of the proposed method. 展开更多
关键词 speech emotion recognition multiple kemellearning feature fusion support vector machine
下载PDF
SPEECH EMOTION RECOGNITION USING MODIFIED QUADRATIC DISCRIMINATION FUNCTION 被引量:9
15
作者 Zhao Yan Zhao Li Zou Cairong Yu Yinhua 《Journal of Electronics(China)》 2008年第6期840-844,共5页
Quadratic Discrimination Function (QDF) is commonly used in speech emotion recognition, which proceeds on the premise that the input data is normal distribution. In this paper, we propose a transformation to normali... Quadratic Discrimination Function (QDF) is commonly used in speech emotion recognition, which proceeds on the premise that the input data is normal distribution. In this paper, we propose a transformation to normalize the emotional features, emotion recognition. Features based on prosody then derivate a Modified QDF (MQDF) to speech and voice quality are extracted and Principal Component Analysis Neural Network (PCANN) is used to reduce dimension of the feature vectors. The results show that voice quality features are effective supplement for recognition, and the method in this paper could improve the recognition ratio effectively. 展开更多
关键词 speech emotion recognition Principal Component Analysis Neural Network (PCANN) Modified Quadratic Discrimination Function (MQDF)
下载PDF
Transformer-like model with linear attention for speech emotion recognition 被引量:3
16
作者 Du Jing Tang Manting Zhao Li 《Journal of Southeast University(English Edition)》 EI CAS 2021年第2期164-170,共7页
Because of the excellent performance of Transformer in sequence learning tasks,such as natural language processing,an improved Transformer-like model is proposed that is suitable for speech emotion recognition tasks.T... Because of the excellent performance of Transformer in sequence learning tasks,such as natural language processing,an improved Transformer-like model is proposed that is suitable for speech emotion recognition tasks.To alleviate the prohibitive time consumption and memory footprint caused by softmax inside the multihead attention unit in Transformer,a new linear self-attention algorithm is proposed.The original exponential function is replaced by a Taylor series expansion formula.On the basis of the associative property of matrix products,the time and space complexity of softmax operation regarding the input's length is reduced from O(N2)to O(N),where N is the sequence length.Experimental results on the emotional corpora of two languages show that the proposed linear attention algorithm can achieve similar performance to the original scaled dot product attention,while the training time and memory cost are reduced by half.Furthermore,the improved model obtains more robust performance on speech emotion recognition compared with the original Transformer. 展开更多
关键词 TRANSFORMER attention mechanism speech emotion recognition fast softmax
下载PDF
Multi-head attention-based long short-term memory model for speech emotion recognition 被引量:1
17
作者 Zhao Yan Zhao Li +3 位作者 Lu Cheng Li Sunan Tang Chuangao Lian Hailun 《Journal of Southeast University(English Edition)》 EI CAS 2022年第2期103-109,共7页
To fully make use of information from different representation subspaces,a multi-head attention-based long short-term memory(LSTM)model is proposed in this study for speech emotion recognition(SER).The proposed model ... To fully make use of information from different representation subspaces,a multi-head attention-based long short-term memory(LSTM)model is proposed in this study for speech emotion recognition(SER).The proposed model uses frame-level features and takes the temporal information of emotion speech as the input of the LSTM layer.Here,a multi-head time-dimension attention(MHTA)layer was employed to linearly project the output of the LSTM layer into different subspaces for the reduced-dimension context vectors.To provide relative vital information from other dimensions,the output of MHTA,the output of feature-dimension attention,and the last time-step output of LSTM were utilized to form multiple context vectors as the input of the fully connected layer.To improve the performance of multiple vectors,feature-dimension attention was employed for the all-time output of the first LSTM layer.The proposed model was evaluated on the eNTERFACE and GEMEP corpora,respectively.The results indicate that the proposed model outperforms LSTM by 14.6%and 10.5%for eNTERFACE and GEMEP,respectively,proving the effectiveness of the proposed model in SER tasks. 展开更多
关键词 speech emotion recognition long short-term memory(LSTM) multi-head attention mechanism frame-level features self-attention
下载PDF
Self-attention transfer networks for speech emotion recognition 被引量:4
18
作者 Ziping ZHAO Keru Wang +6 位作者 Zhongtian BAO Zixing ZHANG Nicholas CUMMINS Shihuang SUN Haishuai WANG Jianhua TAO Björn WSCHULLER 《Virtual Reality & Intelligent Hardware》 2021年第1期43-54,共12页
Background A crucial element of human-machine interaction,the automatic detection of emotional states from human speech has long been regarded as a challenging task for machine learning models.One vital challenge in s... Background A crucial element of human-machine interaction,the automatic detection of emotional states from human speech has long been regarded as a challenging task for machine learning models.One vital challenge in speech emotion recognition(SER)is learning robust and discriminative representations from speech.Although machine learning methods have been widely applied in SER research,the inadequate amount of available annotated data has become a bottleneck impeding the extended application of such techniques(e.g.,deep neural networks).To address this issue,we present a deep learning method that combines knowledge transfer and self-attention for SER tasks.Herein,we apply the log-Mel spectrogram with deltas and delta-deltas as inputs.Moreover,given that emotions are time dependent,we apply temporal convolutional neural networks to model the variations in emotions.We further introduce an attention transfer mechanism,which is based on a self-attention algorithm to learn long-term dependencies.The self-attention transfer network(SATN)in our proposed approach takes advantage of attention transfer to learn attention from speech recognition,followed by transferring this knowledge into SER.An evaluation built on Interactive Emotional Dyadic Motion Capture(IEMOCAP)dataset demonstrates the effectiveness of the proposed model. 展开更多
关键词 speech emotion recognition Attention transfer Self-attention Temporal convolutional neural networks(TCNs)
下载PDF
Multi-scale discrepancy adversarial network for cross-corpus speech emotion recognition 被引量:2
19
作者 Wanlu ZHENG Wenming ZHENG Yuan ZONG 《Virtual Reality & Intelligent Hardware》 2021年第1期65-75,共11页
Background One of the most critical issues in human-computer interaction applications is recognizing human emotions based on speech.In recent years,the challenging problem of cross-corpus speech emotion recognition(SE... Background One of the most critical issues in human-computer interaction applications is recognizing human emotions based on speech.In recent years,the challenging problem of cross-corpus speech emotion recognition(SER)has generated extensive research.Nevertheless,the domain discrepancy between training data and testing data remains a major challenge to achieving improved system performance.Methods This paper introduces a novel multi-scale discrepancy adversarial(MSDA)network for conducting multiple timescales domain adaptation for cross-corpus SER,i.e.,integrating domain discriminators of hierarchical levels into the emotion recognition framework to mitigate the gap between the source and target domains.Specifically,we extract two kinds of speech features,i.e.,handcraft features and deep features,from three timescales of global,local,and hybrid levels.In each timescale,the domain discriminator and the feature extrator compete against each other to learn features that minimize the discrepancy between the two domains by fooling the discriminator.Results Extensive experiments on cross-corpus and cross-language SER were conducted on a combination dataset that combines one Chinese dataset and two English datasets commonly used in SER.The MSDA is affected by the strong discriminate power provided by the adversarial process,where three discriminators are working in tandem with an emotion classifier.Accordingly,the MSDA achieves the best performance over all other baseline methods.Conclusions The proposed architecture was tested on a combination of one Chinese and two English datasets.The experimental results demonstrate the superiority of our powerful discriminative model for solving cross-corpus SER. 展开更多
关键词 Human-computer interaction Cross-corpus speech emotion recognition Hierarchical discri minators Domain adaptation
下载PDF
Transfer learning with deep sparse auto-encoder for speech emotion recognition
20
作者 Liang Zhenlin Liang Ruiyu +3 位作者 Tang Manting Xie Yue Zhao Li Wang Shijia 《Journal of Southeast University(English Edition)》 EI CAS 2019年第2期160-167,共8页
In order to improve the efficiency of speech emotion recognition across corpora,a speech emotion transfer learning method based on the deep sparse auto-encoder is proposed.The algorithm first reconstructs a small amou... In order to improve the efficiency of speech emotion recognition across corpora,a speech emotion transfer learning method based on the deep sparse auto-encoder is proposed.The algorithm first reconstructs a small amount of data in the target domain by training the deep sparse auto-encoder,so that the encoder can learn the low-dimensional structural representation of the target domain data.Then,the source domain data and the target domain data are coded by the trained deep sparse auto-encoder to obtain the reconstruction data of the low-dimensional structural representation close to the target domain.Finally,a part of the reconstructed tagged target domain data is mixed with the reconstructed source domain data to jointly train the classifier.This part of the target domain data is used to guide the source domain data.Experiments on the CASIA,SoutheastLab corpus show that the model recognition rate after a small amount of data transferred reached 89.2%and 72.4%on the DNN.Compared to the training results of the complete original corpus,it only decreased by 2%in the CASIA corpus,and only 3.4%in the SoutheastLab corpus.Experiments show that the algorithm can achieve the effect of labeling all data in the extreme case that the data set has only a small amount of data tagged. 展开更多
关键词 sparse auto-encoder transfer learning speech emotion recognition
下载PDF
上一页 1 2 下一页 到第
使用帮助 返回顶部