Journal Articles
23,888 articles found
Application of TTS in Vehicle-Mounted Passenger Information Systems
1
Author: 汤俊芹. 《电声技术》, 2024, No. 1, pp. 25-28 (4 pages)
With the development of Text-to-Speech (TTS) technology, synthesized speech can now match the quality of a live human announcer. On this basis, this paper proposes applying TTS technology in vehicle-mounted passenger information systems, replacing the traditional approach of announcing stops with pre-recorded audio files and greatly improving the flexibility and maintainability of voice announcements.
Keywords: Text-to-Speech (TTS); passenger information system; speech quality
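Illustrative sketch (not from the paper): the abstract's point is that announcements are synthesized from text at run time instead of played from pre-recorded files. A minimal example using the open-source pyttsx3 engine as a stand-in; the paper does not name an engine, and the stop name and speaking rate below are assumptions.

```python
# Dynamic stop announcement synthesized from text, replacing pre-recorded audio files.
# pyttsx3 is used only as an illustrative offline TTS engine.
import pyttsx3

def announce_stop(next_stop: str) -> None:
    engine = pyttsx3.init()
    engine.setProperty("rate", 150)   # speaking rate (words per minute), illustrative value
    engine.say(f"The next stop is {next_stop}. Please prepare to alight.")
    engine.runAndWait()

if __name__ == "__main__":
    announce_stop("Central Station")  # any stop name works without recording new audio
```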
Multi-Objective Equilibrium Optimizer for Feature Selection in High-Dimensional English Speech Emotion Recognition
2
Authors: Liya Yue, Pei Hu, Shu-Chuan Chu, Jeng-Shyang Pan. 《Computers, Materials & Continua》 (SCIE, EI), 2024, No. 2, pp. 1957-1975 (19 pages)
Speech emotion recognition (SER) uses acoustic analysis to find features for emotion recognition and examines variations in voice that are caused by emotions. The number of features acquired with acoustic analysis is extremely high, so we introduce a hybrid filter-wrapper feature selection algorithm based on an improved equilibrium optimizer for constructing an emotion recognition system. The proposed algorithm implements multi-objective emotion recognition with the minimum number of selected features and maximum accuracy. First, we use the information gain and Fisher score to sort the features extracted from the signals. Then, we employ a multi-objective ranking method to evaluate these features and assign different importance to them. Features with high rankings have a large probability of being selected. Finally, we propose a repair strategy to address the problem of duplicate solutions in multi-objective feature selection, which can improve the diversity of solutions and avoid falling into local traps. Using random forest and K-nearest neighbor classifiers, four English speech emotion datasets are employed to test the proposed algorithm (MBEO) as well as other multi-objective emotion identification techniques. The results illustrate that it performs well in inverted generational distance, hypervolume, Pareto solutions, and execution time, and that MBEO is appropriate for high-dimensional English SER.
Keywords: speech emotion recognition; filter-wrapper; high-dimensional; feature selection; equilibrium optimizer; multi-objective
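Illustrative sketch (not the authors' MBEO code): the filter stage described above ranks features by information gain and Fisher score before the wrapper search. A minimal version with scikit-learn and NumPy; the equal weighting of the two scores and the cut-off k are assumptions.

```python
# Filter stage: rank acoustic features by mutual information (information gain) and
# Fisher score, then keep the top-k for the wrapper/optimizer stage.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def fisher_score(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Per-feature ratio of between-class variance to within-class variance."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - mean_all) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / (den + 1e-12)

def rank_features(X: np.ndarray, y: np.ndarray, k: int = 50) -> np.ndarray:
    ig = mutual_info_classif(X, y)
    fs = fisher_score(X, y)
    score = ig / (ig.max() + 1e-12) + fs / (fs.max() + 1e-12)  # assumed equal weighting
    return np.argsort(score)[::-1][:k]                          # indices of the top-k features
```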
Exploring Sequential Feature Selection in Deep Bi-LSTM Models for Speech Emotion Recognition
3
Authors: Fatma Harby, Mansor Alohali, Adel Thaljaoui, Amira Samy Talaat. 《Computers, Materials & Continua》 (SCIE, EI), 2024, No. 2, pp. 2689-2719 (31 pages)
Machine Learning (ML) algorithms play a pivotal role in Speech Emotion Recognition (SER), although they encounter a formidable obstacle in accurately discerning a speaker's emotional state. The examination of the emotional states of speakers holds significant importance in a range of real-time applications, including but not limited to virtual reality, human-robot interaction, emergency centers, and human behavior assessment. Accurately identifying emotions in the SER process relies on extracting relevant information from audio inputs. Previous studies on SER have predominantly utilized short-time characteristics such as Mel Frequency Cepstral Coefficients (MFCCs) due to their ability to capture the periodic nature of audio signals effectively. Although these traits may improve the ability to perceive and interpret emotional depictions appropriately, MFCCs have some limitations. This study therefore aims to tackle that issue by systematically picking multiple audio cues, enhancing the classifier model's efficacy in accurately discerning human emotions. The utilized dataset is taken from the EMO-DB database; preprocessing of the input speech is done using a 2D Convolutional Neural Network (CNN), which applies convolutional operations to spectrograms, as they afford a visual representation of how the frequency content of the audio signal changes over time. The next step is spectrogram data normalization, which is crucial for Neural Network (NN) training as it aids faster convergence. Then the five auditory features MFCCs, Chroma, Mel-Spectrogram, Contrast, and Tonnetz are extracted from the spectrogram sequentially. The aim of feature selection is to retain only dominant features by excluding the irrelevant ones. In this paper, the Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS) techniques were employed for multiple-audio-cue feature selection. Finally, the feature sets composed from the hybrid feature extraction methods are fed into a deep Bidirectional Long Short-Term Memory (Bi-LSTM) network to discern emotions. Since a deep Bi-LSTM can hierarchically learn complex features and increases model capacity through more robust temporal modeling, it is more effective than a shallow Bi-LSTM in capturing the intricate tones of emotional content present in speech signals. The effectiveness and resilience of the proposed SER model were evaluated by experiments comparing it to state-of-the-art SER techniques. The results indicated that the model achieved accuracy rates of 90.92%, 93%, and 92% on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EMO-DB), and the Interactive Emotional Dyadic Motion Capture (IEMOCAP) datasets, respectively. These findings signify a prominent enhancement in the ability to identify emotional depictions in speech, showcasing the potential of the proposed model in advancing the SER field.
Keywords: artificial intelligence application; multi-feature sequential selection; speech emotion recognition; deep Bi-LSTM
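Illustrative sketch (not the authors' pipeline): the five auditory cues and the sequential forward selection step described above can be prototyped with librosa and scikit-learn. A small SVM stands in for the deep Bi-LSTM because scikit-learn's selector needs an estimator to score candidate subsets; the file path, n_mfcc, and n_features_to_select values are assumptions.

```python
# Extract MFCC, Chroma, Mel-spectrogram, Contrast and Tonnetz cues, then run
# sequential forward selection (SFS) over the pooled feature dimensions.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.feature_selection import SequentialFeatureSelector

def extract_cues(path: str) -> np.ndarray:
    """One mean-pooled feature vector per utterance."""
    y, sr = librosa.load(path, sr=None)
    feats = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),
        librosa.feature.chroma_stft(y=y, sr=sr),
        librosa.feature.melspectrogram(y=y, sr=sr),
        librosa.feature.spectral_contrast(y=y, sr=sr),
        librosa.feature.tonnetz(y=y, sr=sr),
    ]
    return np.concatenate([f.mean(axis=1) for f in feats])

def select_forward(X: np.ndarray, y: np.ndarray, k: int = 40) -> np.ndarray:
    sfs = SequentialFeatureSelector(SVC(), n_features_to_select=k, direction="forward")
    sfs.fit(X, y)
    return sfs.get_support(indices=True)   # indices of the retained feature dimensions
```

Swapping direction="forward" for "backward" gives the SBS variant mentioned in the abstract.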
Audio-Text Multimodal Speech Recognition via Dual-Tower Architecture for Mandarin Air Traffic Control Communications
4
Authors: Shuting Ge, Jin Ren, Yihua Shi, Yujun Zhang, Shunzhi Yang, Jinfeng Yang. 《Computers, Materials & Continua》 (SCIE, EI), 2024, No. 3, pp. 3215-3245 (31 pages)
In air traffic control communications (ATCC), misunderstandings between pilots and controllers could result in fatal aviation accidents. Fortunately, advanced automatic speech recognition technology has emerged as a promising means of preventing miscommunications and enhancing aviation safety. However, most existing speech recognition methods merely incorporate external language models on the decoder side, leading to insufficient semantic alignment between the speech and text modalities during the encoding phase. Furthermore, it is challenging to model acoustic context dependencies over long distances because speech sequences are longer than text, especially for the extended ATCC data. To address these issues, we propose a speech-text multimodal dual-tower architecture for speech recognition. It employs cross-modal interactions to achieve close semantic alignment during the encoding stage and strengthens its capability to model auditory long-distance context dependencies. In addition, a two-stage training strategy is devised to derive semantics-aware acoustic representations effectively. The first stage focuses on pre-training the speech-text multimodal encoding module to enhance inter-modal semantic alignment and aural long-distance context dependencies. The second stage fine-tunes the entire network to bridge the input modality variation gap between the training and inference phases and boost generalization performance. Extensive experiments demonstrate the effectiveness of the proposed speech-text multimodal speech recognition method on the ATCC and AISHELL-1 datasets. It reduces the character error rate to 6.54% and 8.73%, respectively, and exhibits substantial performance gains of 28.76% and 23.82% compared with the best baseline model. The case studies indicate that the obtained semantics-aware acoustic representations aid in accurately recognizing terms with similar pronunciations but distinctive semantics. The research provides a novel modeling paradigm for semantics-aware speech recognition in air traffic control communications, which could contribute to the advancement of intelligent and efficient aviation safety management.
Keywords: speech-text multimodal; automatic speech recognition; semantic alignment; air traffic control communications; dual-tower architecture
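Illustrative sketch (generic, not the authors' code): the character error rate (CER) reported above is the Levenshtein edit distance between the hypothesis and the reference, divided by the reference length.

```python
# Standard character error rate: CER = (substitutions + deletions + insertions) / N.
def edit_distance(ref: str, hyp: str) -> int:
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def cer(ref: str, hyp: str) -> float:
    return edit_distance(ref, hyp) / max(len(ref), 1)
```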
Predictive Value of RDW Combined with the TTS Score for ARDS Caused by Pulmonary Contusion
5
Authors: 刘哲, 王彤. 《中国急救复苏与灾害医学杂志》, 2024, No. 3, pp. 331-335 (5 pages)
Objective: To explore the predictive value of red cell distribution width (RDW) combined with the thoracic trauma severity (TTS) score for acute respiratory distress syndrome (ARDS) caused by pulmonary contusion. Methods: Clinical data of patients meeting the inclusion criteria in the department of critical care medicine of Xuzhou Renci Hospital from February 2022 to February 2023 were collected, including sex, age, height, weight, white blood cell count, RDW, C-reactive protein, arterial blood gas, oxygenation index, liver function, renal function, chest CT, cause of trauma, medical history, complications, and the TTS score. Patients were divided into an ARDS group and a non-ARDS group according to the Berlin criteria, the general clinical data of the two groups were analyzed, and ROC curves were plotted for RDW, the TTS score, and RDW combined with the TTS score. Results: 56 patients met the inclusion criteria, including 42 men and 14 women; 25 developed ARDS and 31 did not. The mechanisms of injury were traffic accidents in 37 cases, falls in 9, blast injuries in 4, crush injuries from heavy objects in 2, and other trauma in 4. In both groups, more men than women suffered pulmonary contusion, and traffic accidents were the most common cause of contusion. Differences in the TTS score, RDW, WBC, and PaO2/FiO2 were statistically significant (P<0.05); WBC, TTS, and RDW/ALB were positively correlated (r=0.186, P=0.05; r=0.648, P=0.001; r=0.812, P=0.003) and negatively correlated with PaO2/FiO2 (r=-0.013, P=0.006). The AUCs of RDW, the TTS score, and RDW combined with the TTS score were 0.811, 0.966, and 0.976, respectively; the AUC of the combined score (0.976) was higher than that of RDW or the TTS score alone. Conclusion: RDW, the TTS score, and their combination all have strong predictive value for ARDS, and the combined RDW + TTS score has better specificity and sensitivity for predicting ARDS after pulmonary contusion than either single clinical indicator.
Keywords: pulmonary contusion; acute respiratory distress syndrome (ARDS); red cell distribution width (RDW); thoracic trauma severity score (TTS)
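Illustrative sketch (not the study's analysis code): one common way to score the combined RDW + TTS predictor for the ROC comparison described above is a logistic regression over the two variables; the data arrays here are placeholders.

```python
# Compare AUCs of RDW alone, the TTS score alone, and a combined logistic-regression score.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def compare_aucs(rdw: np.ndarray, tts: np.ndarray, ards: np.ndarray) -> dict:
    """ards is a 0/1 outcome array; rdw and tts are the per-patient measurements."""
    X = np.column_stack([rdw, tts])
    combined = LogisticRegression().fit(X, ards).predict_proba(X)[:, 1]
    return {
        "RDW": roc_auc_score(ards, rdw),
        "TTS": roc_auc_score(ards, tts),
        "RDW+TTS": roc_auc_score(ards, combined),
    }
```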
Research on the Application of Second Language Acquisition Theory in College English Speech Teaching
6
Author: Hui Zhang. 《Journal of Contemporary Educational Research》, 2024, No. 3, pp. 173-178 (6 pages)
The teaching of English speeches in universities aims to enhance oral communication ability, improve English communication skills, and expand English knowledge, occupying a core position in university English teaching. Taking the theory of second language acquisition as its background, this article analyzes the important role and value of this theory in English speech teaching in universities and explores how to apply it in practice. It aims to strengthen the cultivation of skilled English talents and provide a brief reference for improving English speech teaching in universities.
Keywords: second language acquisition theory; teaching English speeches in universities; practical strategies
An Intelligent Automatic English Translation System Based on TTS Technology
7
Author: 王渭刚. 《信息技术》, 2023, No. 3, pp. 117-121, 127 (6 pages)
This paper presents the design of an intelligent automatic English translation system based on TTS technology. A text-to-speech converter and a speech processor are selected and configured, and on this basis TTS technology (text analysis, prosody control, and speech synthesis) is introduced. Combined with English translation requirements, the system software modules are designed, including a continuous-speech automatic segmentation and labeling module, a speech prosody control module, a speech synthesis module, and a voice-library pruning module. Through the design of these hardware units and software modules, the intelligent automatic English translation system is brought into operation. Experimental data show that, compared with the baseline systems, the designed system yields smaller deviations in the prosody control parameters and larger speech naturalness factors, indicating that its synthesized English translation speech is more accurate.
Keywords: text analysis; English translation; automatic speech segmentation and labeling; voice-library pruning; speech prosody control
Multilayer Neural Network Based Speech Emotion Recognition for Smart Assistance (Cited by 2)
8
Authors: Sandeep Kumar, MohdAnul Haq, Arpit Jain, C. Andy Jason, Nageswara Rao Moparthi, Nitin Mittal, Zamil S. Alzamil. 《Computers, Materials & Continua》 (SCIE, EI), 2023, No. 1, pp. 1523-1540 (18 pages)
Day by day, biometric-based systems play a vital role in our daily lives. This paper proposes an intelligent assistant intended to identify emotions via voice messages. A biometric system has been developed to detect human emotions based on voice recognition and to control a few electronic peripherals for alert actions. The proposed smart assistant aims to support people through buzzer and light-emitting diode (LED) alert signals, and it also keeps track of places such as households, hospitals, and remote areas. The proposed approach is able to detect seven emotions: worry, surprise, neutral, sadness, happiness, hate, and love. The key element in the implementation of speech emotion recognition is voice processing, and once the emotion is recognized, the machine interface automatically triggers the buzzer and LED actions. The proposed system is trained and tested on various benchmark datasets, i.e., the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Acoustic-Phonetic Continuous Speech Corpus (TIMIT), and the Emotional Speech Database (Emo-DB), and it is evaluated on various parameters, i.e., accuracy, error rate, and time. Compared with existing technologies, the proposed algorithm gave a better error rate and less time: the error rate and time are decreased by 19.79% and 5.13 s for the RAVDESS dataset, 15.77% and 0.01 s for the Emo-DB dataset, and 14.88% and 3.62 s for the TIMIT database. The proposed model shows better accuracy of 81.02% for the RAVDESS dataset, 84.23% for the TIMIT dataset, and 85.12% for the Emo-DB dataset compared to the Gaussian Mixture Model (GMM) and Support Vector Machine (SVM) models.
Keywords: speech emotion recognition; classifier implementation; feature extraction and selection; smart assistance
A Multi-Level Circulant Cross-Modal Transformer for Multimodal Speech Emotion Recognition (Cited by 1)
9
Authors: Peizhu Gong, Jin Liu, Zhongdai Wu, Bing Han, YKenWang, Huihua He. 《Computers, Materials & Continua》 (SCIE, EI), 2023, No. 2, pp. 4203-4220 (18 pages)
Speech emotion recognition, as an important component of human-computer interaction technology, has received increasing attention. Recent studies have treated emotion recognition of speech signals as a multimodal task, due to its inclusion of the semantic features of two different modalities, i.e., audio and text. However, existing methods often fail to effectively represent features and capture correlations. This paper presents a multi-level circulant cross-modal Transformer (MLCCT) for multimodal speech emotion recognition. The proposed model can be divided into three steps: feature extraction, interaction, and fusion. Self-supervised embedding models are introduced for feature extraction, which give a more powerful representation of the original data than those using spectrograms or audio features such as Mel-frequency cepstral coefficients (MFCCs) and low-level descriptors (LLDs). In particular, MLCCT contains two types of feature interaction processes: a bidirectional Long Short-Term Memory (Bi-LSTM) with a circulant interaction mechanism is proposed for low-level features, while a two-stream residual cross-modal Transformer block is applied when high-level features are involved. Finally, we choose self-attention blocks for fusion and a fully connected layer to make predictions. To evaluate the performance of our proposed model, comprehensive experiments are conducted on three widely used benchmark datasets, including IEMOCAP, MELD, and CMU-MOSEI. The competitive results verify the effectiveness of our approach.
Keywords: speech emotion recognition; self-supervised embedding model; cross-modal transformer; self-attention
Developing TTS Software Based on Microsoft Speech SDK in VC++ (Cited by 1)
10
Authors: 赵常寿, 吴红权, 张玉忠. 《电脑编程技巧与维护》, 2013, No. 19, pp. 13-18 (6 pages)
Based on the SAPI functions provided by the Microsoft Speech SDK, a text-to-speech program is written in VC++. The implementation code is given, covering reading text aloud and saving the synthesized speech as a WAV file.
Keywords: TTS software; SAPI functions; ISpVoice interface; voice library
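Illustrative sketch: the paper's implementation is in VC++ against the ISpVoice COM interface; for consistency with the other sketches here, the same SAPI automation objects are driven from Python via pywin32 (Windows only). The file name and text are placeholders, not the paper's code.

```python
# Read text aloud and render it to a WAV file through SAPI COM automation.
import win32com.client

def speak(text: str) -> None:
    """Speak through the default audio device (SAPI SpVoice)."""
    win32com.client.Dispatch("SAPI.SpVoice").Speak(text)

def save_to_wav(text: str, wav_path: str) -> None:
    """Render the same text to a WAV file via SpFileStream."""
    voice = win32com.client.Dispatch("SAPI.SpVoice")
    stream = win32com.client.Dispatch("SAPI.SpFileStream")
    stream.Open(wav_path, 3)            # 3 = SSFMCreateForWrite
    voice.AudioOutputStream = stream
    voice.Speak(text)
    stream.Close()

if __name__ == "__main__":
    speak("Hello from SAPI.")
    save_to_wav("Hello from SAPI.", "demo.wav")
```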
Age-related hearing loss accelerates the decline in fast speech comprehension and the decompensation of cortical network connections (Cited by 1)
11
Authors: He-Mei Huang, Gui-Sheng Chen, Zhong-Yi Liu, Qing-Lin Meng, Jia-Hong Li, Han-Wen Dong, Yu-Chen Chen, Fei Zhao, Xiao-Wu Tang, Jin-Liang Gao, Xi-Ming Chen, Yue-Xin Cai, Yi-Qing Zheng. 《Neural Regeneration Research》 (SCIE, CAS, CSCD), 2023, No. 9, pp. 1968-1975 (8 pages)
Patients with age-related hearing loss face hearing difficulties in daily life. The causes of age-related hearing loss are complex and include changes in peripheral hearing, central processing, and cognitive-related abilities. Furthermore, the factors by which aging relates to hearing loss via changes in auditory processing ability are still unclear. In this cross-sectional study, we evaluated 27 older adults (over 60 years old) with age-related hearing loss, 21 older adults (over 60 years old) with normal hearing, and 30 younger subjects (18-30 years old) with normal hearing. We used the outcome of the upper-threshold test, including the time-compressed threshold and the speech recognition threshold in noisy conditions, as a behavioral indicator of auditory processing ability. We also used electroencephalography to identify presbycusis-related abnormalities in the brain while the participants were in a spontaneous resting state. The time-compressed threshold and speech recognition threshold data indicated significant differences among the groups. In patients with age-related hearing loss, information masking (babble noise) had a greater effect than energy masking (speech-shaped noise) on processing difficulties. In terms of resting-state electroencephalography signals, we observed enhanced frontal lobe (Brodmann's area, BA11) activation in the older adults with normal hearing compared with the younger participants with normal hearing, and greater activation in the parietal (BA7) and occipital (BA19) lobes in the individuals with age-related hearing loss compared with the younger adults. Our functional connection analysis suggested that, compared with younger people, the older adults with normal hearing exhibited enhanced connections among networks, including the default mode network, sensorimotor network, cingulo-opercular network, occipital network, and frontoparietal network. These results suggest that both normal aging and the development of age-related hearing loss have a negative effect on advanced auditory processing capabilities and that hearing loss accelerates the decline in speech comprehension, especially in speech competition situations. Older adults with normal hearing may have increased compensatory attentional resource recruitment represented by the top-down active listening mechanism, while those with age-related hearing loss exhibit decompensation of the network connections involving multisensory integration.
Keywords: age-related hearing loss; aging; electroencephalography; fast-speech comprehension; functional brain network; functional connectivity; resting state; sLORETA; source analysis; speech reception threshold
Speech Recognition via CTC-CNN Model
12
Authors: Wen-Tsai Sung, Hao-Wei Kang, Sung-Jung Hsiao. 《Computers, Materials & Continua》 (SCIE, EI), 2023, No. 9, pp. 3833-3858 (26 pages)
In a speech recognition system, the acoustic model is an important underlying model, and its accuracy directly affects the performance of the entire system. This paper introduces the construction and training process of the acoustic model in detail and studies the Connectionist Temporal Classification (CTC) algorithm, which plays an important role in the end-to-end framework, establishing a convolutional neural network (CNN) combined with a CTC acoustic model to improve the accuracy of speech recognition. This study uses a sound sensor, the ReSpeaker Mic Array v2.0.1, to convert the collected speech signals into text or corresponding speech signals to improve communication and reduce noise and hardware interference. The baseline acoustic model in this study faces challenges such as long training time, a high error rate, and a certain degree of overfitting. The model is trained through continuous design and improvement of the relevant parameters of the acoustic model, and finally an excellent model is selected according to the evaluation indices, reducing the error rate to about 18% and thus improving the accuracy rate. Comparative verification was then carried out on the selection of acoustic feature parameters, the selection of modeling units, and the speaker's speech rate, which further verified the excellent performance of the CTC-CNN_5+BN+Residual model structure. For the experiments, the THCHS-30 and ST-CMDS speech data sets were used to train and verify the CTC-CNN baseline acoustic model; after 54 epochs of training, the word error rate on the training set was 31% and the word error rate on the test set stabilized at about 43%. The experiments also considered the surrounding environmental noise: at a noise level of 80-90 dB, the accuracy rate was 88.18%, the worst performance among all levels, whereas at 40-60 dB the accuracy reached 97.33% due to less noise pollution.
Keywords: artificial intelligence; speech recognition; speech to text; convolutional neural network; automatic speech recognition
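Illustrative sketch (not the authors' CTC-CNN_5+BN+Residual network): how a small CNN acoustic model can be trained with CTC loss, the technique named above, using PyTorch's built-in nn.CTCLoss. Shapes, class counts, and the fake batch are assumptions.

```python
# CNN acoustic model trained with CTC loss over per-frame log-probabilities.
import torch
import torch.nn as nn

class TinyCTCAcousticModel(nn.Module):
    def __init__(self, n_mels: int = 80, n_classes: int = 100):   # class 0 is the CTC blank
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=3, padding=1),
            nn.BatchNorm1d(128),
            nn.ReLU(),
        )
        self.out = nn.Linear(128, n_classes)

    def forward(self, x):                       # x: (batch, n_mels, time)
        h = self.conv(x).transpose(1, 2)        # -> (batch, time, 128)
        return self.out(h).log_softmax(dim=-1)  # per-frame log-probabilities

model = TinyCTCAcousticModel()
ctc = nn.CTCLoss(blank=0)
feats = torch.randn(4, 80, 200)                 # fake batch of spectrogram features
log_probs = model(feats).transpose(0, 1)        # CTCLoss expects (time, batch, classes)
targets = torch.randint(1, 100, (4, 20))        # fake label sequences (no blanks)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 200, dtype=torch.long),
           target_lengths=torch.full((4,), 20, dtype=torch.long))
loss.backward()
```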
Investigation of hearing aid users' speech understanding in noise and their spectral-temporal resolution skills
13
Authors: Mert Kılıç, Eyyup Kara. 《Journal of Otology》 (CAS, CSCD), 2023, No. 3, pp. 146-151 (6 pages)
Purpose: Our study aims to compare speech understanding in noise and spectral-temporal resolution skills with regard to the degree of hearing loss, age, hearing aid use experience, and gender of hearing aid users. Methods: Our study included sixty-eight hearing aid users aged between 40-70 years, with bilateral mild and moderate symmetrical sensorineural hearing loss. The random gap detection test, the Turkish matrix test, and the spectral-temporally modulated ripple test were administered to the participants with bilateral hearing aids. The acquired test results were compared statistically according to different variables, and the correlations were examined. Results: No statistically significant differences were observed in speech-in-noise recognition or spectral-temporal resolution between older and younger adult hearing aid users (p>0.05). No statistically significant difference among the test outcomes was found with regard to different degrees of hearing loss (p>0.05). Higher performance in temporal resolution was obtained in male participants and participants with more hearing aid use experience (p<0.05). Significant correlations were obtained between the results of the speech-in-noise recognition, temporal resolution, and spectral resolution tests performed with hearing aids (p<0.05). Conclusion: Our findings emphasize the importance of regular hearing aid use and show that some auditory skills can be improved with hearing aids. The observed correlations among the speech-in-noise recognition, temporal resolution, and spectral resolution tests reveal that these skills should be evaluated as a whole to maximize the patient's communication abilities.
Keywords: hearing aids; speech in noise; spectral resolution; speech intelligibility; temporal resolution
Improving Speech Enhancement Framework via Deep Learning
14
Authors: Sung-Jung Hsiao, Wen-Tsai Sung. 《Computers, Materials & Continua》 (SCIE, EI), 2023, No. 5, pp. 3817-3832 (16 pages)
Speech plays an extremely important role in social activities. Many individuals suffer from a "speech barrier," which limits their communication with others. In this study, an improved speech recognition method is proposed that addresses the needs of speech-impaired and deaf individuals. A basic improved connectionist temporal classification convolutional neural network (CTC-CNN) architecture acoustic model was constructed by combining a speech database with a deep neural network. Acoustic sensors were used to convert the collected voice signals into text or corresponding voice signals to improve communication. The method can be extended to modern artificial intelligence techniques, with multiple applications such as meeting minutes, medical reports, and verbatim records for cars, sales, and so on. For the experiments, a modified CTC-CNN was used to train an acoustic model, which showed better performance than the earlier common algorithms. A CTC-CNN baseline acoustic model was thus constructed and optimized, which reduced the error rate to about 18% and improved the accuracy rate.
Keywords: artificial intelligence; speech recognition; speech to text; CTC-CNN
Emotional Vietnamese Speech Synthesis Using Style-Transfer Learning
15
Authors: Thanh X. Le, An T. Le, Quang H. Nguyen. 《Computer Systems Science & Engineering》 (SCIE, EI), 2023, No. 2, pp. 1263-1278 (16 pages)
In recent years, speech synthesis systems have allowed for the production of very high-quality voices. Therefore, research in this domain is now turning to the problem of integrating emotions into speech. However, the method of constructing a separate speech synthesizer for each emotion has some limitations. First, this method often requires an emotional-speech data set with many sentences; such data sets are very time-intensive and labor-intensive to complete. Second, training each of these models requires computers with large computational capabilities and a lot of effort and time for model tuning. In addition, each per-emotion model fails to take advantage of the data sets of other emotions. In this paper, we propose a new method to synthesize emotional speech in which the latent expressions of emotions are learned from a small data set of professional actors through a Flowtron model. In addition, we provide a new method to build a speech corpus that is scalable and whose quality is easy to control. Next, to produce a high-quality speech synthesis model, we used this data set to train the Tacotron 2 model and then used it as a pre-trained model to train the Flowtron model. We applied this method to synthesize Vietnamese speech with sadness and happiness. Mean opinion score (MOS) assessment results show an MOS of 3.61 for sadness and 3.95 for happiness. In conclusion, the proposed method proves to be more effective for a high degree of automation and fast emotional sentence generation, using a small emotional-speech data set.
Keywords: emotional speech synthesis; Flowtron; speech synthesis; style transfer; Vietnamese speech
A New Speech Encoder Based on Dynamic Framing Approach
16
Authors: Renyuan Liu, Jian Yang, Xiaobing Zhou, Xiaoguang Yue. 《Computer Modeling in Engineering & Sciences》 (SCIE, EI), 2023, No. 8, pp. 1259-1276 (18 pages)
Latent information is difficult to get from the text in speech synthesis. Studies show that features from speech can provide more information to help text encoding. In the field of speech encoding, a lot of work has been conducted on two aspects: the first is to encode speech frame by frame; the second is to encode the whole speech into a vector. But the scale in both aspects is fixed, so encoding speech with an adjustable scale for more latent information is worth investigating. Current alignment approaches, however, only support frame-by-frame encoding and speech-to-vector encoding, and it remains a challenge to propose a new alignment approach that supports adjustable-scale speech encoding. This paper presents a dynamic speech encoder with a new alignment approach in conjunction with frame-by-frame encoding and speech-to-vector encoding. The speech feature from our model achieves three functions. First, the speech feature can reconstruct the original speech while the length of the speech feature equals the text length. Second, our model can obtain text embedding from speech, and the encoded speech feature is similar to the text embedding result. Finally, it can transfer the style of synthesized speech and make it more similar to the given reference speech.
Keywords: speech synthesis; dynamic framing; convolution network; speech encoding
Real-Time Speech Enhancement Based on Convolutional Recurrent Neural Network
17
Authors: S. Girirajan, A. Pandian. 《Intelligent Automation & Soft Computing》 (SCIE), 2023, No. 2, pp. 1987-2001 (15 pages)
Speech enhancement is the task of taking a noisy speech input and producing an enhanced speech output. In recent years, the need for speech enhancement has increased due to challenges that occur in various applications such as hearing aids, Automatic Speech Recognition (ASR), and mobile speech communication systems. Most speech enhancement research has been carried out for English, Chinese, and other European languages; only a few research works involve speech enhancement in Indian regional languages. In this paper, we propose a two-fold architecture based on a convolutional recurrent neural network (CRN) to perform speech enhancement for the Tamil speech signal, addressing enhancement in a real-time single channel or track of sound created by the speaker. In the first stage, a mask-based long short-term memory (LSTM) is used for noise suppression along with a loss function, and in the second stage, a Convolutional Encoder-Decoder (CED) is used for speech restoration. The proposed model is evaluated on various speakers and noisy environments, including babble noise, car noise, and white Gaussian noise. The proposed CRN model improves speech quality by 0.1 points when compared with the LSTM base model and also requires fewer parameters for training. The performance of the proposed model is outstanding even at low Signal-to-Noise Ratio (SNR).
Keywords: speech enhancement; convolutional encoder-decoder; long short-term memory; noise suppression; speech restoration
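Illustrative sketch (not the paper's CRN): the first-stage idea above, masking the noisy time-frequency representation and resynthesizing the waveform, shown with a trivial spectral-floor mask standing in for the learned LSTM mask estimator; the frame sizes and noise-floor percentile are assumptions.

```python
# Mask-based enhancement skeleton: STFT -> mask -> inverse STFT.
import numpy as np
import librosa

def enhance(noisy: np.ndarray, n_fft: int = 512, hop: int = 128) -> np.ndarray:
    spec = librosa.stft(noisy, n_fft=n_fft, hop_length=hop)
    mag = np.abs(spec)
    noise_floor = np.percentile(mag, 20, axis=1, keepdims=True)  # crude per-band noise estimate
    mask = np.clip(1.0 - noise_floor / (mag + 1e-8), 0.0, 1.0)   # stand-in for the learned LSTM mask
    return librosa.istft(spec * mask, hop_length=hop, length=len(noisy))
```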
Improved Ant Lion Optimizer with Deep Learning Driven Arabic Hate Speech Detection
18
Authors: Abdelwahed Motwakel, Badriyya B. Al-onazi, Jaber S. Alzahrani, Sana Alazwari, Mahmoud Othman, Abu Sarwar Zamani, Ishfaq Yaseen, Amgad Atta Abdelmageed. 《Computer Systems Science & Engineering》 (SCIE, EI), 2023, No. 9, pp. 3321-3338 (18 pages)
Arabic is the world's first language, categorized by its rich and complicated grammatical formats. Furthermore, Arabic morphology can be perplexing because nearly 10,000 roots and 900 patterns form the basis for verbs and nouns. The Arabic language consists of distinct variations utilized in a community and in particular situations. Social media sites are a medium for expressing opinions and social phenomena like racism, hatred, offensive language, and all kinds of verbal violence. Such conduct does not impact particular nations, communities, or groups only; it extends beyond such areas into people's everyday lives. This study introduces an Improved Ant Lion Optimizer with Deep Learning Driven Offensive and Hate Speech Detection (IALODL-OHSD) model on Arabic cross-corpora. The presented IALODL-OHSD model mainly aims to detect and classify offensive/hate speech expressed on social media. In the IALODL-OHSD model, a three-stage process is performed, namely pre-processing, word embedding, and classification. Primarily, data pre-processing is performed to transform the Arabic social media text into a useful format. In addition, the word2vec word embedding process is utilized to produce word embeddings. The attention-based cascaded long short-term memory (ACLSTM) model is utilized for the classification process. Finally, the IALO algorithm is exploited as a hyperparameter optimizer to boost the classifier results. To illustrate a brief result analysis of the IALODL-OHSD model, a detailed set of simulations was performed. The extensive comparison study portrayed the enhanced performance of the IALODL-OHSD model over other approaches.
Keywords: hate speech; offensive speech; Arabic corpora; natural language processing; social networks
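Illustrative sketch (not the authors' configuration): the word-embedding stage described above can be prototyped by training word2vec on tokenized social-media text and averaging the word vectors of each post into a feature vector for a downstream classifier. Uses gensim; the corpus, vector size, and averaging strategy are assumptions.

```python
# word2vec embedding stage: train on tokenized posts, then average vectors per post.
import numpy as np
from gensim.models import Word2Vec

posts = [["هذا", "مثال"], ["another", "tokenized", "post"]]   # placeholder tokenized corpus
w2v = Word2Vec(sentences=posts, vector_size=100, window=5, min_count=1, sg=1)

def embed_post(tokens, model, dim: int = 100) -> np.ndarray:
    """Average the word vectors of one post; unknown words are skipped."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X = np.stack([embed_post(p, w2v) for p in posts])  # features for the classification stage
```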
Design of Hierarchical Classifier to Improve Speech Emotion Recognition
19
Author: P. Vasuki. 《Computer Systems Science & Engineering》 (SCIE, EI), 2023, No. 1, pp. 19-33 (15 pages)
Automatic Speech Emotion Recognition (SER) is used to recognize emotion from speech automatically. Speech emotion recognition works well in a laboratory environment, but real-time emotion recognition is influenced by variations in the gender, age, and cultural and acoustical background of the speaker. The acoustical resemblance between emotional expressions further increases the complexity of recognition. Many recent research works concentrate on addressing these effects individually. Instead of addressing every influencing attribute individually, we would like to design a system that reduces the effect arising from any factor. We propose a two-level hierarchical classifier named Interpreter of Responses (IR). The first level of IR has been realized using Support Vector Machine (SVM) and Gaussian Mixture Model (GMM) classifiers. In the second level of IR, a discriminative SVM classifier has been trained and tested with the meta information of the first-level classifiers along with the input acoustical feature vector used in the primary classifiers. To train the system with a corpus of versatile nature, an integrated emotion corpus has been composed using emotion samples from five speech corpora, namely EMO-DB, IITKGP-SESC, the SAVEE corpus, the Spanish emotion corpus, and CMU's Woogle corpus. The hierarchical classifier has been trained and tested using MFCCs and Low-Level Descriptors (LLDs). The empirical analysis shows that the proposed classifier outperforms the traditional classifiers. The proposed ensemble design is very generic and can be adapted even when the number and nature of the features change. The first-level classifiers, GMM or SVM, may be replaced with any other learning algorithm.
Keywords: speech emotion recognition; hierarchical classifier design; ensemble; emotion speech corpora
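Illustrative sketch (a simplified stand-in, not the authors' Interpreter of Responses implementation): the two-level idea described above, where first-level SVM and GMM classifiers produce per-class scores and a second-level SVM is trained on those scores concatenated with the original acoustic features. Class labels are assumed to be 0..n_classes-1.

```python
# Two-level hierarchical ensemble: SVM + per-class GMMs feed a meta-level SVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.mixture import GaussianMixture

def fit_first_level(X, y, n_classes):
    svm = SVC(probability=True).fit(X, y)
    gmms = [GaussianMixture(n_components=2).fit(X[y == c]) for c in range(n_classes)]
    return svm, gmms

def meta_features(X, svm, gmms):
    svm_scores = svm.predict_proba(X)                                 # (n, n_classes)
    gmm_scores = np.column_stack([g.score_samples(X) for g in gmms])  # per-class log-likelihoods
    return np.hstack([X, svm_scores, gmm_scores])                     # meta info + acoustic features

def fit_second_level(X, y, n_classes):
    svm, gmms = fit_first_level(X, y, n_classes)
    meta_svm = SVC().fit(meta_features(X, svm, gmms), y)
    return svm, gmms, meta_svm
```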
An Intervention Study of Language Cognition and Emotional Speech Community Method for Children’s Speech Disorders
20
Author: Yali Qiang. 《International Journal of Mental Health Promotion》, 2023, No. 5, pp. 627-637 (11 pages)
Speech disorders are a common type of childhood disease. Through experimental intervention, this study aims to improve the vocabulary comprehension levels and language ability of children with speech disorders using the language cognition and emotional speech community method, and we conduct a statistical analysis of the interventional effect. Among children with speech disorders in Dongguan City, 224 were selected and grouped according to their receptive language ability and IQ. The 112 children in the experimental group (EG) received speech therapy with the language cognition and emotional speech community method, while the 112 children in the control group (CG) received only conventional treatment. After six months of experimental intervention, the Peabody Picture Vocabulary Test-Revised (PPVT-R) was used to test the language ability of the two groups. Overall, we employed a quantitative approach to obtain numerical values, examine the identified variables, and test hypotheses. Furthermore, we used descriptive statistics to explore the research questions related to the study and to describe the overall distribution of the demographic variables. The statistical t-test was used to analyze the data. The data show that after the intervention, the PPVT-R score of the EG was significantly higher than that of the CG; therefore, there is a significant difference in language ability between the EG and CG after the therapy. Although both groups improved, the post-therapy language level of the EG is significantly higher than that of the CG. The total effective rate in the EG is also higher than in the CG, and the difference is statistically significant (p<0.05). We therefore conclude that the language cognition and emotional speech community method is effective as an interventional treatment for children's speech disorders and is more effective than traditional treatment methods.
Keywords: language cognition and emotion; speech community; children's speech disorder
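Illustrative sketch (not the study's data or analysis script): the comparison above rests on an independent-samples t-test between the EG and CG PPVT-R scores. A minimal version with SciPy; the score arrays are fabricated placeholders.

```python
# Independent-samples t-test on PPVT-R scores of the two groups.
import numpy as np
from scipy import stats

eg_scores = np.array([78, 85, 81, 90, 76, 88])   # placeholder EG PPVT-R scores
cg_scores = np.array([70, 74, 69, 80, 72, 71])   # placeholder CG PPVT-R scores

t_stat, p_value = stats.ttest_ind(eg_scores, cg_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")     # p < 0.05 indicates a significant difference
```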