Journal Articles
623 articles found
1. Emotional Vietnamese Speech Synthesis Using Style-Transfer Learning
Authors: Thanh X. Le, An T. Le, Quang H. Nguyen. Computer Systems Science & Engineering (SCIE, EI), 2023, Issue 2, pp. 1263-1278.
In recent years, speech synthesis systems have allowed for the production of very high-quality voices, so research in this domain is now turning to the problem of integrating emotions into speech. However, building a separate synthesizer for each emotion has some limitations. First, it often requires an emotional-speech data set with many sentences, and such data sets are very time-intensive and labor-intensive to complete. Second, training each of these models requires computers with large computational capabilities and much effort and time for model tuning. In addition, each per-emotion model fails to take advantage of the data sets of other emotions. In this paper, we propose a new method to synthesize emotional speech in which the latent expressions of emotions are learned from a small data set of professional actors through a Flowtron model. We also provide a new method to build a speech corpus that is scalable and whose quality is easy to control. Next, to produce a high-quality speech synthesis model, we used this data set to train a Tacotron 2 model, which we then used as a pre-trained model to train the Flowtron model. We applied this method to synthesize Vietnamese speech with sadness and happiness. Mean opinion score (MOS) assessment results are 3.61 for sadness and 3.95 for happiness. In conclusion, the proposed method proves more effective for a high degree of automation and fast emotional sentence generation using a small emotional-speech data set.
Keywords: emotional speech synthesis; Flowtron; speech synthesis; style transfer; Vietnamese speech
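As a rough illustration of the latent style transfer described above, the sketch below fits a Gaussian region to latent vectors extracted from a handful of emotional recordings and samples from it at inference time, which is the usual Flowtron recipe. `flow_forward` and `flow_inverse` are hypothetical placeholders for a trained model's invertible mapping, and the vector dimension is an assumption.

```python
# Minimal NumPy sketch of Flowtron-style latent emotion sampling.
import numpy as np

def emotion_prior(z_vectors, sigma_scale=0.5):
    """Fit a diagonal Gaussian over z vectors extracted from emotional speech."""
    z = np.stack(z_vectors)                 # (n_utterances, z_dim)
    return z.mean(axis=0), z.std(axis=0) * sigma_scale

def sample_emotional_z(mean, std, rng=None):
    """Draw one latent vector near the emotion's region of the prior."""
    rng = rng or np.random.default_rng()
    return rng.normal(mean, std)

rng = np.random.default_rng(0)
z_refs = [rng.standard_normal(80) for _ in range(10)]  # stand-ins for flow_forward(mel)
mean, std = emotion_prior(z_refs)
z_new = sample_emotional_z(mean, std, rng)
# mel_out = flow_inverse(z_new, text)  # hypothetical: run the inverse flow
```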
2. A HMM-based Mandarin Chinese Singing Voice Synthesis System [Cited by 4]
Authors: Xian Li, Zengfu Wang. IEEE/CAA Journal of Automatica Sinica (SCIE, EI), 2016, Issue 2, pp. 192-202.
We propose a Mandarin Chinese singing voice synthesis system based on hidden Markov model (HMM) speech synthesis. A Mandarin Chinese singing voice corpus is recorded and musical contextual features are carefully designed for training. F0 and spectrum of the singing voice are simultaneously modeled with context-dependent HMMs. A new problem arises: the F0 of singing voice is always sparse because of the large amount of context (tempo and pitch of note, key, time signature, etc.), so features that hardly ever appear in the training data cannot be well modeled. To address this problem, the difference between the F0 of the singing voice and that of the musical score (DF0) is modeled with a single Viterbi training. To overcome the over-smoothing of the generated F0 contour, a syllable-level F0 model based on discrete cosine transforms (DCT) is applied, and the F0 contour is generated by integrating the two-level statistical models. The experimental results demonstrate that the proposed system outperforms the baseline system in both objective and subjective evaluations: it generates a more natural F0 contour, and the syllable-level F0 model makes the singing voice more expressive.
Keywords: cosine transforms; hidden Markov models; Markov processes; speech synthesis
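The syllable-level DCT idea can be sketched directly: a syllable's log-F0 contour is reduced to a few DCT coefficients (which the statistical model would predict) and a smooth contour is recovered by the inverse transform. The coefficient count of 8 is an assumption.

```python
# Sketch of syllable-level F0 compression/reconstruction with the DCT.
import numpy as np
from scipy.fft import dct, idct

def f0_to_dct(f0_contour, n_coef=8):
    """Compress one syllable's log-F0 contour to its first DCT coefficients."""
    coefs = dct(np.log(f0_contour), norm="ortho")
    return coefs[:n_coef]

def dct_to_f0(coefs, n_frames):
    """Reconstruct a smooth F0 contour of n_frames from truncated coefficients."""
    full = np.zeros(n_frames)
    full[:len(coefs)] = coefs
    return np.exp(idct(full, norm="ortho"))

# Example: a rising 40-frame syllable survives heavy truncation almost intact.
f0 = np.linspace(200.0, 260.0, 40)
rec = dct_to_f0(f0_to_dct(f0), 40)
```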
3. HMM-Based Photo-Realistic Talking Face Synthesis Using Facial Expression Parameter Mapping with Deep Neural Networks
Authors: Kazuki Sato, Takashi Nose, Akinori Ito. Journal of Computer and Communications, 2017, Issue 10, pp. 50-65.
This paper proposes a technique for synthesizing a pixel-based photo-realistic talking face animation using two-step synthesis with HMMs and DNNs. We introduce facial expression parameters as an intermediate representation that has a good correspondence with both the input contexts and the output pixel data of face images. The sequences of facial expression parameters are modeled using context-dependent HMMs with static and dynamic features. The mapping from the expression parameters to the target pixel images is trained using DNNs. We examine the required amount of training data for the HMMs and DNNs and compare the performance of the proposed technique with the conventional PCA-based technique through objective and subjective evaluation experiments.
Keywords: visual speech synthesis; talking head; hidden Markov models (HMMs); deep neural networks (DNNs); facial expression parameter
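A minimal PyTorch sketch of the second synthesis step, mapping facial expression parameters to face pixels with a feed-forward DNN. The layer sizes, the 30-parameter input, and the 64x64 grayscale output are illustrative assumptions, not the paper's configuration.

```python
# Sketch: DNN mapping from expression parameters to pixel images.
import torch
import torch.nn as nn

class ExpressionToPixels(nn.Module):
    def __init__(self, n_params=30, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(
            nn.Linear(n_params, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, img_size * img_size), nn.Sigmoid(),  # pixels in [0, 1]
        )

    def forward(self, expr_params):
        # expr_params: (batch, n_params) -> (batch, 1, H, W) face images
        out = self.net(expr_params)
        return out.view(-1, 1, self.img_size, self.img_size)

model = ExpressionToPixels()
faces = model(torch.randn(4, 30))  # four synthetic parameter vectors
```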
4. Application of Cochlear Model in Speech Analysis/Synthesis Using Sinusoidal Representation [Cited by 1]
Authors: Yuan Jingxian, Wan Wanggen, Yu Xiaoqing (School of Communication & Information Engineering, Shanghai University). Advances in Manufacturing (SCIE, CAS), 1999, Issue 1, pp. 47-52.
A sinusoidal representation of speech and a cochlear model are used to extract speech parameters in this paper, and a speech analysis/synthesis system controlled by the auditory spectrum is developed with the model. Computer simulation shows that speech can be synthesized with only 12 parameters per frame on average. The method has the advantages of few parameters, low complexity and high performance of speech representation, and the synthetic speech has high intelligibility.
Keywords: speech analysis/synthesis; sinusoidal representation; cochlear model; auditory spectrum
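A small NumPy sketch of frame-based sinusoidal synthesis in the spirit of the system described: each frame is a short sum of sinusoids given by (frequency, amplitude) pairs, and neighboring frames are cross-faded by windowed overlap-add. The exact parameter layout of the paper is not reproduced here.

```python
# Sketch: sum-of-sinusoids frame synthesis with Hann-windowed overlap-add.
import numpy as np

def synth_frame(freqs, amps, n, fs=8000):
    """Render one frame as a sum of sinusoids."""
    t = np.arange(n) / fs
    return sum(a * np.sin(2 * np.pi * f * t) for f, a in zip(freqs, amps))

def overlap_add(frames, hop):
    """Cross-fade consecutive frames with a Hann window."""
    n = len(frames[0])
    out = np.zeros(hop * (len(frames) - 1) + n)
    win = np.hanning(n)
    for i, fr in enumerate(frames):
        out[i * hop : i * hop + n] += fr * win
    return out

# Two frames with a handful of sinusoids each, in line with the paper's
# very compact per-frame parameterization.
f1 = synth_frame([220, 440, 660], [0.5, 0.3, 0.1], 256)
f2 = synth_frame([230, 460, 690], [0.5, 0.3, 0.1], 256)
audio = overlap_add([f1, f2], hop=128)
```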
5. Emotional Speech Synthesis Based on Prosodic Feature Modification [Cited by 2]
Authors: Ling He, Hua Huang, Margaret Lech. Engineering (Scientific Research Publishing), 2013, Issue 10, pp. 73-77.
The synthesis of emotional speech has wide applications in human-computer interaction, medicine, industry and other fields. In this work, an emotional speech synthesis system is proposed based on prosodic feature modification and the Time-Domain Pitch-Synchronous OverLap-Add (TD-PSOLA) waveform concatenation algorithm. The system produces synthesized speech with four types of emotion: angry, happy, sad and bored. The experimental results show that the proposed system achieves good performance: the produced utterances present clear emotional expression, and the subjective test reaches high classification accuracy for the different types of synthesized emotional utterances.
Keywords: emotional speech synthesis; prosodic features; time-domain pitch-synchronous overlap-add
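A simplified TD-PSOLA sketch, assuming pitch marks are already known: two-period Hann-windowed grains are extracted at each mark and re-placed at a scaled spacing, raising or lowering F0. Real implementations also modify duration and handle unvoiced segments; this is illustration only.

```python
# Sketch: pitch modification with pitch-synchronous overlap-add.
import numpy as np

def psola_pitch_shift(x, marks, factor):
    """Shift pitch by `factor` (>1 raises F0) using two-period grains."""
    out = np.zeros_like(x, dtype=float)
    pos = float(marks[0])
    for i in range(1, len(marks) - 1):
        period = marks[i + 1] - marks[i]
        lo, hi = marks[i] - period, marks[i] + period
        if lo < 0 or hi > len(x):
            continue
        grain = x[lo:hi] * np.hanning(hi - lo)      # two pitch periods
        start = int(pos) - period
        if 0 <= start and start + len(grain) <= len(out):
            out[start:start + len(grain)] += grain
        pos += period / factor                      # new marks are closer/farther
    return out

# Usage: a 100 Hz tone at 8 kHz with marks every 80 samples, shifted to ~125 Hz.
fs = 8000
x = np.sin(2 * np.pi * 100 * np.arange(fs) / fs)
y = psola_pitch_shift(x, np.arange(0, fs, 80), factor=1.25)
```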
6. Resources for Development of Hindi Speech Synthesis System: An Overview
Authors: Archana Balyan. Open Journal of Applied Sciences, 2017, Issue 6, pp. 233-241.
Most information in the digital world is accessible only to those who can read or understand a particular language. Speech corpus acquisition is an essential part of all spoken-technology systems, and the quality and volume of the speech data in the corpus directly affect the accuracy of the system. There is considerable scope to develop speech technology systems for Hindi, which is spoken primarily in India. To achieve such an ambitious goal, the collection of a standard database is a prerequisite. This paper summarizes the Hindi corpora and lexical resources being developed by various organizations across the country.
Keywords: speech database; corpora; lexicon; speech synthesis; linguistics; natural language processing
7. An Intonation Speech Synthesis Model for Indonesian Using Pitch Pattern and Phrase Identification
Authors: Yohanes Suyanto, Subanar, Agus Harjoko, Sri Hartati. Journal of Signal and Information Processing, 2014, Issue 3, pp. 80-88.
Prosody in speech synthesis (text-to-speech) systems determines the tone, duration, and loudness of the speech sound, and intonation is the part of prosody that determines the speech tone. In Indonesian, intonation is determined by the structure of the sentence, the type of sentence, and the position of the word in the sentence. In this study, a speech synthesis model that focuses on intonation is proposed. The intonation is determined by sentence structure, the intonation patterns of example sentences, and general rules of Indonesian pronunciation. The model receives text and intonation patterns as inputs. Based on the general principles of Indonesian pronunciation, a prosody file is made. From the input text, the sentence structure is determined and then the intervals among parts of the sentence (phrases) are computed; these intervals are used to correct the durations in the initial prosody file. The frequencies in the prosody file are then corrected using the intonation patterns. The final result is a prosody file that can be pronounced by a speech engine application. Experiments using the original voice of a radio news announcer and the synthesized speech show that the peaks of F0 are determined by whichever is dominant, the general rules or the intonation patterns. A similarity test with the PESQ method gives the synthesis a score of 1.18 on the MOS-LQO scale.
Keywords: speech synthesis; PESQ; intonation; Indonesian
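The prosody file could resemble an MBROLA-style .pho listing, one phoneme per line with a duration and (position %, F0 Hz) targets that the duration and intonation corrections then rescale. This format choice and all values below are assumptions for illustration; the abstract does not specify the paper's exact file format.

```python
# Sketch: writing an MBROLA-style prosody (.pho) file with scalable
# duration and F0, the two quantities the model corrects.
def write_pho(path, phones, f0_scale=1.0, dur_scale=1.0):
    """phones: list of (phoneme, duration_ms, [(pos_pct, f0_hz), ...])."""
    with open(path, "w") as f:
        for ph, dur, targets in phones:
            pts = " ".join(f"{pos} {f0 * f0_scale:.0f}" for pos, f0 in targets)
            f.write(f"{ph} {dur * dur_scale:.0f} {pts}\n")

# "selamat" (Indonesian), with a mild phrase-final rise applied via f0_scale.
phones = [("s", 90, []), ("@", 60, [(50, 120)]), ("l", 70, []),
          ("a", 110, [(50, 130)]), ("m", 80, []), ("a", 120, [(50, 140)]),
          ("t", 90, [])]
write_pho("selamat.pho", phones, f0_scale=1.1, dur_scale=1.0)
```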
8. Prosodically Rich Speech Synthesis Interface Using Limited Data of Celebrity Voice
Authors: Takashi Nose, Taiki Kamei. Journal of Computer and Communications, 2016, Issue 16, pp. 79-94.
To enhance communication between humans and robots at home in the future, speech synthesis interfaces that can generate expressive speech are indispensable; in addition, synthesizing celebrity voices is commercially important. For these purposes, this paper proposes techniques for synthesizing natural-sounding speech with a rich prosodic personality using a limited amount of data in a text-to-speech (TTS) system. As a target speaker, we chose a well-known prime minister of Japan, Shinzo Abe, whose speeches have a distinctive prosodic personality. To synthesize natural-sounding and prosodically rich speech, accurate phrasing, robust duration prediction, and rich intonation modeling are important. To this end, we propose pause-position prediction based on conditional random fields (CRFs), phone-duration prediction using random forests, and mora-based emphasis context labeling. We examine the effectiveness of these techniques through objective and subjective evaluations.
Keywords: parametric speech synthesis; hidden Markov model (HMM); prosodic personality; prosody modeling; conditional random field (CRF); random forest; emphasis context
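Of the three techniques, the phone-duration predictor is the easiest to sketch: a random forest regressor over simple context features, here with scikit-learn on toy data. The feature set (phone identity, phrase position, accent flag) is an assumed simplification of the paper's context labels.

```python
# Sketch: random-forest phone-duration prediction on toy context features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Toy context features: [phone_id, position_in_phrase, is_accented]
X = np.column_stack([rng.integers(0, 40, 500),
                     rng.random(500),
                     rng.integers(0, 2, 500)])
# Toy durations (ms): longer when accented and near the phrase end.
y = 60 + 40 * X[:, 1] + 25 * X[:, 2] + rng.normal(0, 5, 500)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(model.predict([[12, 0.9, 1]]))  # predicted duration for one phone
```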
9. Towards Realizing Mandarin-Tibetan Bi-lingual Emotional Speech Synthesis with Mandarin Emotional Training Corpus
Authors: Peiwen Wu, Hongwu Yang, Zhenye Gan. 国际计算机前沿大会会议论文集 (Proceedings of the International Conference on Computer Frontiers), 2017, Issue 2, pp. 29-32.
This paper presents a method of hidden Markov model (HMM)-based Mandarin-Tibetan bilingual emotional speech synthesis by speaker adaptive training with a Mandarin emotional speech corpus. A one-speaker Tibetan neutral speech corpus, a multi-speaker Mandarin neutral speech corpus and a multi-speaker Mandarin emotional speech corpus are first employed to train a set of mixed-language average acoustic models of the target emotion using speaker adaptive training. A one-speaker Mandarin or Tibetan neutral speech corpus is then adopted to obtain a set of speaker-dependent acoustic models of the target emotion through the speaker adaptation transformation. Mandarin or Tibetan emotional speech is finally synthesized from the corresponding speaker-dependent acoustic models. Subjective tests show that the average emotional mean opinion score is 4.14 for Tibetan and 4.26 for Mandarin, the average mean opinion score is 4.16 for Tibetan and 4.28 for Mandarin, and the average degradation opinion score is 4.28 for Tibetan and 4.24 for Mandarin. The proposed method can therefore synthesize both Tibetan and Mandarin speech with high naturalness and emotional expressiveness using only a Mandarin emotional training corpus.
Keywords: Mandarin-Tibetan cross-lingual speech synthesis; emotional speech synthesis; hidden Markov model (HMM); speaker adaptive training
10. Control Emotion Intensity for LSTM-Based Expressive Speech Synthesis
Authors: Xiaolian Zhu, Liumeng Xue. 国际计算机前沿大会会议论文集 (Proceedings of the International Conference on Computer Frontiers), 2019, Issue 2, pp. 654-656.
Emotion is considered one of the most important factors for improving human-computer interaction interfaces. The major objective of expressive speech synthesis is to inject expressions reflecting various emotions into the synthesized speech. To model and control emotion effectively, emotion intensity is introduced into the expressive speech synthesis model so that it can generate speech conveying delicate and complicated emotional states. The system is composed of an emotion analysis module that extracts a controlling emotion-intensity vector and a speech synthesis module responsible for mapping text characters to the speech waveform. The proposed continuous "perception vector" is a data-driven way of controlling the model to synthesize speech with different emotion intensities. Compared with a system using a one-hot vector to control emotion intensity, the model using the perception vector is able to learn high-level emotion information from low-level acoustic features. In terms of controllability and flexibility, both objective and subjective evaluations demonstrate that the perception vector outperforms the one-hot vector.
Keywords: emotion intensity; expressive speech synthesis; controllable text-to-speech; neural networks
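A minimal sketch of the conditioning scheme the abstract contrasts with one-hot control: a continuous intensity ("perception") vector is broadcast over time and concatenated with the text encodings before an LSTM decoder. All dimensions are illustrative assumptions.

```python
# Sketch: LSTM acoustic decoder conditioned on a continuous intensity vector.
import torch
import torch.nn as nn

class IntensityConditionedLSTM(nn.Module):
    def __init__(self, text_dim=256, intensity_dim=8, mel_dim=80):
        super().__init__()
        self.lstm = nn.LSTM(text_dim + intensity_dim, 512, batch_first=True)
        self.proj = nn.Linear(512, mel_dim)

    def forward(self, text_enc, intensity):
        # text_enc: (B, T, text_dim); intensity: (B, intensity_dim)
        cond = intensity.unsqueeze(1).expand(-1, text_enc.size(1), -1)
        h, _ = self.lstm(torch.cat([text_enc, cond], dim=-1))
        return self.proj(h)  # (B, T, mel_dim) mel-spectrogram frames

model = IntensityConditionedLSTM()
mel = model(torch.randn(2, 50, 256), torch.rand(2, 8))  # intensities in [0, 1]
```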
11. Towards Realizing Sign Language-to-Speech Conversion by Combining Deep Learning and Statistical Parametric Speech Synthesis
Authors: Xiaochun An, Hongwu Yang, Zhenye Gan. 国际计算机前沿大会会议论文集 (Proceedings of the International Conference on Computer Frontiers), 2016, Issue 1, pp. 176-178.
This paper realizes a sign language-to-speech conversion system to solve the communication problem between healthy people and people with speech disorders. Thirty different static sign language gestures are first recognized by combining a support vector machine (SVM) with restricted Boltzmann machine (RBM)-based regulation and feedback fine-tuning of the deep model, and the text of the sign language is obtained from the recognition results. A context-dependent label is generated from the recognized text by a text analyzer. Meanwhile, a hidden Markov model (HMM)-based Mandarin-Tibetan bilingual speech synthesis system is developed using speaker adaptive training. Mandarin or Tibetan speech is then synthesized naturally from the context-dependent labels generated from the recognized sign language. Tests show that the static sign language recognition rate of the designed system reaches 93.6%, and subjective evaluation shows that the synthesized speech achieves a mean opinion score (MOS) of 4.0.
Keywords: deep learning; support vector machine; static sign language recognition; context-dependent label; hidden Markov model; Mandarin-Tibetan bilingual speech synthesis
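The recognition front end can be approximated with scikit-learn's BernoulliRBM feeding an SVC, loosely mirroring the SVM-plus-RBM combination the abstract names; the paper's feedback fine-tuning is not reproduced, and the random data stands in for hand-shape features.

```python
# Sketch: RBM feature layer + SVM classifier for static sign recognition.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((300, 64))       # toy hand-shape features in [0, 1]
y = rng.integers(0, 30, 300)    # 30 static sign classes

clf = Pipeline([
    ("rbm", BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=10,
                         random_state=0)),
    ("svm", SVC(kernel="rbf")),
]).fit(X, y)
pred = clf.predict(X[:5])       # recognized sign indices, later mapped to text
```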
12. Multimodal Expression: Synthesis of Facial Emotion, Mouth Movement and Voice
Authors: Zhang Jing, Gao Wen, Chen Xilin. Journal of Harbin Institute of Technology (New Series) (EI, CAS), 1997, Issue 2, pp. 63-71.
This paper presents a multimodal system for the synthesis of continuous voice and corresponding images of facial emotions. In the emotion synthesis, a general 2D face model is established and mapped to a particular face by locating some key points of the facial image. The edges of the eyes and mouth are approximated by Hough transformation on the proposed models, which has a significant advantage over other methods of edge extraction for facial organs, such as deformable templates. A subsystem for text-driven speech and mouth movement is built with the emotion synthesis method. The parameters for mouth movement are treated as functions of the original mouth-shape input to accommodate differences in mouth movements among persons. Speech is synthesized by wave editing, with Chinese syllables taken as the basic units to save time. Automatic transformation of mouth-shape parameters, automatic synchronization of voice and mouth movement, and real-time synthesis are the three major features of this subsystem. The present system can synthesize continuous speech consisting of words in the first and second standard Chinese word tables, together with the corresponding mouth movements.
Keywords: multimodal expression; emotion synthesis; mouth movement synthesis; speech synthesis
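A sketch of Hough-transform edge localization of the kind the paper favors over deformable templates, using OpenCV's HoughCircles on a synthetic image with two iris-like blobs. All parameter values are illustrative assumptions.

```python
# Sketch: locating circular eye features with the Hough transform.
import cv2
import numpy as np

img = np.zeros((120, 160), dtype=np.uint8)
cv2.circle(img, (50, 60), 12, 255, -1)    # stand-in left "iris"
cv2.circle(img, (110, 60), 12, 255, -1)   # stand-in right "iris"
img = cv2.medianBlur(img, 5)

circles = cv2.HoughCircles(
    img, cv2.HOUGH_GRADIENT, dp=1, minDist=30,
    param1=100, param2=15, minRadius=5, maxRadius=25,
)
if circles is not None:
    for x, y, r in np.round(circles[0]).astype(int):
        cv2.circle(img, (x, y), r, 128, 1)  # mark candidate eye contours
```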
13. Experimental Georgian Speech Synthesizer, Part 1: Structure of Synthesizer
Authors: Alexander Vashalomidze. Journal of Mathematics and System Science, 2013, Issue 6, pp. 289-300.
The term "Experimental" in the title means, that the synthesizer is constructed as tool to conduct experiments, for investigating the influence of environment of unit on sounding of it. Synthesizer as tool for testi... The term "Experimental" in the title means, that the synthesizer is constructed as tool to conduct experiments, for investigating the influence of environment of unit on sounding of it. Synthesizer as tool for testing of hypotheses and results of experiments, satisfy three conditions: independence from the selection of unit for the synthesis (word or any part of it); taking into account the environment of unit (left and right hand contexts and position of unit); independence from the content of base. Such synthesizer is a good tool for studying many aspects of speech and removes the problem of selection. We can vary the unit and other parameters, described in paper, by the same synthesizer, synthesize the same text and listen to the results directly. This paper describes the formal structure of experimental Georgian speech synthesizer. 展开更多
Keywords: speech synthesis; interchangeable units; adequate covering; optimal covering
14. A New Speech Encoder Based on Dynamic Framing Approach
Authors: Renyuan Liu, Jian Yang, Xiaobing Zhou, Xiaoguang Yue. Computer Modeling in Engineering & Sciences (SCIE, EI), 2023, Issue 8, pp. 1259-1276.
Latent information is difficult to obtain from text alone in speech synthesis; studies show that features extracted from speech provide additional information that helps text encoding. Work on speech encoding has been conducted along two lines: encoding speech frame by frame, and encoding the whole utterance into a single vector. In both cases the scale is fixed, so encoding speech at an adjustable scale to capture more latent information is worth investigating. However, current alignment approaches only support frame-by-frame and speech-to-vector encoding, and it remains a challenge to propose a new alignment approach that supports adjustable-scale speech encoding. This paper presents a dynamic speech encoder with a new alignment approach that works in conjunction with frame-by-frame and speech-to-vector encoding. The speech feature from our model achieves three functions. First, it can reconstruct the original speech while its length equals the text length. Second, our model can obtain text embeddings from speech, and the encoded speech feature is similar to the text embedding result. Finally, it can transfer the style of the synthesized speech, making it more similar to a given reference speech.
Keywords: speech synthesis; dynamic framing; convolutional network; speech encoding
15. Multi-Level Decoupled Personalized Speech Synthesis for Out-of-Domain Speaker Adaptation
Authors: Gao Shengxiang, Yang Yuanzhang, Wang Linqin, Mo Shangbin, Yu Zhengtao, Dong Ling. Journal of Guangxi Normal University (Natural Science Edition) (CAS, PKU Core), 2024, Issue 4, pp. 11-21.
Personalized speech synthesis aims to synthesize speech with the timbre of a specific speaker. When synthesizing speech for out-of-domain speakers, traditional methods show a clear timbre gap from real speech, and decoupling speaker characteristics remains difficult. This paper proposes a multi-level decoupled personalized speech synthesis method for adapting to out-of-domain speakers unseen during training, fusing features of different granularities to effectively improve zero-resource synthesis for such speakers. The method uses fast Fourier convolution to extract global speaker features, improving the model's generalization to out-of-domain speakers and achieving sentence-level speaker decoupling; it also uses a speech recognition model to decouple phoneme-level speaker features and an attention mechanism to capture phoneme-level timbre, achieving phoneme-level speaker decoupling. Experimental results on the public AISHELL3 dataset show that the method achieves a speaker-embedding cosine similarity of 0.697 for out-of-domain speakers, a 6.25% improvement over the baseline, effectively improving the modeling of out-of-domain speaker timbre.
Keywords: speech synthesis; zero-resource; speaker representation; out-of-domain speaker; feature decoupling
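The reported 0.697 is a speaker-embedding cosine similarity; a sketch of the metric itself is below, with random vectors standing in for embeddings that would come from a speaker-verification model.

```python
# Sketch: cosine similarity between speaker embeddings of synthesized
# and real speech, the objective metric cited in the abstract.
import numpy as np

def speaker_cosine_similarity(e_synth, e_real):
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(e_synth, e_real) /
                 (np.linalg.norm(e_synth) * np.linalg.norm(e_real)))

rng = np.random.default_rng(0)
e_synth, e_real = rng.standard_normal(256), rng.standard_normal(256)
print(speaker_cosine_similarity(e_synth, e_real))  # paper reports 0.697
```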
16. Speech Synthesis Based on Hierarchical Conformer
Authors: Wu Kewei, Han Chao, Sun Yongxuan, Peng Menghao, Xie Zhao. Computer Science (CSCD, PKU Core), 2024, Issue 2, pp. 161-171.
Speech synthesis converts input text into a speech signal containing phonemes, words, and sentences. Existing methods treat the sentence as a whole and struggle to accurately synthesize speech signals of different lengths. By analyzing the hierarchical relations contained in the speech signal, this paper designs a Conformer-based hierarchical text encoder and a Conformer-based hierarchical speech encoder, and proposes a speech synthesis model based on a hierarchical text-speech Conformer. First, according to the length of the input text, the model builds a hierarchical text encoder with phoneme-level, word-level, and sentence-level encoders, each describing text information at a different length and using Conformer attention to learn the relations among temporal features at that length. The hierarchical text encoder identifies the information to be emphasized at each length, enables effective feature extraction for different lengths, and alleviates the uncertainty in the duration of the synthesized signal. Second, the hierarchical speech encoder likewise has phoneme-level, word-level, and sentence-level encoders; at each level, the text features serve as the Conformer query and the speech features as the key and value, extracting the matching relation between text and speech and alleviating inaccurate synthesis of speech of different lengths. The hierarchical text-speech encoder can be flexibly embedded into a variety of existing decoders, providing more reliable synthesis through the complementarity of text and speech. Experiments on the LJSpeech and LibriTTS datasets show that the mel-cepstral distortion of the proposed method is lower than that of existing speech synthesis methods.
Keywords: speech synthesis; text encoder; speech encoder; hierarchical model; Conformer
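The text-queries-speech attention used at each level can be sketched directly with PyTorch's MultiheadAttention: text features as the query, speech features as key and value, with the attention weights playing the role of the text-speech matching relation. Dimensions are assumptions.

```python
# Sketch: cross-attention with text as query and speech as key/value.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
text_feats = torch.randn(2, 30, 256)     # (B, T_text, D), e.g. word level
speech_feats = torch.randn(2, 120, 256)  # (B, T_speech, D)

matched, weights = attn(query=text_feats, key=speech_feats, value=speech_feats)
# `matched` aligns speech information to each text position; `weights`
# (B, T_text, T_speech) is the text-speech matching relation.
```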
17. A Style-Controlled Speech Synthesis Algorithm Based on Temporal Alignment
Authors: Guo Ao, Xu Boyan, Cai Ruichu, Hao Zhifeng. Journal of Guangdong University of Technology (CAS), 2024, Issue 2, pp. 84-92.
The goal of style control in speech synthesis is to convert natural language into correspondingly expressive audio output. Transformer-based style-controlled speech synthesis improves synthesis speed while maintaining quality, but two problems remain. First, when the style reference audio and the text differ greatly in length, part of the style is missing from the synthesized audio. Second, decoding with ordinary attention is prone to repeating, skipping, and omitting content. To address these problems, this paper proposes Temporal Alignment Text-to-Speech (TATTS), which exploits temporal information in both encoding and decoding. In encoding, TATTS proposes a temporally aligned cross-attention module that jointly trains the style audio and text representations, solving the alignment of audio and text of unequal lengths. In decoding, TATTS accounts for the temporal monotonicity of audio and introduces a stepwise monotonic multi-head attention mechanism into the Transformer decoder, solving the misreading problems in synthesized audio. Compared with the baseline, TATTS improves naturalness by 3.8% and 4.8% on the LJSpeech and VCTK datasets respectively, and improves style similarity by 10% on VCTK, verifying the effectiveness of the algorithm and demonstrating its style control and transfer ability.
Keywords: speech synthesis; temporal alignment; style control; Transformer; style transfer
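A simplified sketch of the stepwise monotonic alignment recurrence: at each decoder step, alignment mass either stays at its encoder position or advances by exactly one, which is what rules out repeats and skips. The move probabilities would come from an attention energy network; random values stand in here, and boundary handling is simplified.

```python
# Sketch: stepwise monotonic alignment recurrence over decoder steps.
import torch

def stepwise_monotonic_align(p_move):
    """p_move: (B, T_dec, T_enc) probability of advancing one position."""
    B, T_dec, T_enc = p_move.shape
    alpha = torch.zeros(B, T_enc)
    alpha[:, 0] = 1.0                       # start aligned to first position
    out = []
    for t in range(T_dec):
        p = p_move[:, t]
        moved = torch.zeros_like(alpha)
        moved[:, 1:] = (alpha * p)[:, :-1]  # mass that advances by one
        alpha = alpha * (1 - p) + moved     # mass that stays + mass arriving
        out.append(alpha)
    return torch.stack(out, dim=1)          # (B, T_dec, T_enc) alignments

weights = stepwise_monotonic_align(torch.rand(2, 40, 25))
```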
18. A Non-Autoregressive Speech Synthesis Method Combining Lightweight Convolution
Authors: Zhong Qiaoxia, Zeng Bi, Lin Zhentao, Lin Wei. Computer Engineering and Design (PKU Core), 2024, Issue 4, pp. 1166-1172.
To effectively capture the relations among phonemes and synthesize prosodically rich audio, a non-autoregressive speech synthesis model combining lightweight convolution, LCTTS, is proposed. Lightweight convolution is introduced to establish connections among phonemes, addressing mispronunciation. Pitch and energy predictors are added to predict the prosody of the generated speech, addressing the lack of prosody. The trained model produces mel-spectrograms, which a pre-trained vocoder converts into audio. Experimental results show that LCTTS outperforms the previously proposed SpeedySpeech model, improving the mean opinion score by 2.8% on the Emotional Speech Database and reducing mel-cepstral distortion by 0.15.
Keywords: speech synthesis; lightweight convolution; prosody synthesis; mel-spectrogram generation; non-autoregressive method; deep learning; natural language processing
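A sketch of lightweight convolution in the common LightConv formulation (depthwise, softmax-normalized, weight-shared across channel groups), which is presumably the kind of module LCTTS uses to relate neighboring phonemes; the head count and kernel size are assumptions.

```python
# Sketch: lightweight convolution with softmax-normalized, head-shared kernels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightConv(nn.Module):
    def __init__(self, dim=256, kernel=7, heads=8):
        super().__init__()
        self.heads, self.kernel = heads, kernel
        self.weight = nn.Parameter(torch.randn(heads, kernel))

    def forward(self, x):                     # x: (B, T, dim)
        B, T, D = x.shape
        w = F.softmax(self.weight, dim=-1)    # normalized kernel per head
        w = w.repeat_interleave(D // self.heads, dim=0).unsqueeze(1)  # (D, 1, K)
        x = x.transpose(1, 2)                 # (B, D, T)
        out = F.conv1d(x, w, padding=self.kernel // 2, groups=D)
        return out.transpose(1, 2)            # (B, T, dim)

y = LightweightConv()(torch.randn(2, 50, 256))
```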
19. Multilingual Text-to-Waveform Generation Incorporating Cross-Speaker Prosody Transfer
Authors: Shang Zengqiang, Zhang Pengyuan, Wang Li. Acta Acustica (EI, CAS, CSCD, PKU Core), 2024, Issue 1, pp. 171-180.
In multilingual speech synthesis, the scarcity of multilingual data from a single speaker makes it very difficult for one voice to support synthesis in multiple languages. Unlike existing methods that only decouple timbre and pronunciation within the acoustic model, this paper proposes an end-to-end multilingual speech synthesis method incorporating cross-speaker prosody transfer: a two-level hierarchical conditional variational autoencoder directly models the generation process from text to waveform while decoupling timbre, pronunciation, and prosody. The method improves the prosody of cross-lingual synthesis by transferring the prosodic style of existing speakers of the target language. Experiments show that the model achieves mean opinion scores of 3.91 for naturalness and 4.01 for similarity in cross-lingual speech generation, and reduces the character error rate of cross-lingual synthesis to 5.85% relative to the baseline. Prosody transfer and ablation experiments further confirm the effectiveness of the method.
Keywords: multilingual speech synthesis; prosody transfer; variational autoencoder; prosody decoupling
20. Research and Application of Digital Virtual Human Interaction Technology Based on Unity3D
Authors: Li Guangya, Si Zhanjun. 印刷与数字媒体技术研究 (Printing and Digital Media Technology Research) (CAS, PKU Core), 2024, Issue 2, pp. 123-134.
Although current digital virtual human interaction technology enables basic interaction with users, problems such as language misunderstanding and a lack of emotional expressiveness remain, leaving the user's interactive experience unsatisfying. Against this background, this study first analyzes the development status and existing problems of digital virtual human technology, then explores Unity3D-based digital virtual human interaction technology and proposes a method for generating emotionally expressive speech directly from text. On this basis, the method is combined with ChatGPT language understanding and text generation, text sentiment analysis, and an improved VITS speech synthesis technique, with a Kinect 2.0 device used to simulate holographic interaction, ultimately building a digital virtual human interaction application capable of accurate understanding and simulated emotional response. The results show that the technology effectively improves the virtual human's understanding and expressive abilities, provides users with a better interactive experience, and offers a reference for the application and development of digital virtual human technology.
Keywords: digital media; artificial intelligence; media interaction; speech synthesis