19 articles found
Emotional Vietnamese Speech Synthesis Using Style-Transfer Learning
1
Authors: Thanh X. Le, An T. Le, Quang H. Nguyen. Computer Systems Science & Engineering (SCIE, EI), 2023, No. 2, pp. 1263-1278 (16 pages).
In recent years, speech synthesis systems have allowed for the production of very high-quality voices. Therefore, research in this domain is now turning to the problem of integrating emotions into speech. However, the method of constructing a speech synthesizer for each emotion has some limitations. First, this method often requires an emotional-speech data set with many sentences. Such data sets are very time-intensive and labor-intensive to complete. Second, training each of these models requires computers with large computational capabilities and a lot of effort and time for model tuning. In addition, each model for each emotion failed to take advantage of data sets of other emotions. In this paper, we propose a new method to synthesize emotional speech in which the latent expressions of emotions are learned from a small data set of professional actors through a Flowtron model. In addition, we provide a new method to build a speech corpus that is scalable and whose quality is easy to control. Next, to produce a high-quality speech synthesis model, we used this data set to train the Tacotron 2 model. We used it as a pre-trained model to train the Flowtron model. We applied this method to synthesize Vietnamese speech with sadness and happiness. Mean opinion score (MOS) assessment results show that MOS is 3.61 for sadness and 3.95 for happiness. In conclusion, the proposed method proves to be more effective for a high degree of automation and fast emotional sentence generation, using a small emotional-speech data set.
Keywords: emotional speech synthesis; Flowtron; speech synthesis; style transfer; Vietnamese speech
Prosodically Rich Speech Synthesis Interface Using Limited Data of Celebrity Voice
2
Authors: Takashi Nose, Taiki Kamei. Journal of Computer and Communications, 2016, No. 16, pp. 79-94 (16 pages).
To enhance communication between humans and robots at home in the future, speech synthesis interfaces that can generate expressive speech are indispensable. In addition, synthesizing celebrity voices is commercially important. To address these issues, this paper proposes techniques for synthesizing natural-sounding speech that has a rich prosodic personality using a limited amount of data in a text-to-speech (TTS) system. As a target speaker, we chose a well-known prime minister of Japan, Shinzo Abe, who has a good prosodic personality in his speeches. To synthesize natural-sounding and prosodically rich speech, accurate phrasing, robust duration prediction, and rich intonation modeling are important. For these purposes, we propose pause position prediction based on conditional random fields (CRFs), phone-duration prediction using random forests, and mora-based emphasis context labeling. We examine the effectiveness of the above techniques through objective and subjective evaluations.
Keywords: Parametric speech synthesis; Hidden Markov Model (HMM); Prosodic Personality; Prosody Modeling; Conditional Random Field (CRF); Random Forest; Emphasis Context
A new speech synthesis method based on the LMA vocal tract model (cited by: 2)
3
Authors: LIU Qingfeng, WANG Renhua (Department of Electronic Engineering and Information Science, University of Science & Technology of China, Hefei, Anhui 230027). Chinese Journal of Acoustics, 1998, No. 2, pp. 153-162 (10 pages).
A new speech synthesis algorithm based on the LMA filter in a Chinese text-to-speech system is introduced. Using this method, the system can not only generate speech with higher quality, but also has a more powerful ability to modify the prosodic parameters, which ensures far more natural and intelligible synthesized speech than before. First, the fundamental principles of the LMA filter and the construction of the synthesizer are presented; then, how to modify the acoustic parameters with this synthesizer is described; finally, a quantitative evaluation of the system's performance is shown in comparison with a relatively successful PSOLA synthesizer, KDTALK_1.
Keywords: LMA; A new speech synthesis method based on the LMA vocal tract model
Guidelines to assessment of speech synthesis systems for Chinese (cited by: 1)
4
Authors: ZHANG Jialu, DONG Shiwei (Institute of Acoustics, Academia Sinica, Beijing 100080). Chinese Journal of Acoustics, 1998, No. 4, pp. 289-295 (7 pages).
National assessment of speech synthesis systems for Chinese has been carried out regularly in China since 1994. New guidelines for the assessment activities, which aim to make the assessment work standardizable, (partially) automatable, and accessible to the public over a computer network, were set up in 1997. Two modules, the phonetic module and the linguistic module, are evaluated individually. The phonetic module is evaluated by using speech intelligibility tests at three levels (syllable, word, and sentence) and speech naturalness tests (in MOS). As for the linguistic module, the text processing ability, which includes word segmentation, polyphonic characters, numerals, years, symbols, and metrological units, is examined automatically.
Keywords: Guidelines to assessment of speech synthesis systems for Chinese
Efficient decoding self-attention for end-to-end speech synthesis
5
Authors: Wei ZHAO, Li XU. Frontiers of Information Technology & Electronic Engineering (SCIE, EI, CSCD), 2022, No. 7, pp. 1127-1138 (12 pages).
Self-attention has been innovatively applied to text-to-speech (TTS) because of its parallel structure and superior strength in modeling sequential data. However, when used in end-to-end speech synthesis with an autoregressive decoding scheme, its inference speed becomes relatively low due to the quadratic complexity in sequence length. This problem becomes particularly severe on devices without graphics processing units (GPUs). To alleviate the dilemma, we propose an efficient decoding self-attention (EDSA) module as an alternative. Combined with a dynamic programming decoding procedure, TTS model inference can be effectively accelerated to have a linear computation complexity. We conduct studies on Mandarin and English datasets and find that our proposed model with EDSA can achieve 720% and 50% higher inference speed on the central processing unit (CPU) and GPU respectively, with almost the same performance. Thus, this method may make the deployment of such models easier when there are limited GPU resources. In addition, our model may perform better than the baseline Transformer TTS on out-of-domain utterances.
Keywords: efficient decoding; end-to-end; self-attention; speech synthesis
A Unified Framework for Multilingual Text-to-Speech Synthesis with SSML Specification as Interface
6
Authors: 吴志勇, 曹光琦, 蒙美玲, 蔡莲红. Tsinghua Science and Technology (SCIE, EI, CAS), 2009, No. 5, pp. 623-630 (8 pages).
This paper describes the design of a unified framework for a multilingual text-to-speech (TTS) synthesis engine, Crystal. The unified framework defines the common TTS modules for different languages and/or dialects. The interfaces between consecutive modules conform to the speech synthesis markup language (SSML) specification for standardization, interoperability, multilinguality, and extensibility. Detailed module divisions and implementation technologies for the unified framework are introduced, together with possible extensions for the algorithm research and evaluation of the TTS synthesis. Implementation of a mixed-language TTS system for Chinese Putonghua, Chinese Cantonese, and English demonstrates the feasibility of the proposed unified framework.
Keywords: text-to-speech (TTS) synthesis; multilingual; unified framework; speech synthesis markup language (SSML)
Assessment methods of speech synthesis systems for Chinese
7
Authors: ZHANG Jialu, QI Shiqian, YU Ge (Institute of Acoustics, Academia Sinica, Beijing 100080). Chinese Journal of Acoustics, 1997, No. 2, pp. 97-104 (8 pages).
A national assessment of the performance of speech synthesis systems for Chinese has been carried out yearly since 1994. The quality of the synthetic speech of five different systems was evaluated and diagnosed by using speech intelligibility tests. The listeners were 16 college students (8 male, 8 female) with no experience with synthetic speech; they were asked to complete an open-response task with pencil and paper. In addition, speech naturalness was measured by Mean Opinion
Keywords: PSOLA; Assessment methods of speech synthesis systems for Chinese
A New Speech Encoder Based on Dynamic Framing Approach
8
Authors: Renyuan Liu, Jian Yang, Xiaobing Zhou, Xiaoguang Yue. Computer Modeling in Engineering & Sciences (SCIE, EI), 2023, No. 8, pp. 1259-1276 (18 pages).
Latent information is difficult to get from the text in speech synthesis. Studies show that features from speech can provide more information to help text encoding. In the field of speech encoding, a lot of work has been conducted on two aspects. The first is to encode speech frame by frame. The second is to encode the whole speech into a vector. But the scale in both is fixed, so encoding speech with an adjustable scale to capture more latent information is worthy of investigation. However, current alignment approaches only support frame-by-frame encoding and speech-to-vector encoding. It remains a challenge to propose a new alignment approach that supports adjustable-scale speech encoding. This paper presents a dynamic speech encoder with a new alignment approach in conjunction with frame-by-frame encoding and speech-to-vector encoding. The speech feature from our model achieves three functions. First, the speech feature can reconstruct the original speech while the length of the speech feature is equal to the text length. Second, our model can get text embedding from speech, and the encoded speech feature is similar to the text embedding result. Finally, it can transfer the style of synthesized speech and make it more similar to the given reference speech.
Keywords: speech synthesis; dynamic framing; convolution network; speech encoding
Application of Cochlear Model in Speech Analysis/Synthesis Using Sinusoidal Representation (cited by: 1)
9
Authors: Yuan Jingxian, Wan Wanggen, Yu Xiaoqing (School of Communication & Information Engineering, Shanghai University). Advances in Manufacturing (SCIE, CAS), 1999, No. 1, pp. 47-52 (6 pages).
A sinusoidal representation of speech and a cochlear model are used to extract speech parameters in this paper, and a speech analysis/synthesis system controlled by the auditory spectrum is developed with the model. The computer simulation shows that speech can be synthesized with only 12 parameters per frame on average. The method has the advantages of few parameters, low complexity, and high performance of speech representation. The synthetic speech has high intelligibility.
Keywords: speech analysis/synthesis; sinusoidal representation; cochlear model; auditory spectrum
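As a rough illustration of what a sinusoidal representation means, the sketch below rebuilds one frame as a sum of cosines. It is a generic textbook form, not the paper's method: the abstract does not say which 12 parameters are kept or how the cochlear model selects them, so the function name `synthesize_frame` and the (amplitude, frequency, phase) triple layout are assumptions made for illustration only.

```python
import math


def synthesize_frame(components, sample_rate, num_samples):
    """Return one frame as a sum of sinusoids.

    components: list of (amplitude, frequency_hz, phase_rad) triples.
    This layout is hypothetical; the paper's parameter set is not given.
    """
    frame = []
    for n in range(num_samples):
        t = n / sample_rate  # time of sample n in seconds
        sample = sum(
            amp * math.cos(2.0 * math.pi * freq * t + phase)
            for amp, freq, phase in components
        )
        frame.append(sample)
    return frame


if __name__ == "__main__":
    # Two hypothetical partials at 200 Hz and 400 Hz, 8 kHz sampling.
    frame = synthesize_frame([(1.0, 200.0, 0.0), (0.5, 400.0, 0.0)], 8000, 160)
    print(len(frame), frame[0])
```

With triples like these, a frame described by four sinusoids would need 12 numbers, but whether that matches the paper's parameter count is not stated in the abstract.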
A HMM-based Mandarin Chinese Singing Voice Synthesis System (cited by: 4)
10
Authors: Xian Li, Zengfu Wang. IEEE/CAA Journal of Automatica Sinica (SCIE, EI), 2016, No. 2, pp. 192-202 (11 pages).
We propose a Mandarin Chinese singing voice synthesis system in which a hidden Markov model (HMM)-based speech synthesis technique is used. A Mandarin Chinese singing voice corpus is recorded, and musical contextual features are designed for training. The F0 and spectrum of the singing voice are simultaneously modeled with context-dependent HMMs. This raises a new problem: the F0 data of the singing voice is always sparse because of the large number of contexts, i.e., tempo and pitch of note, key, time signature, etc. As a result, features that hardly ever appear in the training data cannot be modeled well. To address this problem, the difference between the F0 of the singing voice and that of the musical score (DF0) is modeled by a single Viterbi training. To overcome the over-smoothing of the generated F0 contour, a syllable-level F0 model based on discrete cosine transforms (DCT) is applied, and the F0 contour is generated by integrating two-level statistical models. The experimental results demonstrate that the proposed system outperforms the baseline system in both objective and subjective evaluations. The proposed system can generate a more natural F0 contour. Furthermore, the syllable-level F0 model can make the singing voice more expressive. © 2014 Chinese Association of Automation.
Keywords: Cosine transforms; Hidden Markov models; Markov processes; speech synthesis
Pitch models of Mandarin text-to-speech
11
Authors: 邵艳秋, 穗志方, 韩纪庆. Journal of Harbin Institute of Technology (New Series) (EI, CAS), 2009, No. 2, pp. 179-184 (6 pages).
The function of the prosody model directly affects the naturalness of synthesized speech. To address the difficulty of generating the pitch contour in the prosody model, two pitch models, namely a corpus-based pitch model and a pitch pattern model, are studied in depth in this paper. Key problems in the corpus-based model are the calculation of the distance and the search for the optimal path with a dynamic programming algorithm. For the pitch pattern model, parameters such as pitch pattern, pitch average, and pitch range are used to describe the pitch contour, and six pitch patterns are presented. For the generation of the pitch contour, the pitch pattern model is more flexible than the corpus-based model. Both models are linked to a real TTS system, and the MOS results of synthesized Mandarin speech show that the pitch pattern model is better than the corpus-based pitch model.
Keywords: speech synthesis; prosody model; pitch model; pitch pattern
Experimental Georgian Speech Synthesizer Part 1 Structure of Synthesizer
12
Author: Alexander Vashalomidze. Journal of Mathematics and System Science, 2013, No. 6, pp. 289-300 (12 pages).
The term "Experimental" in the title means that the synthesizer is constructed as a tool for conducting experiments that investigate how the environment of a unit influences its sound. As a tool for testing hypotheses and experimental results, the synthesizer satisfies three conditions: independence from the selection of the unit for synthesis (a word or any part of it); taking into account the environment of the unit (left-hand and right-hand contexts and the position of the unit); and independence from the content of the base. Such a synthesizer is a good tool for studying many aspects of speech and removes the problem of selection. With the same synthesizer, we can vary the unit and the other parameters described in the paper, synthesize the same text, and listen to the results directly. This paper describes the formal structure of the experimental Georgian speech synthesizer.
Keywords: speech synthesis; interchangeable units; adequate covering; optimal covering
A synthesis method based on speech production and articulatory model
13
Authors: YU Zhenli (Dept. of Information and Electronic Engineering, Zhejiang University, Hangzhou 310028), Ching Pak-chung (Dept. of Electronic Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong). Chinese Journal of Acoustics, 2000, No. 2, pp. 128-141 (14 pages).
A method to synthesize formant-targeted sounds based on a speech production model and a Reflection-Type Line Analog (RTLA) articulatory synthesis model is presented. The synthesis model is implemented with a scattering process derived from an RTLA of the vocal tract system according to the acoustic mechanism of speech production. The vocal-tract area function that controls the synthesis model is derived from the first three formant trajectories by using the inverse solution of speech production. The proposed method not only gives good naturalness and dynamic smoothness, but can also control or modify speech timbres easily and flexibly. Furthermore, it needs fewer control parameters and a very low parameter update rate.
Keywords: PSOLA; A synthesis method based on speech production and articulatory model
NICT/ATR Chinese-Japanese-English Speech-to-Speech Translation System (cited by: 3)
14
Authors: Tohru Shimizu, Yutaka Ashikari, Eiichiro Sumita, 张劲松, Satoshi Nakamura. Tsinghua Science and Technology (SCIE, EI, CAS), 2008, No. 4, pp. 540-544 (5 pages).
This paper describes the latest version of the Chinese-Japanese-English handheld speech-to-speech translation system developed by NICT/ATR, which is now ready to be deployed for travelers. With the entire speech-to-speech translation function implemented in one terminal, it realizes real-time, location-free speech-to-speech translation. A new noise-suppression technique notably improves the speech recognition performance. Corpus-based approaches to speech recognition, machine translation, and speech synthesis enable coverage of a wide variety of topics and portability to other languages. Test results show that the character accuracy of speech recognition is 82%-94% for Chinese speech, and that the bilingual evaluation understudy score of machine translation is 0.55-0.74 for Chinese-Japanese and Chinese-English
Keywords: speech-to-speech translation; speech recognition; speech synthesis; machine translation; large-scale corpus
A Synthesis Instance Pruning Approach Based on Virtual Non-uniform Replacements
15
Authors: 张巍, 凌震华, 胡国平, 王仁华. Tsinghua Science and Technology (SCIE, EI, CAS), 2008, No. 4, pp. 515-521 (7 pages).
The use of non-uniform processes greatly helps a corpus-based text-to-speech (TTS) system synthesize natural speech. However, tailoring a TTS voice font, or pruning redundant synthesis instances, usually results in the loss of non-uniform synthesis instances. To solve this problem, we propose the concept of virtual non-uniform instances. Based on this concept and the synthesis frequency of each instance, an algorithm named StaRp-VPA is constructed to make up for the loss of non-uniform instances. In experimental testing, the naturalness scored by the mean opinion score (MOS) remains almost unchanged when fewer than 50% of the instances are pruned, and the MOS is only slightly degraded for reduction rates above 50%. The test results show that the StaRp-VPA algorithm is effective.
Keywords: text-to-speech system; speech synthesis; synthesis instance pruning; non-uniform unit
Speech Dictation System Based on Character Recognition
16
Authors: Wenjun Lu, Yanqing Wang, Longfei Huang. 国际计算机前沿大会会议论文集, 2021, No. 1, pp. 380-392 (13 pages).
To solve students’ dictation problems, a speech dictation system based on character recognition is proposed in this paper. The system applied offline handwritten Chinese character recognition technology, denoised the image through Gaussian filter, segmented the text through projection method, and converted the image to text through OCR technology. The straight line mark in the picture was detected by Hough transform technology, and then SKB-FSS algorithm and WST algorithm were used for speech synthesis. Experiments show that the system can effectively assist students in dictation.
Keywords: Character recognition; speech synthesis; Hough transform; Feature extraction; Image preprocessing
Merge-Weighted Dynamic Time Warping for Speech Recognition (cited by: 1)
17
Authors: 张湘莉兰, 骆志刚, 李明. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2014, No. 6, pp. 1072-1082 (11 pages).
Obtaining training material for rarely used English words and common given names from countries where English is not spoken is difficult due to excessive time, storage, and cost factors. Considering personal privacy, language-independent (LI) automatic speech recognition (ASR) with lightweight speaker-dependent (SD) models is a convenient option to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR for real-time applications with limited storage and small vocabularies. These applications include voice dialing on mobile devices, menu-driven recognition, and voice control on vehicles and robotics. However, traditional DTW has several limitations, such as high computational complexity, constraint-induced coarse approximation, and inaccuracy problems. In this paper, we introduce the merge-weighted dynamic time warping (MWDTW) algorithm. This method defines a template confidence index for measuring the similarity between merged training data and testing data, while following the core DTW process. MWDTW is simple, efficient, and easy to implement. With extensive experiments on three representative SD speech recognition datasets, we demonstrate that our method significantly outperforms DTW, DTW on merged speech data, and the hidden Markov model (HMM), and is also six times faster than DTW overall.
Keywords: merge-weighted dynamic time warping; natural language processing; speech recognition and synthesis; template confidence index
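For readers unfamiliar with the baseline that this abstract builds on, the following is a minimal sketch of classic dynamic time warping between two one-dimensional sequences. It is the standard textbook recurrence, not the MWDTW algorithm: the abstract does not specify the merge weighting or the template confidence index, so neither is implemented here. The function name `dtw_distance` and the absolute-difference local cost are assumptions chosen for illustration.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two numeric sequences.

    Uses |x - y| as the local cost (an assumption) and fills an
    (len(a)+1) x (len(b)+1) table, so the running time is O(len(a) * len(b)).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


if __name__ == "__main__":
    # The repeated 2 in the second sequence is absorbed by warping.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

The quadratic table is the source of the "high computational complexity" that the abstract lists among the limitations of traditional DTW.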
Improved Grapheme-to-Phoneme Conversion for Mandarin TTS (cited by: 1)
18
Authors: 易立夫, 李健, 郝杰, 熊子瑜. Tsinghua Science and Technology (SCIE, EI, CAS), 2009, No. 5, pp. 606-611 (6 pages).
Several methods were developed to improve grapheme-to-phoneme (G2P) conversion models for Chinese text-to-speech (TTS) systems. The critical problem of data sparsity was handled by combining approaches. First, a text-selection method was designed to cover as many G2P text corpus contexts as possible. Then, various data-driven modeling methods were used with comparisons to select the best method for each polyphonic word. Finally, independent models were used for some neutral tone words in addition to the normal G2P models to achieve more compact and flexible G2P models. Tests show that these methods reduce the relative errors by 50% for both normal polyphonic words and Chinese neutral tones.
Keywords: grapheme-to-phoneme conversion; text design; Chinese neutral tone; speech synthesis
Modeling Pitch Contour of Chinese Mandarin Sentences with the PENTA Model (cited by: 1)
19
Authors: Hui Pang, Zhiyong Wu, Lianhong Cai. Tsinghua Science and Technology (EI, CAS), 2012, No. 2, pp. 218-224 (7 pages).
In continuous speech, the pitch contour of the same syllable may vary considerably due to its contextual information. The Parallel Encoding and Target Approximation (PENTA) model is applied here to Mandarin speech synthesis with a method to predict pitch contours for Chinese syllables with different contexts by combining the Classification And Regression Tree (CART) with the PENTA model to improve its prediction accuracy. CART was first used to cluster the syllables' normalized pitch contours according to the syllables' contextual information and the distances between pitch contours. The average pitch contour of each cluster was then used to train the PENTA model. The PENTA model requires the initial pitch to predict a continuous pitch contour. A Pitch Discontinuity Model (PDM) was used to predict the initial pitches at positions with voiceless consonants and prosodic boundaries. Initial tests on a Chinese four-syllable word corpus containing 2048 words were extended to tests with a continuous speech corpus containing 5445 sentences. The results are satisfactory in terms of the Root Mean Square Error (RMSE) comparing the predicted pitch contour with the original contour. This method can model pitch contours for Mandarin sentences with any text for speech synthesis.
Keywords: speech synthesis; PENTA model; prosody analysis; prosody modeling
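The RMSE criterion mentioned in this abstract can be sketched as follows for two pitch contours sampled at the same time points. This is the standard definition only; the abstract does not state the pitch unit, the sampling of the contours, or any normalization, so the function name `pitch_rmse` and the example values are hypothetical.

```python
import math


def pitch_rmse(predicted, original):
    """Root mean square error between two equally sampled pitch contours."""
    if len(predicted) != len(original) or not predicted:
        raise ValueError("contours must be non-empty and of equal length")
    squared_error = sum((p - o) ** 2 for p, o in zip(predicted, original))
    return math.sqrt(squared_error / len(predicted))


if __name__ == "__main__":
    # Hypothetical contours in Hz, for illustration only.
    print(pitch_rmse([200.0, 210.0, 220.0], [203.0, 213.0, 223.0]))  # 3.0
```

A lower value means the predicted contour tracks the original more closely; the abstract reports only that the results were satisfactory and gives no figures.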