In recent years, speech synthesis systems have allowed for the production of very high-quality voices. Therefore, research in this domain is now turning to the problem of integrating emotions into speech. However, the method of constructing a separate speech synthesizer for each emotion has some limitations. First, this method often requires an emotional-speech data set with many sentences; such data sets are very time- and labor-intensive to complete. Second, training each of these models requires computers with large computational capabilities and considerable effort and time for model tuning. In addition, a separate model for each emotion fails to take advantage of the data sets of other emotions. In this paper, we propose a new method to synthesize emotional speech in which the latent expressions of emotions are learned from a small data set of professional actors through a Flowtron model. In addition, we provide a new method to build a speech corpus that is scalable and whose quality is easy to control. Next, to produce a high-quality speech synthesis model, we used this data set to train a Tacotron 2 model, which we then used as a pre-trained model to train the Flowtron model. We applied this method to synthesize Vietnamese speech with sadness and happiness. Mean opinion score (MOS) assessment results show that the MOS is 3.61 for sadness and 3.95 for happiness. In conclusion, the proposed method proves to be more effective for a high degree of automation and fast emotional sentence generation using a small emotional-speech data set.
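The MOS figures quoted above are arithmetic means of listener ratings on the standard 1-5 opinion scale. A minimal sketch (the ratings below are invented for illustration, not the paper's raw data):

```python
from statistics import mean

def mos(ratings):
    """Mean opinion score: arithmetic mean of 1-5 listener ratings."""
    if not all(1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie on the standard 1-5 scale")
    return round(mean(ratings), 2)

# Hypothetical ratings from five listeners for one synthesized emotion.
sad_ratings = [4, 3, 4, 3, 4]
print(mos(sad_ratings))  # -> 3.6
```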
To enhance communication between humans and robots at home in the future, speech synthesis interfaces that can generate expressive speech are indispensable. In addition, synthesizing celebrity voices is commercially important. To address these issues, this paper proposes techniques for synthesizing natural-sounding speech that has a rich prosodic personality using a limited amount of data in a text-to-speech (TTS) system. As the target speaker, we chose a well-known prime minister of Japan, Shinzo Abe, who has a good prosodic personality in his speeches. To synthesize natural-sounding and prosodically rich speech, accurate phrasing, robust duration prediction, and rich intonation modeling are important. For these purposes, we propose pause position prediction based on conditional random fields (CRFs), phone-duration prediction using random forests, and mora-based emphasis context labeling. We examine the effectiveness of the above techniques through objective and subjective evaluations.
A new speech synthesis algorithm based on the LMA filter in a Chinese text-to-speech system is introduced. Using this method, the system can not only generate speech with higher quality but also has a more powerful ability to modify the prosodic parameters, which ensures far more natural and intelligible synthesized speech than before. First, the fundamental principles of the LMA filter and the construction of the synthesizer are presented; then, how to modify the acoustic parameters with this synthesizer is described; finally, a quantitative evaluation of the system's performance is presented in comparison with a relatively successful PSOLA synthesizer, KDTALK_1.
National assessment of speech synthesis systems for Chinese has been carried out regularly since 1994 in China. New guidelines for the assessment activities, which aim at making the assessment work standardizable, (partially) automatizable, and accessible to the public over a computer network, were set up in 1997. Two modules, the phonetic module and the linguistic module, are evaluated individually. The phonetic module is evaluated by using speech intelligibility tests at three levels (syllable, word, and sentence) and speech naturalness tests (in MOS). As for the linguistic module, the text processing ability, which includes word segmentation, polyphonic characters, numerals, years, symbols, and metrological units, is examined automatically.
Self-attention has been innovatively applied to text-to-speech (TTS) because of its parallel structure and superior strength in modeling sequential data. However, when used in end-to-end speech synthesis with an autoregressive decoding scheme, its inference speed becomes relatively low due to the quadratic complexity in sequence length. This problem becomes particularly severe on devices without graphics processing units (GPUs). To alleviate the dilemma, we propose an efficient decoding self-attention (EDSA) module as an alternative. Combined with a dynamic programming decoding procedure, TTS model inference can be effectively accelerated to have a linear computation complexity. We conduct studies on Mandarin and English datasets and find that our proposed model with EDSA can achieve 720% and 50% higher inference speed on the central processing unit (CPU) and GPU, respectively, with almost the same performance. Thus, this method may make the deployment of such models easier when there are limited GPU resources. In addition, our model may perform better than the baseline Transformer TTS on out-of-domain utterances.
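The quadratic-versus-linear distinction can be made concrete by counting attention score computations during autoregressive decoding. The sketch below is not the paper's EDSA module; it only illustrates why bounding the context each step attends to turns the total cost from quadratic to linear in sequence length:

```python
def full_attention_ops(T):
    # Autoregressive decoding: step t attends to all t previous positions,
    # so the total number of score computations grows quadratically in T.
    return sum(t for t in range(1, T + 1))  # equals T*(T+1)/2

def windowed_attention_ops(T, w):
    # If each step attends to at most the last w positions, the total is
    # bounded by T*w, i.e. linear in T for a fixed window size w.
    return sum(min(t, w) for t in range(1, T + 1))

print(full_attention_ops(1000))         # 500500
print(windowed_attention_ops(1000, 8))  # 7972
```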
This paper describes the design of a unified framework for a multilingual text-to-speech (TTS) synthesis engine, Crystal. The unified framework defines the common TTS modules for different languages and/or dialects. The interfaces between consecutive modules conform to the speech synthesis markup language (SSML) specification for standardization, interoperability, multilinguality, and extensibility. Detailed module divisions and implementation technologies for the unified framework are introduced, together with possible extensions for algorithm research and evaluation of TTS synthesis. The implementation of a mixed-language TTS system for Chinese Putonghua, Chinese Cantonese, and English demonstrates the feasibility of the proposed unified framework.
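Because the inter-module interfaces conform to SSML, each module consumes and produces an XML document. A minimal sketch of reading such a document with the Python standard library (the tag names follow the W3C SSML specification; the fragment's content is invented for illustration):

```python
import xml.etree.ElementTree as ET

# A minimal SSML fragment; <speak> and <break> are standard SSML elements.
ssml = """<speak version="1.0">
  Hello <break time="300ms"/> world.
</speak>"""

root = ET.fromstring(ssml)
words = "".join(root.itertext()).split()           # plain text content
breaks = [b.get("time") for b in root.iter("break")]  # pause durations
print(words)   # ['Hello', 'world.']
print(breaks)  # ['300ms']
```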
A national assessment of the performance of speech synthesis systems for Chinese has been carried out yearly since 1994. The quality of the synthetic speech of five different systems was evaluated and diagnosed by using speech intelligibility tests. The listeners were 16 college students (8 male, 8 female) with no experience with synthetic speech; they were asked to complete an open-response task with pencil and paper. In addition, speech naturalness was measured by mean opinion score (MOS).
Latent information is difficult to obtain from text in speech synthesis. Studies show that features from speech can provide more information to help text encoding. In the field of speech encoding, much work has been conducted on two aspects. The first is to encode speech frame by frame; the second is to encode the whole speech into a single vector. But the scale in these aspects is fixed, so encoding speech with an adjustable scale for more latent information is worthy of investigation. However, current alignment approaches only support frame-by-frame encoding and speech-to-vector encoding, and it remains a challenge to propose a new alignment approach that supports adjustable-scale speech encoding. This paper presents the dynamic speech encoder, with a new alignment approach used in conjunction with frame-by-frame encoding and speech-to-vector encoding. The speech feature from our model achieves three functions. First, the speech feature can reconstruct the original speech while the length of the speech feature is equal to the text length. Second, our model can obtain a text embedding from speech, and the encoded speech feature is similar to the text embedding result. Finally, it can transfer the style of synthesized speech and make it more similar to a given reference speech.
A sinusoidal representation of speech and a cochlear model are used to extract speech parameters in this paper, and a speech analysis/synthesis system controlled by the auditory spectrum is developed with the model. Computer simulation shows that speech can be synthesized with only 12 parameters per frame on average. The method has the advantages of few parameters, low complexity, and high performance of speech representation. The synthetic speech has high intelligibility.
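In a sinusoidal representation, each frame is rendered as a sum of sinusoids. A minimal sketch, assuming the 12 parameters per frame are six (amplitude, frequency) pairs with phases fixed at zero; the parameter values and frame/sample-rate choices below are purely illustrative, not the paper's analysis output:

```python
import math

def synth_frame(params, n_samples=160, sr=8000):
    """Render one frame as a sum of sinusoids.

    params: list of (amplitude, frequency_hz) pairs; six pairs give the
    12 parameters per frame mentioned in the abstract (phases assumed 0
    here purely for illustration).
    """
    return [
        sum(a * math.sin(2 * math.pi * f * n / sr) for a, f in params)
        for n in range(n_samples)
    ]

frame = synth_frame([(0.5, 200), (0.3, 400), (0.2, 600),
                     (0.1, 800), (0.05, 1000), (0.05, 1200)])
print(len(frame))  # 160
```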
The quality of the prosody model directly affects the naturalness of synthesized speech. Aimed at the difficulty of generating the pitch contour in a prosody model, two pitch models, namely a corpus-based pitch model and a pitch pattern model, are studied in depth in this paper. Key problems in the corpus-based model are the calculation of the distance and the search for the optimal path with a dynamic programming algorithm. For the pitch pattern model, parameters such as pitch pattern, pitch average, and pitch range are used to describe the pitch contour, and six pitch patterns are presented. For the generation of the pitch contour, the pitch pattern model is more flexible than the corpus-based model. Both models are linked to a real TTS system, and the MOS results for synthesized Mandarin speech show that the pitch pattern model is better than the corpus-based pitch model.
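To illustrate the pattern/average/range parameterization, here is a hypothetical mapping from those three parameters to a contour. The abstract does not specify its six patterns, so only two (rising, falling) are sketched, using linear interpolation across the pitch range; treat this as an illustration of the idea, not the paper's model:

```python
def pitch_contour(pattern, average, pitch_range, n=10):
    """Generate an n-point pitch contour (Hz) from pattern, average, range.

    Only 'rising' and 'falling' are sketched; the abstract's six patterns
    are not specified, so this mapping is purely illustrative.
    """
    lo, hi = average - pitch_range / 2, average + pitch_range / 2
    if pattern == "rising":
        start, end = lo, hi
    elif pattern == "falling":
        start, end = hi, lo
    else:
        raise ValueError("unsketched pattern: " + pattern)
    step = (end - start) / (n - 1)
    return [start + i * step for i in range(n)]

c = pitch_contour("rising", average=200.0, pitch_range=60.0)
print(c[0], c[-1])  # spans 170.0 .. 230.0
```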
The term "experimental" in the title means that the synthesizer is constructed as a tool to conduct experiments for investigating the influence of a unit's environment on how it sounds. As a tool for testing hypotheses and experimental results, the synthesizer satisfies three conditions: independence from the selection of the unit for synthesis (a word or any part of it); taking into account the environment of the unit (left and right contexts and the position of the unit); and independence from the content of the base. Such a synthesizer is a good tool for studying many aspects of speech and removes the problem of selection. We can vary the unit and the other parameters described in this paper with the same synthesizer, synthesize the same text, and listen to the results directly. This paper describes the formal structure of an experimental Georgian speech synthesizer.
A method to synthesize formant-targeted sounds based on a speech production model and a Reflection-Type Line Analog (RTLA) articulatory synthesis model is presented. The synthesis model is implemented with a scattering process derived from an RTLA of the vocal tract system according to the acoustic mechanism of speech production. The vocal-tract area function that controls the synthesis model is derived from the first three formant trajectories by using the inverse solution of speech production. The proposed method not only gives good naturalness and dynamic smoothness, but is also capable of controlling or modifying speech timbres easily and flexibly. Furthermore, it needs fewer control parameters and a very low update rate for the parameters.
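A reflection-type line analog treats the vocal tract as concatenated tube sections; the scattering at each junction is governed by a reflection coefficient computed from the adjacent cross-sectional areas. The sketch below uses the standard Kelly-Lochbaum form of that coefficient (a textbook result in articulatory synthesis; sign conventions vary between texts, and this may not match the paper's exact formulation):

```python
def reflection_coefficients(areas):
    """Reflection coefficient at each tube junction from the area function.

    Standard Kelly-Lochbaum form k = (A_next - A) / (A_next + A); this is
    one common sign convention, not necessarily the paper's.
    """
    return [(a2 - a1) / (a2 + a1) for a1, a2 in zip(areas, areas[1:])]

# A toy vocal-tract area function (cm^2), not measured data.
ks = reflection_coefficients([2.0, 4.0, 4.0, 1.0])
print(ks)  # [0.3333..., 0.0, -0.6]
```

Equal adjacent areas yield a zero coefficient (no reflection), as the middle junction above shows.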
This paper describes the latest version of the Chinese-Japanese-English handheld speech-to-speech translation system developed by NICT/ATR, which is now ready to be deployed for travelers. With the entire speech-to-speech translation function implemented in one terminal, it realizes real-time, location-free speech-to-speech translation. A new noise-suppression technique notably improves the speech recognition performance. Corpus-based approaches to speech recognition, machine translation, and speech synthesis enable coverage of a wide variety of topics and portability to other languages. Test results show that the character accuracy of speech recognition is 82%-94% for Chinese speech, and the bilingual evaluation understudy (BLEU) score of machine translation is 0.55-0.74 for Chinese-Japanese and Chinese-English.
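The BLEU score quoted above is built from clipped (modified) n-gram precision. A unigram-only sketch of that ingredient; full BLEU also combines higher-order n-grams and a brevity penalty, both omitted here:

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Modified unigram precision, the basic ingredient of BLEU.

    Each candidate word's count is clipped by its count in the reference,
    so repeating a correct word cannot inflate the score.
    """
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    clipped = sum(min(c, ref[w]) for w, c in cand.items())
    return clipped / sum(cand.values())

p = clipped_unigram_precision("the the cat", "the cat sat")
print(round(p, 3))  # 0.667 -- the duplicated "the" is clipped to 1
```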
The employment of non-uniform processes greatly assists a corpus-based text-to-speech (TTS) system in synthesizing natural speech. However, tailoring a TTS voice font, or pruning redundant synthesis instances, usually results in the loss of non-uniform synthesis instances. To solve this problem, we propose the concept of virtual non-uniform instances. Based on this concept and the synthesis frequency of each instance, an algorithm named StaRp-VPA is constructed to make up for the loss of non-uniform instances. In experimental testing, the naturalness scored by the mean opinion score (MOS) remains almost unchanged when fewer than 50% of instances are pruned, and the MOS is only slightly degraded for reduction rates above 50%. The test results show that the StaRp-VPA algorithm is effective.
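The pruning step the abstract refers to can be pictured as frequency-guided selection. The sketch below is a generic baseline (keep the most frequently used instances up to a target reduction rate), not the paper's StaRp-VPA algorithm, whose virtual-instance bookkeeping is not specified here:

```python
def prune_by_frequency(instances, freqs, reduction_rate):
    """Keep the most frequently used synthesis instances.

    A generic frequency-based pruning sketch: drop the least-used
    instances until the requested reduction rate is reached.
    """
    keep = len(instances) - int(len(instances) * reduction_rate)
    ranked = sorted(instances, key=lambda i: freqs[i], reverse=True)
    return ranked[:keep]

insts = ["a", "b", "c", "d"]
freqs = {"a": 10, "b": 1, "c": 7, "d": 3}
print(prune_by_frequency(insts, freqs, 0.5))  # ['a', 'c']
```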
To solve students' dictation problems, a speech dictation system based on character recognition is proposed in this paper. The system applies offline handwritten Chinese character recognition technology: it denoises the image with a Gaussian filter, segments the text with a projection method, and converts the image to text with OCR technology. The straight-line marks in the picture are detected by the Hough transform, and then the SKB-FSS and WST algorithms are used for speech synthesis. Experiments show that the system can effectively assist students in dictation.
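The projection method mentioned above segments text by summing ink per row and cutting at blank rows. A minimal sketch on a toy binary image (the image data is invented; real systems run this on the denoised scan):

```python
def segment_lines(image):
    """Split a binary image into text lines by horizontal projection.

    image: list of rows, each a list of 0/1 pixels. Rows whose pixel sum
    is zero are treated as gaps between lines; returns (start, end)
    row ranges with end exclusive.
    """
    profile = [sum(row) for row in image]
    lines, start = [], None
    for i, ink in enumerate(profile + [0]):  # trailing 0 closes the last run
        if ink and start is None:
            start = i
        elif not ink and start is not None:
            lines.append((start, i))
            start = None
    return lines

img = [[0, 0, 0],
       [1, 1, 0],
       [0, 1, 0],
       [0, 0, 0],
       [1, 0, 1]]
print(segment_lines(img))  # [(1, 3), (4, 5)]
```

The same idea applied to column sums then splits each line into characters.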
Obtaining training material for rarely used English words and common given names from countries where English is not spoken is difficult due to excessive time, storage, and cost factors. Considering personal privacy, language-independent (LI) with lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a convenient option to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR in real-time applications with limited storage and small vocabularies. These applications include voice dialing on mobile devices, menu-driven recognition, and voice control in vehicles and robotics. However, traditional DTW has several limitations, such as high computational complexity, coarse approximation induced by constraints, and inaccuracy problems. In this paper, we introduce the merge-weighted dynamic time warping (MWDTW) algorithm. This method defines a template confidence index for measuring the similarity between merged training data and testing data while following the core DTW process. MWDTW is simple, efficient, and easy to implement. With extensive experiments on three representative SD speech recognition datasets, we demonstrate that our method significantly outperforms DTW, DTW on merged speech data, and the hidden Markov model (HMM), and is also six times faster than DTW overall.
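The core DTW process that MWDTW builds on is a textbook dynamic program. A minimal sketch with an absolute-difference local cost on 1-D sequences (real ASR uses frame-level feature vectors; this is the recurrence, not the proposed MWDTW variant):

```python
def dtw(a, b):
    """Textbook dynamic time warping distance with |x - y| local cost."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0 -- the repeated 2 is absorbed
print(dtw([1, 2, 3], [2, 2, 4]))     # 2.0
```

The two nested loops make the quadratic complexity the abstract criticizes directly visible.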
Several methods were developed to improve grapheme-to-phoneme (G2P) conversion models for Chinese text-to-speech (TTS) systems. The critical problem of data sparsity was handled by combining approaches. First, a text-selection method was designed to cover as many G2P text corpus contexts as possible. Then, various data-driven modeling methods were compared to select the best method for each polyphonic word. Finally, independent models were used for some neutral-tone words in addition to the normal G2P models to achieve more compact and flexible G2P models. Tests show that these methods reduce the relative errors by 50% for both normal polyphonic words and Chinese neutral tones.
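The polyphonic-word problem can be illustrated with a toy context-rule G2P. The character 行 really is polyphonic ("xing2" in 行人, "hang2" in 银行), but the rule tables below are invented for illustration; the paper's data-driven models learn such decisions from corpora rather than hand-written rules:

```python
DEFAULT = {"行": "xing2"}
AFTER = {("银", "行"): "hang2"}   # pronunciation of 行 when preceded by 银
BEFORE = {("行", "业"): "hang2"}  # pronunciation of 行 when followed by 业

def g2p(text):
    """Per-character G2P: context overrides first, then the default."""
    out = []
    for i, ch in enumerate(text):
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        pron = AFTER.get((prev, ch)) or BEFORE.get((ch, nxt)) \
               or DEFAULT.get(ch, ch)  # unknown chars pass through
        out.append(pron)
    return out

print(g2p("银行"))  # ['银', 'hang2']
print(g2p("行人"))  # ['xing2', '人']
```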
In continuous speech, the pitch contour of the same syllable may vary greatly with its contextual information. The Parallel Encoding and Target Approximation (PENTA) model is applied here to Mandarin speech synthesis, with a method to predict pitch contours for Chinese syllables in different contexts by combining the Classification And Regression Tree (CART) with the PENTA model to improve its prediction accuracy. CART was first used to cluster the syllables' normalized pitch contours according to the syllables' contextual information and the distances between pitch contours. The average pitch contour of each cluster was then used to train the PENTA model. The PENTA model requires an initial pitch to predict a continuous pitch contour, so a Pitch Discontinuity Model (PDM) was used to predict the initial pitches at positions with voiceless consonants and prosodic boundaries. Initial tests on a Chinese four-syllable word corpus containing 2048 words were extended to tests on a continuous speech corpus containing 5445 sentences. The results are satisfactory in terms of the Root Mean Square Error (RMSE) between the predicted pitch contour and the original contour. This method can model pitch contours for Mandarin sentences with any text for speech synthesis.
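The RMSE criterion used to score the predicted contours is a direct computation. A minimal sketch (the contour values are illustrative, not data from the paper):

```python
import math

def rmse(predicted, original):
    """Root mean square error between two equal-length pitch contours."""
    if len(predicted) != len(original):
        raise ValueError("contours must have the same length")
    se = sum((p - o) ** 2 for p, o in zip(predicted, original))
    return math.sqrt(se / len(predicted))

# Toy contours in Hz.
print(rmse([200.0, 210.0, 220.0], [200.0, 212.0, 216.0]))  # about 2.58
```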
Funding: This work was funded by the Hanoi University of Science and Technology (HUST) under grant number T2018-PC-210.
Funding: Project supported by the National Key Research and Development Program of China (No. 2019YFB1312603) and the Robotics Institute of Zhejiang University, China (No. K11801).
Funding: Supported by the Guangdong-Hong Kong Technology Cooperation Funding Scheme (No. GHP024/06) of Hong Kong SAR, the National Natural Science Foundation of China (No. 60805008), and the Doctoral Program Foundation of the Ministry of Education of China (No. 200800031015).
Funding: Supported by the National Key R&D Program of China (2020AAA0107901).
Funding: Sponsored by the National Natural Science Foundation of China (Grant No. 60503071), the 973 National Basic Research Program of China (Grant No. 2004CB318102), and the Postdoctoral Science Foundation of China (Grant No. 20070420275).
Funding: This work is supported by the National Natural Science Foundation of China (69972046) and the NSF of Zhejiang Province (698076).
Funding: Supported by the National Natural Science Foundation of China (No. 60602017).
Abstract: The employment of non-uniform processes greatly assists the corpus-based text-to-speech (TTS) system in synthesizing natural speech. However, tailoring a TTS voice font, or pruning redundant synthesis instances, usually results in the loss of non-uniform synthesis instances. To solve this problem, we propose the concept of virtual non-uniform instances. Based on this concept and the synthesis frequency of each instance, the algorithm StaRp-VPA is constructed to compensate for the loss of non-uniform instances. In experimental testing, naturalness as scored by the mean opinion score (MOS) remains almost unchanged when fewer than 50% of the instances are pruned, and the MOS is only slightly degraded for reduction rates above 50%. The test results show that the StaRp-VPA algorithm is effective.
Funding: This article is supported by the 2020 Innovation and Entrepreneurship Training Program for College Students in Jiangsu Province (project: "Mom doesn't have to worry about my dictation any more" - dictation software based on character recognition, No. 202011460104T), by the National Natural Science Foundation of China Youth Science Foundation (project: Research on Deep Discriminant Sparse Representation Learning Method for Feature Extraction, No. 61806098), and by the Scientific Research Project of Nanjing Xiaozhuang University (project: Multi-robot collaborative system, No. 2017NXY16).
Abstract: To solve students' dictation problems, a speech dictation system based on character recognition is proposed in this paper. The system applies offline handwritten Chinese character recognition technology: it denoises the image with a Gaussian filter, segments the text with the projection method, and converts the image to text with OCR technology. Straight-line marks in the picture are detected with the Hough transform, and then the SKB-FSS and WST algorithms are used for speech synthesis. Experiments show that the system can effectively assist students in dictation.
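The projection method mentioned for text segmentation can be sketched as follows: sum the ink pixels along each row of the binarized image and treat zero-sum rows as gaps between text lines. This is a generic illustration of the technique, not the paper's implementation.

```python
import numpy as np

def segment_lines(binary_img):
    """Split a binarized image (text pixels = 1, background = 0) into
    horizontal text lines using the row-projection profile: contiguous
    runs of rows containing ink become line spans [start, end)."""
    profile = binary_img.sum(axis=1)   # amount of ink in each row
    has_ink = profile > 0
    lines, start = [], None
    for y, ink in enumerate(has_ink):
        if ink and start is None:
            start = y                  # a new text line begins
        elif not ink and start is not None:
            lines.append((start, y))   # the line ended at row y
            start = None
    if start is not None:              # line running to the bottom edge
        lines.append((start, len(has_ink)))
    return lines
```

The same profile taken along columns (`axis=0`) separates individual characters within a segmented line.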
Funding: Supported by the Research Plan Project of the National University of Defense Technology under Grant No. JC13-06-01, and by the OCRit Project made possible by the Global Leadership Round in Genomics & Life Sciences Grant (GL2).
Abstract: Obtaining training material for rarely used English words and for common given names from countries where English is not spoken is difficult due to excessive time, storage, and cost factors. Considering personal privacy, language-independent (LI) ASR with lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a convenient option for solving the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR in real-time applications with limited storage and small vocabularies. These applications include voice dialing on mobile devices, menu-driven recognition, and voice control of vehicles and robots. However, traditional DTW has several limitations, such as high computational complexity, coarse approximation induced by its constraints, and inaccuracy problems. In this paper, we introduce the merge-weighted dynamic time warping (MWDTW) algorithm. This method defines a template confidence index for measuring the similarity between merged training data and testing data, while following the core DTW process. MWDTW is simple, efficient, and easy to implement. With extensive experiments on three representative SD speech recognition data sets, we demonstrate that our method significantly outperforms DTW, DTW on merged speech data, and the hidden Markov model (HMM), and is also six times faster than DTW overall.
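The "core DTW process" the abstract builds on is the standard dynamic-programming recurrence over a frame-distance matrix. The sketch below shows that baseline DTW (with Euclidean frame distance and the usual three-way step), not the proposed MWDTW, whose merge weighting and confidence index are specific to the paper.

```python
import numpy as np

def dtw_distance(x, y):
    """Baseline dynamic time warping between two feature sequences
    (e.g. MFCC frame sequences). D[i, j] holds the minimum accumulated
    cost of aligning the first i frames of x with the first j of y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:                    # promote scalar frames to vectors
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]
```

The O(nm) inner loop is the computational-complexity limitation the abstract refers to; band constraints (e.g. Sakoe-Chiba) reduce it at the cost of the coarse approximation also mentioned there.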
Abstract: Several methods were developed to improve grapheme-to-phoneme (G2P) conversion models for Chinese text-to-speech (TTS) systems. The critical problem of data sparsity was handled by combining several approaches. First, a text-selection method was designed to cover as many contexts in the G2P text corpus as possible. Then, various data-driven modeling methods were compared to select the best method for each polyphonic word. Finally, independent models were used for some neutral-tone words in addition to the normal G2P models, yielding more compact and flexible G2P models. Tests show that these methods reduce the relative errors by 50% for both normal polyphonic words and Chinese neutral tones.
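The "independent model per polyphonic word" design can be illustrated with a toy dispatcher: each polyphonic character owns its own context-conditioned model, and unambiguous characters fall back to a plain dictionary. The rule and the tiny lexicon below are illustrative stand-ins for the paper's trained data-driven models.

```python
def g2p(word, polyphone_models, lexicon):
    """Per-polyphone G2P: a character with its own model is pronounced
    by that model from its (left, right) character context; all other
    characters use a simple lexicon lookup."""
    phones = []
    for i, ch in enumerate(word):
        if ch in polyphone_models:
            context = (word[i - 1] if i > 0 else "",
                       word[i + 1] if i + 1 < len(word) else "")
            phones.append(polyphone_models[ch](context))
        else:
            phones.append(lexicon[ch])
    return phones

# Toy model for the real Mandarin polyphone 行:
# "hang2" after 银 (as in 银行, "bank"), otherwise "xing2" (as in 行人).
polyphone_models = {"行": lambda ctx: "hang2" if ctx[0] == "银" else "xing2"}
lexicon = {"银": "yin2", "人": "ren2"}
```

In the paper's setting the lambda would be replaced by whichever data-driven classifier won the per-word comparison.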
Funding: Supported by the National Natural Science Foundation of China (Nos. 60805008, 60928005, and 61003094) and the Ph.D. Programs Foundation of the Ministry of Education of China (No. 200800031015).
Abstract: In continuous speech, the pitch contour of the same syllable may vary greatly with its context. The Parallel Encoding and Target Approximation (PENTA) model is applied here to Mandarin speech synthesis, with a method that predicts pitch contours for Chinese syllables in different contexts by combining a Classification And Regression Tree (CART) with the PENTA model to improve prediction accuracy. CART was first used to cluster the syllables' normalized pitch contours according to the syllables' contextual information and the distances between pitch contours; the average contour of each cluster was then used to train the PENTA model. The PENTA model requires an initial pitch to predict a continuous pitch contour, so a Pitch Discontinuity Model (PDM) was used to predict the initial pitches at positions with voiceless consonants and at prosodic boundaries. Initial tests on a Chinese four-syllable-word corpus containing 2048 words were extended to tests with a continuous-speech corpus containing 5445 sentences. The results are satisfactory in terms of the Root Mean Square Error (RMSE) between the predicted and original pitch contours. This method can model pitch contours for Mandarin sentences with any text for speech synthesis.
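The target-approximation idea behind PENTA can be sketched with a simplified first-order version: within a syllable, F0 starts from the initial pitch (here supplied by the PDM) and decays exponentially toward a linear pitch target. The full quantitative PENTA model uses a third-order filter; the parameter values below are purely illustrative.

```python
import numpy as np

def target_approximation(b, m, f0_init, dur, rate, n=100):
    """First-order target approximation: F0 moves from f0_init toward
    the linear pitch target T(t) = b + m*t over a syllable of duration
    dur seconds, with the gap shrinking as exp(-rate * t)."""
    t = np.linspace(0.0, dur, n)
    target = b + m * t
    return target + (f0_init - b) * np.exp(-rate * t)
```

A fast `rate` makes the contour reach its target early in the syllable, while a slow rate leaves a carry-over effect from the previous syllable, which is exactly the context dependence the CART clustering is meant to capture.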