A kind of Web voice browser based on improved synchronous linear predictive coding (ISLPC) and Text-toSpeech (TTS) algorithm and Internet application was proposed. The paper analyzes the features of TTS system wit...A kind of Web voice browser based on improved synchronous linear predictive coding (ISLPC) and Text-toSpeech (TTS) algorithm and Internet application was proposed. The paper analyzes the features of TTS system with ISLPC speech synthesis and discusses the design and implementation of ISLPC TTS-based Web voice browser. The browser integrates Web technology, Chinese information processing, artificial intelligence and the key technology of Chinese ISLPC speech synthesis. It's a visual and audible web browser that can improve information precision for network users. The evaluation results show that ISLPC-based TTS model has a better performance than other browsers in voice quality and capability of identifying Chinese characters.展开更多
Putonghua prosody is characterized by its hierarchical structure when influenced by linguistic environments. Based on this, a neural network, with specially weighted factors and optimizing outputs, is described and ap...Putonghua prosody is characterized by its hierarchical structure when influenced by linguistic environments. Based on this, a neural network, with specially weighted factors and optimizing outputs, is described and applied to construct the Putonghua prosodic model in Text-to-Speech (TTS) system. Extensive tests show that the structure of the neural network characterizes the Putonghua prosody more exactly than traditional models. Learning rate is speeded up and computational precision is improved, which makes the whole prosodic model more efficient. Furthermore, the paper also stylizes the Putonghua syllable pitch contours with SPiS parameters (Syllable Pitch Stylized Parameters), and analyzes them in adjusting the syllable pitch. It shows that the SPiS parameters effectively characterize the Putonghua syllable pitch contours, and facilitate the establishment of the network model and the prosodic controlling.展开更多
This paper describes the design of a unified framework for a multilingual text-to-speech (TTS) synthesis engine - Crystal. The unified framework defines the common TTS modules for different languages and/or dialects...This paper describes the design of a unified framework for a multilingual text-to-speech (TTS) synthesis engine - Crystal. The unified framework defines the common TTS modules for different languages and/or dialects. The interfaces between consecutive modules conform to the speech synthesis markup language (SSML) specification for standardization, interoperability, multilinguality, and extensibility. Detailed module divisions and implementation technologies for the unified framework are introduced, together with possible extensions for the algorithm research and evaluation of the TTS synthesis. Implementation of a mixed-language TTS system for Chinese Putonghua, Chinese Cantonese, and English demonstrates the feasibility of the proposed unified framework.展开更多
The paper describes the development of a text reader for people with vision impairments. The system is designed to extract the content of written documents or commercially printed materials. In terms of hardware, it u...The paper describes the development of a text reader for people with vision impairments. The system is designed to extract the content of written documents or commercially printed materials. In terms of hardware, it utilizes a camera, a small embedded processor board, and an Alexa Echo Dot. The software involves an open source text detection library called Tesseract along with Leptonica and OpenCV. The system in its current version can only work with English text. By using the Amazon cloud web services, a skill set was deployed, which would read aloud the detected text utilizing a OpenCV program via the Alexa Echo Dot. For this development, a Raspberry Pi was utilized as the embedded processor system.展开更多
The Arabic language comes under the category of Semitic languages with an entirely different sentence structure in terms of Natural Language Processing. In such languages, two different words may have identical spelli...The Arabic language comes under the category of Semitic languages with an entirely different sentence structure in terms of Natural Language Processing. In such languages, two different words may have identical spelling whereas their pronunciations and meanings are totally different. To remove this ambiguity, special marks are put above or below? the spelling characters to determine the correct pronunciation. These marks are called diacritics and the language that uses them is called a diacritized language. This paper presents a system for Arabic language diacritization using Hid- den Markov Models (HMMs). The system employs the renowned HMM Tool Kit? (HTK). Each single diacritic is represented as a separate model. The concatenation of output models is coupled with the input? character sequence to form the fully diacritized text. The performance of the proposed system is assessed using a data corpus that includes more than 24000 sentences.展开更多
To improve the performance of human-computer interaction interfaces, emotion is considered to be one of the most important factors. The major objective of expressive speech synthesis is to inject various expressions r...To improve the performance of human-computer interaction interfaces, emotion is considered to be one of the most important factors. The major objective of expressive speech synthesis is to inject various expressions reflecting different emotions to the synthesized speech. To effectively model and control the emotion, emotion intensity is introduced for expressive speech synthesis model to generate speech conveyed the delicate and complicate emotional states. The system was composed of an emotion analysis module with the goal of extracting control emotion intensity vector and a speech synthesis module responsible for mapping text characters to speech waveform. The proposed continuous variable “perception vector” is a data-driven approach of controlling the model to synthesize speech with different emotion intensities. Compared with the system using a one-hot vector to control emotion intensity, this model using perception vector is able to learn the high-level emotion information from low-level acoustic features. In terms of the model controllability and flexibility, both the objective and subjective evaluations demonstrate perception vector outperforms one-hot vector.展开更多
In the paper an intelligent speech production system is established by using language information processing technology. The concept of bi-directional grammar is proposed in Chinese language information processing and...In the paper an intelligent speech production system is established by using language information processing technology. The concept of bi-directional grammar is proposed in Chinese language information processing and a corresponding Chinese characteristic network is completed. Correct text can be generated through grammar parsing and some additional rules. According to the generated text the system generates speech which has good quality in naturalness and intelligibility using Chinese Text-to-Speech Conversion System.展开更多
The employment of non-uniform processes assists greatly in the corpus-based text-to-speech (TTS) system to synthesize natural speech. However, tailoring a TTS voice font, or pruning redundant synthesis instances, us...The employment of non-uniform processes assists greatly in the corpus-based text-to-speech (TTS) system to synthesize natural speech. However, tailoring a TTS voice font, or pruning redundant synthesis instances, usually results in loss of non-uniform synthesis instances. In order to solve this problem, we propose the concept of virtual non-uniform instances. According to this concept and the synthesis frequency of each instance, the algorithm named StaRp-VPA is constructed to make up for the loss of nonuniform instances. In experimental testing, the naturalness scored by the mean opinion score (MOS) remains almost unchanged when less than 50% instances are pruned, and the MOS is only slightly degraded for reduction rates above 50%. The test results show that the algorithm StaRp-VPA is effective.展开更多
基金Supported by the National High-Technology Re-search and Development Program(2005AA122210) the National Out-standing Youth Foundation (60325104)
文摘A kind of Web voice browser based on improved synchronous linear predictive coding (ISLPC) and Text-toSpeech (TTS) algorithm and Internet application was proposed. The paper analyzes the features of TTS system with ISLPC speech synthesis and discusses the design and implementation of ISLPC TTS-based Web voice browser. The browser integrates Web technology, Chinese information processing, artificial intelligence and the key technology of Chinese ISLPC speech synthesis. It's a visual and audible web browser that can improve information precision for network users. The evaluation results show that ISLPC-based TTS model has a better performance than other browsers in voice quality and capability of identifying Chinese characters.
基金This work was supported by the National Natural Science Foundation of China (69875008) and 863National High Technology Project
文摘Putonghua prosody is characterized by its hierarchical structure when influenced by linguistic environments. Based on this, a neural network, with specially weighted factors and optimizing outputs, is described and applied to construct the Putonghua prosodic model in Text-to-Speech (TTS) system. Extensive tests show that the structure of the neural network characterizes the Putonghua prosody more exactly than traditional models. Learning rate is speeded up and computational precision is improved, which makes the whole prosodic model more efficient. Furthermore, the paper also stylizes the Putonghua syllable pitch contours with SPiS parameters (Syllable Pitch Stylized Parameters), and analyzes them in adjusting the syllable pitch. It shows that the SPiS parameters effectively characterize the Putonghua syllable pitch contours, and facilitate the establishment of the network model and the prosodic controlling.
基金Supported by the Guangdong-Hong Kong Technology Cooperation Funding Scheme (No.GHP024/06) of Hong Kong SARthe National Natural Science Foundation of China (No.60805008)the Doctoral Program Foundation of Ministry of Education of China(No.200800031015)
文摘This paper describes the design of a unified framework for a multilingual text-to-speech (TTS) synthesis engine - Crystal. The unified framework defines the common TTS modules for different languages and/or dialects. The interfaces between consecutive modules conform to the speech synthesis markup language (SSML) specification for standardization, interoperability, multilinguality, and extensibility. Detailed module divisions and implementation technologies for the unified framework are introduced, together with possible extensions for the algorithm research and evaluation of the TTS synthesis. Implementation of a mixed-language TTS system for Chinese Putonghua, Chinese Cantonese, and English demonstrates the feasibility of the proposed unified framework.
文摘The paper describes the development of a text reader for people with vision impairments. The system is designed to extract the content of written documents or commercially printed materials. In terms of hardware, it utilizes a camera, a small embedded processor board, and an Alexa Echo Dot. The software involves an open source text detection library called Tesseract along with Leptonica and OpenCV. The system in its current version can only work with English text. By using the Amazon cloud web services, a skill set was deployed, which would read aloud the detected text utilizing a OpenCV program via the Alexa Echo Dot. For this development, a Raspberry Pi was utilized as the embedded processor system.
文摘The Arabic language comes under the category of Semitic languages with an entirely different sentence structure in terms of Natural Language Processing. In such languages, two different words may have identical spelling whereas their pronunciations and meanings are totally different. To remove this ambiguity, special marks are put above or below? the spelling characters to determine the correct pronunciation. These marks are called diacritics and the language that uses them is called a diacritized language. This paper presents a system for Arabic language diacritization using Hid- den Markov Models (HMMs). The system employs the renowned HMM Tool Kit? (HTK). Each single diacritic is represented as a separate model. The concatenation of output models is coupled with the input? character sequence to form the fully diacritized text. The performance of the proposed system is assessed using a data corpus that includes more than 24000 sentences.
基金the results of the research project funded by Natural Science Foundation of Hebei University of Economics and Business (No. 2016KYQ05).
文摘To improve the performance of human-computer interaction interfaces, emotion is considered to be one of the most important factors. The major objective of expressive speech synthesis is to inject various expressions reflecting different emotions to the synthesized speech. To effectively model and control the emotion, emotion intensity is introduced for expressive speech synthesis model to generate speech conveyed the delicate and complicate emotional states. The system was composed of an emotion analysis module with the goal of extracting control emotion intensity vector and a speech synthesis module responsible for mapping text characters to speech waveform. The proposed continuous variable “perception vector” is a data-driven approach of controlling the model to synthesize speech with different emotion intensities. Compared with the system using a one-hot vector to control emotion intensity, this model using perception vector is able to learn the high-level emotion information from low-level acoustic features. In terms of the model controllability and flexibility, both the objective and subjective evaluations demonstrate perception vector outperforms one-hot vector.
文摘In the paper an intelligent speech production system is established by using language information processing technology. The concept of bi-directional grammar is proposed in Chinese language information processing and a corresponding Chinese characteristic network is completed. Correct text can be generated through grammar parsing and some additional rules. According to the generated text the system generates speech which has good quality in naturalness and intelligibility using Chinese Text-to-Speech Conversion System.
基金the National Natural Science Foundation of China (No. 60602017)
文摘The employment of non-uniform processes assists greatly in the corpus-based text-to-speech (TTS) system to synthesize natural speech. However, tailoring a TTS voice font, or pruning redundant synthesis instances, usually results in loss of non-uniform synthesis instances. In order to solve this problem, we propose the concept of virtual non-uniform instances. According to this concept and the synthesis frequency of each instance, the algorithm named StaRp-VPA is constructed to make up for the loss of nonuniform instances. In experimental testing, the naturalness scored by the mean opinion score (MOS) remains almost unchanged when less than 50% instances are pruned, and the MOS is only slightly degraded for reduction rates above 50%. The test results show that the algorithm StaRp-VPA is effective.