The acoustic characteristics of the Chinese vowels of 24 children with cleft palate and 10 normal control children were analyzed with a computerized speech signal processing system (CSSPS), and speech articulation was judged with the Glossary of Cleft Palate Speech (GCPS). The listening judgement showed that speech articulation differed significantly between the two groups (P<0.01). The objective quantitative measurement suggested that the formant pattern (FP) of vowels in children with cleft palate differed from that of the normal control children for every vowel except [a] (P<0.05). The acoustic vowel graph of the Chinese vowels, which directly demonstrates the relationship between vowel space and speech perception, was plotted from the first formant frequency (F1) and the second formant frequency (F2). The authors conclude that the values of F1 and F2 indicate upward and backward tongue movement to close the cleft, which reflects the vocal transmission characteristics of cleft palate speech.
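For readers unfamiliar with formant measurement, the sketch below shows one conventional way to estimate F1 and F2 from a single voiced frame (autocorrelation LPC followed by root-finding). The abstract does not state how CSSPS computes its formant pattern, so the model order, sampling rate and thresholds here are illustrative assumptions only.

```python
# Illustrative sketch of estimating F1/F2 from one voiced frame via autocorrelation
# LPC and root-finding; not the CSSPS implementation used by the authors.
import numpy as np

def lpc_coeffs(frame, order):
    """Autocorrelation-method LPC coefficients via the Levinson-Durbin recursion."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i + 1):
            a[j] = a_prev[j] + k * a_prev[i - j]
        err *= (1.0 - k * k)
    return a

def first_two_formants(frame, sr, order=None):
    """Rough F1/F2 estimate; assumes a voiced frame with resonances above 90 Hz."""
    order = order or int(sr / 1000) + 2
    a = lpc_coeffs(frame * np.hamming(len(frame)), order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                 # keep one of each conjugate pair
    freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
    freqs = freqs[freqs > 90]                         # drop near-DC roots
    return freqs[0], freqs[1]

# Usage (vowel_frame would be a NumPy array holding one windowed vowel segment):
# f1, f2 = first_two_formants(vowel_frame, sr=16000)
```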
As one of the most effective methods to improve the accuracy and robustness of speech tasks, the audio-visual fusion approach has recently been introduced into the field of Keyword Spotting (KWS). However, existing audio-visual keyword spotting models are limited to detecting isolated words, while keyword spotting for unconstrained speech is still a challenging problem. To this end, an Audio-Visual Keyword Transformer (AVKT) network is proposed to spot keywords in unconstrained video clips. The authors present a transformer classifier with learnable CLS tokens to extract distinctive keyword features from the variable-length audio and visual inputs. The outputs of the audio and visual branches are combined in a decision fusion module. As humans can easily notice whether a keyword appears in a sentence or not, our AVKT network can detect whether a video clip with a spoken sentence contains a pre-specified keyword. Moreover, the position of the keyword is localised in the attention map without additional position labels. Experimental results on the LRS2-KWS dataset and our newly collected PKU-KWS dataset show that the accuracy of AVKT exceeded 99% in clean scenes and 85% in extremely noisy conditions. The code is available at https://github.com/jialeren/AVKT.
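The following PyTorch sketch illustrates the two ingredients named in the abstract, a transformer classifier with a learnable CLS token over variable-length features and a simple decision-fusion step. The layer sizes, number of classes and averaging fusion rule are assumptions, not the published AVKT configuration.

```python
# Illustrative sketch only; hyperparameters and fusion rule are assumptions.
import torch
import torch.nn as nn

class CLSTransformerBranch(nn.Module):
    def __init__(self, feat_dim=256, n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, feat_dim))  # learnable CLS token
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, x):                       # x: (batch, time, feat_dim)
        cls = self.cls.expand(x.size(0), -1, -1)
        h = self.encoder(torch.cat([cls, x], dim=1))
        return self.head(h[:, 0])               # classify from the CLS position

audio_branch = CLSTransformerBranch()
visual_branch = CLSTransformerBranch()
audio_feats = torch.randn(8, 120, 256)          # dummy variable-length inputs
visual_feats = torch.randn(8, 75, 256)
# Decision fusion: average the per-branch keyword probabilities.
probs = 0.5 * (audio_branch(audio_feats).softmax(-1)
               + visual_branch(visual_feats).softmax(-1))
```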
Sound indexing and segmentation of digital documents, especially on the internet and in digital libraries, are very useful for simplifying and accelerating multimedia document retrieval. One can imagine extracting multimedia files not only by keywords but also by the semantic content of the speech. The main difficulty of this operation is the parameterization and modelling of the sound track and the discrimination of the speech, music and noise segments. In this paper, we present a Speech/Music/Noise indexing interface designed for audio discrimination in multimedia documents. The program uses a statistical method based on ANN and HMM classifiers. After pre-emphasis and segmentation, the audio segments are analysed by the cepstral acoustic analysis method. The developed system was evaluated on a database consisting of music songs with Arabic speech segments under several noisy environments.
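A minimal sketch of the described pipeline, pre-emphasis, cepstral (MFCC) analysis and a neural classifier for speech/music/noise, is given below. It substitutes a small scikit-learn MLP for the paper's ANN/HMM combination, and the file names and labels are hypothetical.

```python
# Sketch: cepstral front end plus a neural stand-in classifier; paths are hypothetical.
import librosa
import numpy as np
from sklearn.neural_network import MLPClassifier

def segment_features(path, sr=16000):
    y, sr = librosa.load(path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])           # pre-emphasis
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # cepstral analysis
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

train_paths = ["speech_01.wav", "music_01.wav", "noise_01.wav"]   # hypothetical files
train_labels = ["speech", "music", "noise"]
X = np.stack([segment_features(p) for p in train_paths])
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, train_labels)
print(clf.predict([segment_features("unknown_segment.wav")]))     # hypothetical file
```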
An important concern within the deaf community is the partial or total inability to hear. This may affect the development of language during childhood, which limits their habitual existence. Consequently, to facilitate deaf speakers through assistive mechanisms, an effort has been made to understand the acoustic characteristics of deaf speakers by evaluating territory-specific utterances. Speech signals were acquired from 32 normal and 32 deaf speakers uttering ten native Indian Tamil-language words. Speech parameters such as pitch, formants, signal-to-noise ratio, energy, intensity, jitter and shimmer were analyzed. From the results, it has been observed that the acoustic characteristics of deaf speakers differ significantly, and their quantitative measures dominate those of the normal speakers for the words considered. The study also reveals that the informative part of speech in normal and deaf speakers may be identified using the acoustic features. In addition, these attributes may be used for differential correction of deaf speakers' speech signals and may help listeners understand the conveyed information.
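Two of the voice-quality measures listed above, jitter and shimmer, can be computed in their simplest "local" form as sketched below; the paper does not state which variant or extraction tool was used, so this is only an illustrative definition.

```python
# Local jitter/shimmer from already-extracted glottal cycle lengths and amplitudes.
import numpy as np

def local_jitter(periods_s):
    """Mean absolute difference of consecutive periods, relative to the mean period."""
    periods = np.asarray(periods_s, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """Mean absolute difference of consecutive cycle amplitudes, relative to the mean."""
    amps = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amps))) / np.mean(amps)

# Example with made-up cycle data (values are purely illustrative).
periods = [0.0102, 0.0101, 0.0103, 0.0100, 0.0104]
amps = [0.81, 0.79, 0.84, 0.80, 0.77]
print(f"jitter = {local_jitter(periods):.4f}, shimmer = {local_shimmer(amps):.4f}")
```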
Parkinson's disease (PD), one of whose symptoms is dysphonia, is a prevalent neurodegenerative disease. The use of outdated diagnosis techniques, which yield inaccurate and unreliable results, continues to represent an obstacle in early-stage detection and diagnosis for clinical professionals in the medical field. To solve this issue, the study proposes using machine learning and deep learning models to analyze processed speech signals of patients' voice recordings. Datasets of these processed speech signals were obtained and experimented on by random forest and logistic regression classifiers. Results were highly successful, with 90% accuracy produced by the random forest classifier and 81.5% by the logistic regression classifier. Furthermore, a deep neural network was implemented to investigate if such variation in method could add to the findings. It proved to be effective, as the neural network yielded an accuracy of nearly 92%. Such results suggest that it is possible to accurately diagnose early-stage PD through merely testing patients' voices. This research calls for a revolutionary diagnostic approach in decision support systems, and is the first step in a market-wide implementation of healthcare software dedicated to the aid of clinicians in early diagnosis of PD.
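The classification experiment can be sketched as follows with scikit-learn. The feature file name and column layout are assumptions, and the accuracy figures quoted in the abstract come from the authors' data, not from this toy script.

```python
# Sketch: random forest and logistic regression on pre-extracted voice features.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("parkinsons_voice_features.csv")         # hypothetical feature file
X, y = df.drop(columns=["status"]), df["status"]           # status: 1 = PD, 0 = healthy (assumed)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, clf in [("random forest", RandomForestClassifier(n_estimators=200)),
                  ("logistic regression", LogisticRegression(max_iter=1000))]:
    clf.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, clf.predict(X_te)))
```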
In this paper, the frequency-domain Frost algorithm is enhanced by using conjugate gradient techniques for speech enhancement. Unlike the non-adaptive approach of computing the optimum minimum variance distortionless response (MVDR) solution through correlation matrix inversion, the Frost algorithm, implementing the stochastic constrained least mean square (LMS) algorithm, can adaptively converge to the MVDR solution in the mean-square sense, but with a very slow convergence rate. To speed up the convergence, we propose a frequency-domain constrained conjugate gradient (FDCCG) algorithm. The devised FDCCG algorithm avoids the matrix inversion and exhibits fast convergence. Speech enhancement experiments for a target speech signal corrupted by two and by five interfering speech signals, conducted with a four-channel acoustic-vector-sensor (AVS) microphone array, demonstrate its superior performance.
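For reference, the non-adaptive MVDR baseline mentioned above can be written per frequency bin as w = R^{-1} d / (d^H R^{-1} d). The sketch below evaluates it directly by solving the linear system, whereas the Frost and FDCCG algorithms of the paper avoid this matrix inversion; the array size and steering vector are illustrative assumptions.

```python
# Direct (non-adaptive) MVDR weights for one frequency bin; toy data only.
import numpy as np

def mvdr_weights(R, d):
    """Minimum variance distortionless response weights: w = R^{-1} d / (d^H R^{-1} d)."""
    Rinv_d = np.linalg.solve(R, d)                 # avoids forming R^{-1} explicitly
    return Rinv_d / (d.conj() @ Rinv_d)

rng = np.random.default_rng(0)
n = rng.standard_normal((4, 1000)) + 1j * rng.standard_normal((4, 1000))
R = (n @ n.conj().T) / n.shape[1]                  # estimated correlation matrix
d = np.exp(-1j * 2 * np.pi * 0.1 * np.arange(4))   # assumed steering vector
w = mvdr_weights(R, d)
print(np.abs(w.conj() @ d))                        # distortionless constraint: ~1
```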
Research on the features of speech and image signals is carried out from two perspectives, the time domain and the frequency domain. Speech and image signals are non-stationary, so the Fourier transform alone cannot capture their non-stationary characteristics. When short-term stationary speech is obtained by windowing and framing, the subsequent processing of the signal is completed with the Discrete Fourier Transform (DFT). The Fast Fourier Transform is a commonly used analysis method for speech and image signal processing in the frequency domain, but it has the problem of adjusting the window size to obtain a desired resolution. The Fractional Fourier Transform, in contrast, offers both time-domain and frequency-domain processing capabilities. This paper performs speech encryption by combining speech with an image through the Fractional Fourier Transform: a watermark image processed by the fractional transform is embedded in the speech signal, and the embedded watermark has the effect of rotation and superposition, which improves the security of the speech. The results show that the proposed speech encryption method achieves a higher security level through the Fractional Fourier Transform, and the technique is easy to extend to practical applications.
Speech and natural language content are major tools of communication. This research paper presents a natural language processing based automated system for understanding speech language text. A new rule-based model is presented for analyzing natural languages and extracting the relative meanings from a given text. The user writes the natural language text in simple English in a few paragraphs, and the designed system has a sound ability to analyze the given script. After composite analysis and extraction of the associated information, the designed system assigns particular meanings to an assortment of speech language text on the basis of its context. The designed system uses standard speech language rules that are clearly defined for all speech languages such as English, Urdu, Chinese, Arabic and French. The designed system provides a quick and reliable way to comprehend speech language context and generate the respective meanings.
The speech recognition rate deteriorates greatly in human-machine interaction when the speaker's speech mixes with a bystander's voice. This paper proposes a time-frequency approach to Blind Source Separation (BSS) for intelligent Human-Machine Interaction (HMI). The main idea of the algorithm is to simultaneously diagonalize the correlation matrices of the pre-whitened signals at different time delays for every frequency bin in the time-frequency domain. The proposed method has two merits: (1) fast convergence speed; (2) a high signal-to-interference ratio for the separated signals. Numerical evaluations are used to compare the performance of the proposed algorithm with two other deconvolution algorithms. An efficient algorithm to resolve the permutation ambiguity is also proposed in this paper. The proposed algorithm saves more than 10% of computational time with properly selected parameters and achieves good performance for both simulated convolutive mixtures and real room-recorded speech.
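The separation principle, whitening followed by diagonalization of a time-delayed correlation matrix, can be illustrated in a stripped-down form (one delay, instantaneous mixing, AMUSE-style) as below; the paper itself jointly diagonalizes several delays per frequency bin for convolutive mixtures.

```python
# Simplified single-delay illustration of second-order BSS; not the paper's algorithm.
import numpy as np

def delayed_corr(x, tau):
    c = x[:, :-tau] @ x[:, tau:].T / (x.shape[1] - tau)
    return 0.5 * (c + c.T)                           # symmetrize

rng = np.random.default_rng(1)
s = np.vstack([np.sin(0.03 * np.arange(5000)),       # two toy sources
               np.sign(np.sin(0.011 * np.arange(5000)))])
x = np.array([[1.0, 0.6], [0.4, 1.0]]) @ s           # assumed mixing matrix

x = x - x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(x @ x.T / x.shape[1])          # whitening from zero-lag correlation
W_white = E @ np.diag(1.0 / np.sqrt(d)) @ E.T
z = W_white @ x
_, V = np.linalg.eigh(delayed_corr(z, tau=5))        # diagonalize one delayed correlation
y = V.T @ z                                          # separated source estimates (up to order/sign)
```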
Audio-visual wake word spotting is a challenging multi-modal task that exploits visual information of lip motion patterns to supplement acoustic speech to improve overall detection performance. However, most audio-visual wake word spotting models are only suitable for simple single-speaker scenarios and require high computational complexity. Further development is hindered by complex multi-person scenarios and computational limitations in mobile environments. In this paper, a novel audio-visual model is proposed for on-device multi-person wake word spotting. Firstly, an attention-based audio-visual voice activity detection module is presented, which generates an attention score matrix of audio and visual representations to derive active speaker representation. Secondly, the knowledge distillation method is introduced to transfer knowledge from the large model to the on-device model to control the size of our model. Moreover, a new audio-visual dataset, PKU-KWS, is collected for sentence-level multi-person wake word spotting. Experimental results on the PKU-KWS dataset show that this approach outperforms the previous state-of-the-art methods.
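The knowledge-distillation step can be sketched as a standard soft-target objective, as below; the temperature and loss weighting are assumptions and the paper's exact distillation objective may differ.

```python
# Standard soft-target knowledge distillation; parameters are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-softened distributions.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # Hard-target term: ordinary cross-entropy on the wake-word labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 2, requires_grad=True)    # dummy batch, 2 classes
teacher_logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```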
Multisource localization occupies an important position in the field of acoustic signal processing and is widely applied in scenarios such as human-machine interaction and spatial acoustic parameter acquisition. The direction-of-arrival (DOA) of a sound source is convenient for rendering spatial sound in the audio metaverse. A multisource localization method for reverberant environments is proposed based on the angle distribution of time-frequency (TF) points using a first-order ambisonics (FOA) microphone. The method is implemented in three steps, illustrated by the sketch that follows. 1) By exploring the angle distribution of TF points, a single-source zone (SSZ) detection method is proposed using a standard deviation based measure, which reveals the degree of convergence of TF point angles in a zone. 2) To reduce the effect of outliers on localization, an outlier removal method is designed to remove the TF points whose angles are far from the real DOAs, where the median angle of each detected zone is adopted to construct the outlier set. 3) DOA estimates of multiple sources are obtained by postprocessing of the angle histogram. Experimental results in both simulated and real scenarios verify the effectiveness of the proposed method in reverberant environments and show that it outperforms the reference methods.
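The three steps can be mimicked on synthetic per-TF-point azimuth estimates as sketched below; the zone size, standard-deviation threshold, outlier margin and histogram resolution are all assumptions rather than the paper's parameter values.

```python
# Hedged sketch of the three localization steps on synthetic TF-point azimuths (degrees).
import numpy as np

def single_source_zones(tf_angles, zone_size=32, std_thresh=10.0):
    """Step 1: keep zones whose TF-point angles converge (small spread)."""
    zones = [tf_angles[i:i + zone_size] for i in range(0, len(tf_angles), zone_size)]
    return [z for z in zones if len(z) == zone_size and np.std(z) < std_thresh]

def remove_outliers(zone, margin=15.0):
    """Step 2: drop TF points far from the zone's median angle."""
    return zone[np.abs(zone - np.median(zone)) < margin]

def doa_from_histogram(angles, bin_width=10.0, n_sources=2):
    """Step 3: take the strongest peaks of the angle histogram as DOA estimates."""
    hist, edges = np.histogram(angles, bins=np.arange(0, 360 + bin_width, bin_width))
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[np.argsort(hist)[-n_sources:]]

rng = np.random.default_rng(2)
zones_raw = []
for _ in range(200):                                   # simulate 200 TF zones
    if rng.random() < 0.3:                             # diffuse/multi-source zone
        zones_raw.append(rng.uniform(0, 360, 32))
    else:                                              # zone dominated by one source
        zones_raw.append(rng.normal(rng.choice([65.0, 155.0]), 3.0, 32))
angles = np.concatenate(zones_raw)
kept = np.concatenate([remove_outliers(z) for z in single_source_zones(angles)])
print(doa_from_histogram(kept))                        # should recover ~65 and ~155 degrees
```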
In this paper, we summarize recent progress made in deep learning based acoustic models and the motivation and insights behind the surveyed techniques. We first discuss models such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs) that can effectively exploit variable-length contextual information, and their various combinations with other models. We then describe models that are optimized end-to-end and emphasize feature representations learned jointly with the rest of the system, the connectionist temporal classification (CTC) criterion, and the attention-based sequence-to-sequence translation model. We further illustrate robustness issues in speech recognition systems, and discuss acoustic model adaptation, speech enhancement and separation, and robust training strategies. We also cover modeling techniques that lead to more efficient decoding and discuss possible future directions in acoustic model research.
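As a concrete reference point for the CTC criterion mentioned above, the snippet below shows the loss as exposed by PyTorch; the tensor shapes follow torch.nn.CTCLoss conventions and the sizes are illustrative.

```python
# Minimal CTC loss example with dummy data; sizes are illustrative only.
import torch
import torch.nn as nn

T, N, C = 50, 4, 28                      # time steps, batch, classes (incl. blank = 0)
log_probs = torch.randn(T, N, C).log_softmax(dim=-1).requires_grad_()
targets = torch.randint(1, C, (N, 12))   # padded label sequences (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                          # gradients flow back to the acoustic model
```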
Voice conversion algorithms aim to provide a high level of similarity to the target voice with an acceptable level of quality. The main objective of this paper was to build a nonlinear relationship between the parameters of the acoustic features of the source and target speakers using Non-Linear Canonical Correlation Analysis (NLCCA) based on a joint Gaussian mixture model. Speaker individuality transformation was achieved mainly by altering the vocal tract characteristics represented by Line Spectral Frequencies (LSF). To obtain transformed speech that sounds more like the target voice, prosody modification is involved through residual prediction. Both objective and subjective evaluations were conducted. The experimental results demonstrated that the proposed algorithm was effective and outperformed the conventional conversion method based on Minimum Mean Square Error (MMSE) estimation.
A novel algorithm for voice conversion is proposed in this paper. The mapping function between the spectral vectors of the source and target speakers is calculated by Canonical Correlation Analysis (CCA) estimation based on Gaussian mixture models. Since the spectral envelope feature retains the majority of the second-order statistical information contained in speech after Linear Prediction Coding (LPC) analysis, the CCA method is more suitable for spectral conversion than Minimum Mean Square Error (MMSE) estimation, because CCA explicitly considers the variance of each component of the spectral vectors during the conversion procedure. Both objective evaluations and subjective listening tests were conducted. The experimental results demonstrate that the proposed scheme achieves better performance than the previous method, which uses the MMSE estimation criterion.
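A plain linear stand-in for the CCA-based spectral mapping can be written with scikit-learn as below; it learns to predict target-speaker spectral vectors from time-aligned source-speaker vectors and omits the paper's GMM component.

```python
# Linear CCA mapping between source and target spectral vectors; synthetic data only.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(3)
# Hypothetical time-aligned training pairs of spectral envelope features (e.g. LSFs).
X_src = rng.standard_normal((500, 20))
Y_tgt = X_src @ rng.standard_normal((20, 20)) * 0.5 + 0.1 * rng.standard_normal((500, 20))

cca = CCA(n_components=8)
cca.fit(X_src, Y_tgt)
Y_converted = cca.predict(X_src)            # converted spectral vectors
print(Y_converted.shape)                    # (500, 20)
```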
This paper presents a new online incremental training algorithm for the Gaussian mixture model (GMM), which performs expectation-maximization (EM) training incrementally, updating the GMM parameters online sample by sample instead of waiting for a block of data of sufficient size to start training as in the traditional EM procedure. The proposed method is extended from the split-and-merge EM procedure, so it is also inherently capable of escaping from local maxima and reducing the chances of singularities. In the application domain, the algorithm is optimized in the context of speech processing applications. Experiments on synthetic data show the advantage and efficiency of the new method, and the results in a speech processing task also confirm the improvement of system performance.
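The idea of updating GMM parameters sample by sample can be illustrated with a stepwise (online) EM scheme over running sufficient statistics, as sketched below; the paper's split-and-merge extension is not reproduced here.

```python
# Simplified stepwise (online) EM for a 1-D, two-component GMM; toy data only.
import numpy as np

rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(-2, 0.5, 2000), rng.normal(3, 1.0, 2000)])
rng.shuffle(data)

K = 2
w = np.full(K, 1.0 / K)                      # mixture weights
mu = np.array([-1.0, 1.0])                   # initial means
var = np.ones(K)                             # initial variances
s0, s1, s2 = w.copy(), w * mu, w * (var + mu ** 2)   # running sufficient statistics

for t, x in enumerate(data, start=1):
    eta = 1.0 / (t + 10)                     # decaying step size
    resp = w * np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    resp /= resp.sum()                       # E-step for this single sample
    s0 = (1 - eta) * s0 + eta * resp         # stepwise M-step via running statistics
    s1 = (1 - eta) * s1 + eta * resp * x
    s2 = (1 - eta) * s2 + eta * resp * x * x
    w, mu, var = s0, s1 / s0, np.maximum(s2 / s0 - (s1 / s0) ** 2, 1e-6)

print(w, mu, var)                            # should approach the two true components
```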
Sample entropy can reflect changes in the level of new information in a signal sequence as well as the amount of that new information. Using sample entropy as the feature for speech classification, the paper first extracts the sample entropy of the mixed signal, then calculates the mean and variance of the sample entropy for each signal, and finally uses K-means clustering for recognition. The simulation results show that the recognition rate can be increased to 89.2% based on sample entropy.
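A short sketch of the described pipeline, frame-wise sample entropy followed by K-means clustering of its mean and variance, is given below; the embedding dimension m and tolerance r are conventional defaults, not values taken from the paper.

```python
# Frame-wise sample entropy features clustered with K-means; toy signals only.
import numpy as np
from sklearn.cluster import KMeans

def sample_entropy(x, m=2, r_factor=0.2):
    x = np.asarray(x, dtype=float)
    r = r_factor * np.std(x)
    def count_matches(mm):
        templates = np.array([x[i:i + mm] for i in range(len(x) - mm)])
        d = np.max(np.abs(templates[:, None, :] - templates[None, :, :]), axis=-1)
        return np.sum(d <= r) - len(templates)        # exclude self-matches
    B, A = count_matches(m), count_matches(m + 1)
    if B == 0:
        return 0.0
    if A == 0:
        return np.log(B)                              # cap when no (m+1)-length matches
    return -np.log(A / B)

rng = np.random.default_rng(4)
signals = [np.sin(0.05 * np.arange(400)) + 0.05 * rng.standard_normal(400),
           rng.standard_normal(400)]                  # toy "speech-like" vs "noise"
features = []
for s in signals:
    frames = s.reshape(-1, 100)                       # non-overlapping frames
    ent = np.array([sample_entropy(f) for f in frames])
    features.append([ent.mean(), ent.var()])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(np.array(features))
print(labels)
```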
This paper presents current research in the low-power Very Large Scale Integration (VLSI) domain. Low power has become a much sought-after research topic in the electronics industry, and power dissipation is one of the most important considerations when designing a VLSI chip. Today almost all high-speed switching devices include Ternary Content Addressable Memory (TCAM) as one of their most important features. When a device consumes less power, it becomes more reliable and works more efficiently. Complementary Metal Oxide Semiconductor (CMOS) technology is best known for low-power devices. This paper aims at designing a router application device which consumes less power and works more efficiently. Various strategies, methodologies and power management techniques for low-power circuits and systems are discussed, along with the challenges that might be met while designing low-power, high-performance circuits. This work develops a Data Aware AND-type match line architecture for TCAM. A 256 × 128 TCAM macro was designed using the Cadence Advanced Development Environment (ADE) with a 90 nm technology file from Taiwan Semiconductor Manufacturing Company (TSMC). The results show that the proposed Data Aware architecture provides around 35% speed and 45% power improvement over the existing architecture.
An enhanced relative spectral (E_RASTA) technique for speech and speaker recognition is proposed. The new method consists of classical RASTA filtering in the logarithmic spectral domain followed by another additive RASTA filtering in the same domain. In this manner, both the channel distortion and the additive noise are removed effectively. In speaker identification and speech recognition experiments on the TI46 database, E_RASTA performs as well as or better than J_RASTA in both tasks. E_RASTA needs neither an estimate of the speech SNR, required by J_RASTA to determine the optimal value of J, nor information on how the speech is degraded. The choice of the E_RASTA filter also indicates that the low temporal modulation components in speech can deteriorate the performance of both recognition tasks. Besides, speaker recognition needs a narrower temporal modulation frequency band than speech recognition.
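Plain RASTA filtering of a log-spectral trajectory, the building block that E_RASTA applies twice, can be sketched as below using the commonly cited band-pass transfer function H(z) ≈ 0.1 (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1); some implementations use a pole of 0.94, and the paper's exact filter is not reproduced here.

```python
# RASTA band-pass filtering along the frame (time) axis of a log-spectral trajectory.
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_spec, pole=0.98):
    """Filter each spectral channel over time; log_spec has shape (n_frames, n_channels)."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    a = np.array([1.0, -pole])
    return lfilter(b, a, log_spec, axis=0)

# Toy trajectory: slow channel drift plus a faster speech-like modulation (100 Hz frame rate).
t = np.arange(200)
channel = 0.5 + 0.002 * t                       # slow convolutional (channel) component
speech = 0.3 * np.sin(2 * np.pi * t / 25)       # ~4 Hz modulation
filtered = rasta_filter((channel + speech)[:, None])
# The slow drift is strongly attenuated while the ~4 Hz modulation is retained.
```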
The Laboratory of Acoustics, Speech and Signal Processing (LASSP), the unique and superior national key laboratory of ASSP in China, has been founded at the Institute of Acoustics, Academia Sinica, Beijing, PRC. After three years of effort, the construction of the LASSP has been completed successfully, and a certain capability of performing frontier research projects in the fundamental theory and applied technology of sound fields and acoustic signal processing has been formed. A flexible and complete experimental acoustic signal processing system has been set up in the LASSP. With the remarkable advantages of real-time signal processing and resource sharing, a wide range of research projects in the field of ASSP can be conducted in the laboratory. The Signal Processing Center of the LASSP is well equipped with many computer research facilities, including the
The 4th National Conference on Speech, Image, Communication and Signal Processing, which was sponsored by the Institute of Speech, Hearing, and Music Acoustics of the Acoustical Society of China and the Institute of Signal Processing of the Electronic Society of China, was held 25-27 October 1989 at the Beijing Institute of Post and Telecommunication. The conference drew a registration of 150 from different places in the country, which made it the largest such conference in the last eight years. The president of the Institute of Speech, Hearing, and Music Acoustics, ASC, Professor ZHANG Jialu, made an opening speech at the opening session, and the honorary president of the Acoustical Society of China, Professor MAA Dah-You, and the president of