
Experimental Research on Audio-Visual Fusion and Model Asynchrony for Raising the Speech Recognition Rate (Cited by: 3)
Abstract  This paper studies two audio-visual fusion methods, early integration and late integration, in a bimodal speech recognition system; in the late-integration method, a composite model that accounts for the synchrony and asynchrony of the audio and visual modalities is introduced. Simulation experiments show that, in noisy acoustic environments, late integration yields better recognition performance, and that modeling audio-visual synchrony and asynchrony effectively raises the recognition rate. Researchers have become increasingly interested in raising speech recognition rates in noisy acoustic environments. Human speech recognition far outperforms machine recognition, partly because human auditory perception is aided by visual perception, with a natural degree of asynchrony between the two modalities; this has motivated research, including ours, on audio-visual fusion and model asynchrony. This paper reports progress in this area. Section 1 uses Fig.1 to discuss two methods of fusing audio and visual sensors: early integration and late integration. Section 2 introduces four model topologies (Fig.2) that reflect the bimodal (audio-visual) fusion of human speech perception, and simulates asynchrony with Product HMMs (Hidden Markov Models). Fig.2(d) presents a simplified Product HMM topology that adopts a stream state tying scheme to obtain robust parameter estimates while restricting the asynchrony to a single state of the phoneme HMM. Section 3 reports detailed speech recognition experiments on the AVTC (Audio Visual Telephone Conversations) corpus under various simulated SNR (Signal-to-Noise Ratio) conditions for six systems, including early- and late-integration systems. The results in Fig.3, in terms of recognition rate, indicate that late integration brings better recognition performance than early integration in noisy acoustic conditions. Fig.3 also shows that the state-asynchronous system outperforms the state-synchronous system when SNR > 12 dB, and that the state-synchronous system outperforms the state-asynchronous system when SNR < 12 dB. We believe these findings are of some help in raising speech recognition rates in noisy acoustic environments.
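The late-integration idea described in the abstract — scoring the audio and visual streams with separate models and combining their log-likelihoods with a stream weight — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy one-dimensional Gaussian scorers, the word models, and the stream weight `lam = 0.7` are all assumptions made for demonstration.

```python
import math

def gaussian_loglik(x, mean, var):
    # Log-likelihood of a scalar feature under a 1-D Gaussian.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def late_integration_decision(audio_feat, visual_feat, models, lam=0.7):
    """Score each word with separate audio and visual models, then
    combine the per-stream log-likelihoods with stream weight lam
    (lam weights the audio stream, 1 - lam the visual stream)."""
    best, best_score = None, float("-inf")
    for word, m in models.items():
        la = gaussian_loglik(audio_feat, m["a_mean"], m["a_var"])
        lv = gaussian_loglik(visual_feat, m["v_mean"], m["v_var"])
        score = lam * la + (1 - lam) * lv
        if score > best_score:
            best, best_score = word, score
    return best

# Hypothetical two-word vocabulary with per-stream Gaussian models.
models = {
    "yes": {"a_mean": 1.0, "a_var": 0.5, "v_mean": 2.0, "v_var": 0.5},
    "no":  {"a_mean": -1.0, "a_var": 0.5, "v_mean": -2.0, "v_var": 0.5},
}
decision = late_integration_decision(0.9, 1.8, models)  # features near "yes"
```

In early integration, by contrast, the audio and visual feature vectors would be concatenated before a single model scores them; late integration keeps the streams separate until the decision level, which is what lets the stream weight be adapted to the acoustic SNR.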
Source  Journal of Northwestern Polytechnical University (《西北工业大学学报》), indexed in EI, CAS, CSCD, and the Peking University Core list, 2004, Issue 2, pp. 171-175 (5 pages)
Funding  Science and technology cooperation project between the Ministry of Science and Technology of China and the Flemish Region of Belgium (Grant No. 国科外字 19990209)
Keywords  speech recognition, bimodal speech recognition, audio-visual fusion, model asynchrony

References (8)

  • 1 Lippmann R P. Speech Recognition by Machines and Humans. Speech Communication, 1997, 22(1): 1-15
  • 2 Chibelushi C C, et al. A Review of Speech-Based Bimodal Recognition. IEEE Trans on Multimedia, 2002, 4(1): 23-37
  • 3 Hall D L. Mathematical Techniques in Multisensor Data Fusion. Norwood: Artech House, 1992: 18-22
  • 4 Bourlard H, et al. Multi-Stream Speech Recognition. Technical Report IDIAP-RR96-07, IDIAP, 1996
  • 5 Varga P, Moore R K. Hidden Markov Model Decomposition of Speech and Noise. Proc International Conference on Acoustics, Speech and Signal Processing, Albuquerque, USA, 1990: 845-848
  • 6 Young S J, et al. The HTK Book. http://htk.eng.cam.ac.uk/docs/docs.shtml, 2002
  • 7 Ravyse I, Reinders M, Cornelis J, Sahli H. Eye Gesture Estimation. Proc Signal Processing Symposium of IEEE Benelux Signal Processing Chapter, Hilvarenbeek, The Netherlands, 2000: 4-7
  • 8 Gravier G, Potamianos G, Neti C. Asynchrony Modeling for Audio-Visual Speech Recognition. Proc Human Language Technology Conference, San Diego, USA, 2002: 325-328
