Abstract
Asynchrony between speech and lip motion is a key issue in multimodal fusion for Audio-Visual Speech Recognition (AVSR). This paper first introduces a Multi-Stream Asynchrony Dynamic Bayesian Network (MS-ADBN) model, which describes the asynchrony between the audio and visual streams at the word level and uses a word-phone hierarchy in both streams. The Multi-stream Multi-states Asynchrony DBN (MM-ADBN) model, an extension of MS-ADBN, is then proposed for large-vocabulary AVSR; it adopts a word-phone-state hierarchy in both the audio and visual streams. In essence, MS-ADBN is a whole-word model, whereas MM-ADBN is a phone model whose basic recognition units are phones, which makes it suitable for large-vocabulary continuous speech recognition. Experiments were carried out on small-vocabulary and large-vocabulary audio-visual databases. On the large-vocabulary continuous audio-visual database, in a clean speech environment, MM-ADBN improves the recognition rate by 35.91% over MS-ADBN and by 9.97% over the multi-stream HMM (MSHMM), which shows that describing asynchrony is important for AVSR systems.
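To make the notion of word-level asynchrony concrete, the following toy sketch may help. It is not the paper's DBN: it is a plain Viterbi alignment over made-up per-frame log-likelihoods, with no transition probabilities, stream weights, or state level. The function `best_alignment` and the arrays `audio_ll` and `video_ll` are hypothetical names introduced only for illustration. The sketch compares a synchronous model, in which both streams must share one phone segmentation, with a word-level asynchronous one, in which each stream chooses its own segmentation and the two only meet at the word boundaries.

```python
NEG_INF = float("-inf")

def best_alignment(ll):
    """Best left-to-right alignment score of T frames to P phones.

    ll[t][p] is the log-likelihood of frame t under phone p (toy values).
    The path starts in phone 0, ends in phone P-1, and at each frame
    either stays in the current phone or advances to the next one.
    """
    num_frames, num_phones = len(ll), len(ll[0])
    prev = [NEG_INF] * num_phones
    prev[0] = ll[0][0]
    for t in range(1, num_frames):
        cur = [NEG_INF] * num_phones
        for p in range(num_phones):
            stay = prev[p]
            advance = prev[p - 1] if p > 0 else NEG_INF
            cur[p] = ll[t][p] + max(stay, advance)
        prev = cur
    return prev[num_phones - 1]

# Assumed toy data: one word, 2 phones, 4 frames.
# The audio stream prefers to switch phones after frame 1,
# the video stream one frame earlier (lips moving ahead of the sound).
audio_ll = [[-1, -5], [-1, -5], [-5, -1], [-5, -1]]
video_ll = [[-1, -5], [-5, -1], [-5, -1], [-5, -1]]

# Synchronous: one shared segmentation, so frame scores are summed first.
summed_ll = [[a + v for a, v in zip(fa, fv)]
             for fa, fv in zip(audio_ll, video_ll)]
sync_score = best_alignment(summed_ll)

# Word-level asynchrony: each stream is aligned on its own; the streams
# are only forced to agree on where the word starts and ends.
async_score = best_alignment(audio_ll) + best_alignment(video_ll)

print("synchronous:", sync_score)
print("word-level asynchronous:", async_score)
```

Because the asynchronous score maximises each stream separately, it can never be lower than the synchronous score; the gap grows when the two streams disagree about where the phone boundary lies.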
Source
Journal of Electronics & Information Technology (《电子与信息学报》)
EI
CSCD
PKU Core (北大核心)
2008, No. 12, pp. 2906-2911 (6 pages)
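Funding
Supported by the science and technology cooperation project between the Ministry of Science and Technology of China and the Flemish Region of Belgium ([2004]487)
Supported by the Northwestern Polytechnical University Talent Cultivation Program (04XD0102)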
Keywords
Speech recognition
Dynamic Bayesian Network (DBN)
Audio-visual
Multi-stream asynchrony