Multi-Feature Fusion Audio-Visual Joint Speech Separation Algorithm Based on Conv-TasNet

Abstract: Audio-visual multimodal modeling has been shown to perform well on speech separation tasks. This paper proposes a speech separation model that improves on existing time-domain audio-visual joint speech separation algorithms by strengthening the connection between the audio and visual streams. To address the weak coupling in existing audio-visual separation models, the authors propose an end-to-end separation model that fuses the speech features with the additionally input visual features multiple times in the time domain and adds vertical weight sharing. Experimental results on the GRID dataset show that, compared with the audio-only time-domain convolutional speech separation network (Conv-TasNet) and the audio-visual joint Conv-TasNet, the proposed network achieves improvements of 1.2 dB and 0.4 dB, respectively.
Authors: XU Liang; WANG Jing; YANG Wenjing; LUO Yiyu (School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China)
Source: Journal of Signal Processing (《信号处理》), CSCD, Peking University Core Journal, 2021, Issue 10, pp. 1799-1805 (7 pages)
Funding: National Natural Science Foundation of China (62071039, 61620106002)
Keywords: speech separation; deep neural network; multi-feature fusion; audio-visual joint
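
The abstract outlines the key architectural idea: the visual stream is fused with the time-domain audio representation several times along the separation network, and the fusion stages share weights vertically. Below is a minimal PyTorch sketch of this kind of repeated audio-visual fusion with a shared stage; every name and dimension here (FusionTCNStage, AVSeparatorSketch, the channel counts) is an illustrative assumption, not the paper's actual implementation.

# Illustrative sketch only: module names, channel sizes, and the fusion
# strategy below are assumptions, not the paper's published implementation.
import torch
import torch.nn as nn


class FusionTCNStage(nn.Module):
    """One temporal-convolution stage that re-injects visual features.

    The visual embedding (e.g., lip-region features at video frame rate)
    is upsampled to the audio frame rate, concatenated channel-wise with
    the audio representation, and projected back to the audio width.
    """

    def __init__(self, audio_ch: int, visual_ch: int):
        super().__init__()
        self.project = nn.Conv1d(audio_ch + visual_ch, audio_ch, kernel_size=1)
        self.tcn = nn.Sequential(
            nn.Conv1d(audio_ch, audio_ch, kernel_size=3, padding=1),
            nn.PReLU(),
            nn.GroupNorm(1, audio_ch),
        )

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # Match the (coarser) video time axis to the audio time axis.
        visual_up = nn.functional.interpolate(
            visual, size=audio.shape[-1], mode="nearest"
        )
        fused = self.project(torch.cat([audio, visual_up], dim=1))
        return audio + self.tcn(fused)  # residual path, as in Conv-TasNet blocks


class AVSeparatorSketch(nn.Module):
    """Applies one shared fusion stage repeatedly, so the visual features are
    fused multiple times while the stage weights are shared across depth."""

    def __init__(self, audio_ch: int = 256, visual_ch: int = 64, repeats: int = 4):
        super().__init__()
        self.stage = FusionTCNStage(audio_ch, visual_ch)  # single shared module
        self.repeats = repeats

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        for _ in range(self.repeats):
            audio = self.stage(audio, visual)
        return audio


if __name__ == "__main__":
    audio = torch.randn(1, 256, 1000)  # (batch, channels, audio time steps)
    visual = torch.randn(1, 64, 75)    # (batch, channels, video frames)
    print(AVSeparatorSketch()(audio, visual).shape)  # torch.Size([1, 256, 1000])

Reusing a single FusionTCNStage instance across the repeats is one plausible reading of the "vertical weight sharing" mentioned in the abstract: the parameter count stays constant while the visual features are re-injected at every depth.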
