Abstract
Most existing audio-visual speech separation models simply concatenate video and audio features without fully modeling the relationships between the modalities, so the visual information is underexploited. To address this problem, this paper fully exploits the correlations between visual and audio features and proposes a cross-modal-fusion optical Flow-Audio-Visual Speech Separation (Flow-AVSS) model, which combines a multi-head attention mechanism with the dense optical flow (Farneback) algorithm and a U-Net network. The model extracts motion features with the Farneback algorithm and lip features with the lightweight network ShuffleNet v2, combines the motion and lip features through an affine transformation, and passes the result through a Temporal Convolutional Network (TCN) module to obtain the visual features. To make full use of the visual information, feature fusion employs multi-head attention to fuse the visual and audio features across modalities; the fused audio-visual features are then fed into a U-Net separation network to obtain the separated speech. Experiments are conducted on the AVSpeech dataset using the Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Source-to-Distortion Ratio (SDR) metrics. The results show that, compared with an audio-only separation network and an audio-visual separation network based on simple feature concatenation, the proposed method improves performance by 2.23 dB and 1.68 dB, respectively. This indicates that feature fusion based on cross-modal attention exploits the inter-modal correlations more fully, and that the added lip-motion features effectively improve the robustness of the video features and the separation performance.
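The two stages named in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: the function names, the feature dimension (256), the head count (8), and the choice of audio features as the attention queries are illustrative assumptions. It shows Farneback dense optical flow via OpenCV for the motion stream and a multi-head attention block in PyTorch for cross-modal audio-visual fusion.

```python
import cv2
import numpy as np
import torch
import torch.nn as nn

def farneback_motion_features(frames):
    """Dense optical flow between consecutive grayscale lip frames
    (a list of HxW uint8 arrays), as a stand-in for the motion stream."""
    flows = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        # Positional args: pyr_scale, levels, winsize, iterations,
        # poly_n, poly_sigma, flags (OpenCV default-style settings).
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)            # HxWx2 per-pixel (dx, dy)
    return np.stack(flows)            # (T-1, H, W, 2)

class CrossModalAttentionFusion(nn.Module):
    """Fuse visual features into the audio stream with multi-head
    attention; here the audio frames act as queries (an assumption)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, Ta, dim); visual_feats: (B, Tv, dim)
        fused, _ = self.attn(query=audio_feats,
                             key=visual_feats, value=visual_feats)
        return self.norm(audio_feats + fused)   # residual fusion

# Example: fuse 100 audio frames with 25 video frames (batch of 2).
fusion = CrossModalAttentionFusion(dim=256, heads=8)
audio = torch.randn(2, 100, 256)
video = torch.randn(2, 25, 256)
out = fusion(audio, video)                      # (2, 100, 256)
```

The residual connection keeps the audio stream intact when the visual cue is uninformative, which is one plausible reading of why attention-based fusion outperforms plain concatenation in the reported results.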
Authors
LAN Chaofeng; JIANG Pengwei; CHEN Huan; HAN Chuang; GUO Xiaoxia (School of Measurement and Control Technology and Communication Engineering, Harbin University of Science and Technology, Harbin 150080, China; China Ship Design and Research Center, Wuhan 430064, China)
Source
《电子与信息学报》 (Journal of Electronics & Information Technology)
Indexed in EI, CSCD, and the Peking University Core Journal list (北大核心)
2023, Issue 10, pp. 3538-3546 (9 pages)
Funding
National Natural Science Foundation of China (11804068)
Natural Science Foundation of Heilongjiang Province (LH2020F033)
Keywords
Audio-Visual Speech Separation (AVSS)
Audio-visual fusion
Cross-modal attention
Optical flow algorithm