Abstract
Most existing videos contain only mono audio and lack the sense of spatial immersion that two-channel (binaural) audio provides. To address this issue, this paper proposes a method for generating two-channel audio based on multimodal perception. Building on an analysis of the visual information in the video, it fuses the video's spatial information with the audio content and automatically adds spatialization cues to the original mono audio, generating two-channel audio that is closer to the real auditory experience. We first encode the mono video with an improved audio-video fusion analysis network that uses an encoder-decoder structure; we then fuse the video features and audio features at multiple scales and analyze the two modalities jointly, so that the generated two-channel audio carries spatial information that the original mono audio does not have; finally, the network outputs the two-channel audio corresponding to the video. Experimental results on a public dataset show that our method outperforms existing models on two-channel audio generation, with improvements in both the STFT distance and the ENV distance metrics.
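The two evaluation metrics named in the abstract, STFT distance and ENV (envelope) distance, are commonly computed as Euclidean distances between, respectively, the complex spectrograms and the Hilbert-transform amplitude envelopes of the predicted and ground-truth channels. The sketch below shows one such formulation; the window length, hop size, and normalization used in the paper are assumptions, not taken from the source.

```python
import numpy as np
from scipy.signal import stft, hilbert

def stft_distance(pred, gt, fs=16000, nperseg=512):
    """Euclidean distance between the complex STFTs of two waveforms.

    Window/hop parameters here are illustrative defaults, not the
    paper's settings.
    """
    _, _, spec_pred = stft(pred, fs=fs, nperseg=nperseg)
    _, _, spec_gt = stft(gt, fs=fs, nperseg=nperseg)
    return float(np.sqrt(np.sum(np.abs(spec_pred - spec_gt) ** 2)))

def env_distance(pred, gt):
    """Euclidean distance between amplitude envelopes.

    The envelope is taken as the magnitude of the analytic signal
    obtained via the Hilbert transform.
    """
    env_pred = np.abs(hilbert(pred))
    env_gt = np.abs(hilbert(gt))
    return float(np.sqrt(np.sum((env_pred - env_gt) ** 2)))
```

In a binaural-generation evaluation these distances are typically averaged over the left and right channels (or over the channel-difference signal) across the test set; lower values indicate predictions closer to the ground-truth spatial audio.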
Authors
GUAN Li (官丽); YIN Kang (尹康); FAN Meng-jia (樊梦佳); XUE Kun (薛昆); XIE Kai (解凯)
(Beijing Electric Power Corporation, Beijing 100031, China; NR Electric Co., Ltd., Nanjing, Jiangsu 211102, China)
Source
Computing Technology and Automation (《计算技术与自动化》), 2022, No. 4, pp. 157-165 (9 pages)