
Visually Guided Binaural Audio Generation Method Based on Hierarchical Feature Encoding and Decoding
Abstract Visually guided binaural audio generation is an important multimodal learning task with broad application value. Given visual information and mono audio, the goal is to generate binaural audio that is audiovisually consistent. Existing methods produce unsatisfactory binaural audio because they underuse audiovisual information in the encoding stage and neglect shallow features in the decoding stage. To address these problems, this study proposes a visually guided binaural audio generation method based on hierarchical feature encoding and decoding, which effectively improves generation quality. To narrow the heterogeneity gap that hinders the association and fusion of audio and visual data, an encoder structure based on hierarchical encoding and fusion of audiovisual features is proposed, which improves the overall utilization of audiovisual data in the encoding stage. To make effective use of shallow structural feature information during decoding, a decoder structure with skip connections between feature layers of different depths, from deep to shallow, is constructed, which makes full use of both the shallow detail features and the deep features of the audiovisual information. Benefiting from the efficient use of audiovisual information and the hierarchical combination of deep and shallow structural features, the proposed method can effectively handle binaural audio generation in complex visual scenes. Compared with existing methods, the proposed method improves generation performance by over 6% in terms of realism, among other aspects.
Authors WANG Rui-Qi (王睿琦); CHENG Hao-Nan (程皓楠); YE Long (叶龙) (Key Laboratory of Media Audio & Video (Communication University of China), Ministry of Education, Beijing 100024, China; State Key Laboratory of Media Convergence and Communication (Communication University of China), Beijing 100024, China)
Source Journal of Software (《软件学报》); indexed in EI, CSCD, and the Peking University Core Journals list; 2024, Issue 5, pp. 2165-2175 (11 pages)
Funding National Natural Science Foundation of China (61971383, 62201524); National Key Research and Development Program of China (2021YFF0900504).
Keywords binaural audio; visually guided audio generation; hierarchical feature encoding and decoding; multimodal learning; skip connection
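
The abstract describes two structural ideas: fusing visual and audio features at every encoder level, and a decoder that walks from deep to shallow while merging shallower features through skip connections. The toy sketch below only illustrates that data flow. It is not the authors' network: plain Python lists stand in for feature maps, averaging and addition stand in for learned layers, and every function name (`downsample`, `upsample`, `fuse`, `encode`, `decode`) is hypothetical.

```python
# Toy illustration only; all names and operations are assumptions, not from the paper.

def downsample(x):
    """Halve the length by averaging neighbouring pairs."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]

def upsample(x):
    """Double the length by repeating each value."""
    return [v for v in x for _ in range(2)]

def fuse(audio_feat, visual_feat):
    """Stand-in for a learned fusion: add the mean visual value to each audio value."""
    bias = sum(visual_feat) / len(visual_feat)
    return [a + bias for a in audio_feat]

def encode(audio, visual_levels):
    """Fuse visual information at every encoder level, not only at the deepest one."""
    feats = []
    x = audio
    for visual_feat in visual_levels:
        x = fuse(downsample(x), visual_feat)
        feats.append(x)  # kept so the decoder can reuse shallow features
    return feats

def decode(feats):
    """Go from the deepest feature to the shallowest, merging each skip feature."""
    x = feats[-1]
    for skip in reversed(feats[:-1]):
        x = upsample(x)
        x = [(u + s) / 2 for u, s in zip(x, skip)]  # skip connection (averaged here)
    return upsample(x)  # back to the input length

audio = [1.0] * 8                      # mono audio feature of length 8
visual_levels = [[0.0], [0.0], [2.0]]  # one visual feature per encoder level
output = decode(encode(audio, visual_levels))
print(len(output))
```

With three encoder levels the feature lengths are 4, 2, and 1, and the decoder restores the original length of 8. A real model would replace the averaging and addition with learned convolutional and fusion layers.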
