Visually Guided Binaural Audio Generation Method Based on Hierarchical Feature Encoding and Decoding


Abstract: Visually guided binaural audio generation is an important multimodal learning task with broad applications. Given visual information and mono audio, the goal is to generate binaural audio that is audiovisually consistent. Existing methods produce unsatisfactory results because they underuse audiovisual information in the encoding stage and neglect shallow features in the decoding stage. To address these problems, this study proposes a visually guided binaural audio generation method based on hierarchical feature encoding and decoding, which improves the quality of the generated binaural audio. To narrow the heterogeneity gap that hinders the association and fusion of audio and visual data, an encoder that hierarchically encodes and fuses audiovisual features is proposed, making fuller use of both modalities in the encoding stage. To make effective use of shallow structural features during decoding, a decoder with skip connections between feature layers of different depths, from deep to shallow, is constructed, so that both the shallow detail features and the deep features of the audiovisual information are fully used. Owing to this efficient use of audiovisual information and the hierarchical combination of deep and shallow structural features, the proposed method can effectively handle binaural audio generation in complex visual scenes. Compared with existing methods, it improves generation performance by more than 6% in aspects such as realism.
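The two structural ideas in the abstract (fusing visual information at every encoder depth, and decoding from deep to shallow with skip connections) can be illustrated with a toy sketch. This is not the paper's network: the function names, the pair-averaging and value-repeating resampling, the additive fusion, and the three-level depth are all assumptions chosen only to show how the data flows.

```python
# Toy sketch only: layer count, pooling, and additive fusion are assumptions,
# not the architecture described in the paper.

def downsample(x):
    """Halve the length by averaging adjacent pairs (stand-in for a strided conv)."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]

def upsample(x):
    """Double the length by repeating each value (stand-in for a transposed conv)."""
    return [v for v in x for _ in range(2)]

def fuse(audio_feat, visual_feat):
    """Hypothetical fusion: add this level's visual scalar to every audio value."""
    return [a + visual_feat for a in audio_feat]

def encode(audio, visual_per_level):
    """Hierarchical encoding: inject visual information at every depth."""
    feats = []
    x = audio
    for v in visual_per_level:
        x = fuse(downsample(x), v)
        feats.append(x)
    return feats  # ordered shallow -> deep

def decode(feats):
    """Decode deep -> shallow, adding the matching encoder feature as a skip."""
    x = feats[-1]
    for skip in reversed(feats[:-1]):
        x = upsample(x)
        x = [a + s for a, s in zip(x, skip)]
    return upsample(x)  # back to the input length

if __name__ == "__main__":
    mono = [float(i) for i in range(16)]   # placeholder mono signal
    visual = [0.1, 0.2, 0.3]               # placeholder per-level visual features
    feats = encode(mono, visual)
    out = decode(feats)
    print([len(f) for f in feats], len(out))  # [8, 4, 2] 16
```

In this sketch, removing the skip additions in `decode` would leave the output dependent only on the deepest two-value feature, which is the loss of shallow detail the abstract attributes to existing decoders.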

Authors: WANG Rui-Qi; CHENG Hao-Nan; YE Long (Key Laboratory of Media Audio & Video (Communication University of China), Ministry of Education, Beijing 100024, China; State Key Laboratory of Media Convergence and Communication (Communication University of China), Beijing 100024, China)

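Affiliations: Key Laboratory of Media Audio & Video, Ministry of Education (Communication University of China); State Key Laboratory of Media Convergence and Communication (Communication University of China)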

Source: Journal of Software (《软件学报》), indexed in EI, CSCD, and the PKU Core Journal list; 2024, Issue 5, pp. 2165-2175 (11 pages)

Funding: National Natural Science Foundation of China (61971383, 62201524); National Key Research and Development Program of China (2021YFF0900504).

Keywords: binaural audio; visually guided audio generation; hierarchical feature encoding and decoding; multimodal learning; skip connection

Classification number: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]

