
Global Feature Oriented Monocular Depth Estimation Based on Transformer Architecture
Abstract: To address depth estimation errors caused by the limited global feature extraction of convolutional neural networks (CNNs), a global-feature-oriented deep learning network is proposed for monocular depth estimation. The network adopts an end-to-end encoder-decoder architecture: the encoder is a Transformer with multi-stage outputs that extracts multi-scale global features, and the decoder is built from CNN layers. In addition, to suppress the influence of depth-irrelevant detail information, a large kernel attention (LKA) module is applied at the end of the decoder to strengthen global feature extraction. Experimental results on the outdoor dataset KITTI and the indoor dataset NYU Depth v2 show that the global-feature-oriented network helps to generate high-accuracy depth maps with complete detail features. Compared with AdaBins, a recently proposed method also based on a CNN-Transformer design, the proposed network reduces the number of parameters by 42.31% and the root mean square error (RMSE) by about 2%.
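The abstract outlines a multi-stage Transformer encoder, a CNN decoder with skip connections, and an LKA block at the decoder's end. Below is a minimal PyTorch sketch of that pipeline, assuming the LKA decomposition of the original Visual Attention Network (5x5 depth-wise conv, 7x7 depth-wise dilated conv with dilation 3, 1x1 point-wise conv); the class names, channel widths, and fusion scheme are illustrative assumptions, not the authors' published configuration.

```python
# Minimal sketch; channel widths, kernel sizes, and the fusion scheme are
# assumptions for illustration, not the paper's exact configuration.
import torch
import torch.nn as nn


class LKA(nn.Module):
    """Large kernel attention: depth-wise conv + depth-wise dilated conv
    + point-wise conv, used as a multiplicative attention map
    (following the original Visual Attention Network formulation)."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.dw_dilated = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return x * attn  # element-wise re-weighting of the input features


class DecoderBlock(nn.Module):
    """Upsample, concatenate an encoder skip feature, and fuse with a CNN."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = nn.functional.interpolate(x, size=skip.shape[2:], mode="bilinear",
                                      align_corners=False)
        return self.fuse(torch.cat([x, skip], dim=1))


class GlobalFeatureDepthNet(nn.Module):
    """Encoder-decoder sketch: a multi-stage Transformer encoder (assumed to
    return 4 feature maps at strides 4/8/16/32), a CNN decoder, and an LKA
    block at the end of the decoder before the depth head."""
    def __init__(self, encoder, enc_channels=(64, 128, 320, 512)):
        super().__init__()
        self.encoder = encoder  # any backbone returning a list of 4 stage outputs
        c1, c2, c3, c4 = enc_channels
        self.dec3 = DecoderBlock(c4, c3, 256)
        self.dec2 = DecoderBlock(256, c2, 128)
        self.dec1 = DecoderBlock(128, c1, 64)
        self.lka = LKA(64)                           # suppress depth-irrelevant detail
        self.head = nn.Conv2d(64, 1, 3, padding=1)   # per-pixel depth prediction

    def forward(self, image):
        f1, f2, f3, f4 = self.encoder(image)         # multi-scale global features
        x = self.dec3(f4, f3)
        x = self.dec2(x, f2)
        x = self.dec1(x, f1)
        x = self.lka(x)
        return self.head(x)
```

Any hierarchical Transformer backbone that exposes its four stage outputs could serve as `encoder` here; the hyperparameters above are placeholders rather than the values reported in the paper.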
Authors: WU Bingyuan; WANG Yongxiong (School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China)
Source: Control Engineering of China (《控制工程》), 2024, Issue 9, pp. 1619-1625 (7 pages); indexed in CSCD and the Peking University Core Journal list.
Funding: Natural Science Foundation of Shanghai (22ZR1443700).
Keywords: monocular depth estimation; Transformer; large kernel attention; global feature