Monocular 3D object detection is challenging due to the lack of accurate depth information.Some methods estimate the pixel-wise depth maps from off-the-shelf depth estimators and then use them as an additional input t...Monocular 3D object detection is challenging due to the lack of accurate depth information.Some methods estimate the pixel-wise depth maps from off-the-shelf depth estimators and then use them as an additional input to augment the RGB images.Depth-based methods attempt to convert estimated depth maps to pseudo-LiDAR and then use LiDAR-based object detectors or focus on the perspective of image and depth fusion learning.However,they demonstrate limited performance and efficiency as a result of depth inaccuracy and complex fusion mode with convolutions.Different from these approaches,our proposed depth-guided vision transformer with a normalizing flows(NF-DVT)network uses normalizing flows to build priors in depth maps to achieve more accurate depth information.Then we develop a novel Swin-Transformer-based backbone with a fusion module to process RGB image patches and depth map patches with two separate branches and fuse them using cross-attention to exchange information with each other.Furthermore,with the help of pixel-wise relative depth values in depth maps,we develop new relative position embeddings in the cross-attention mechanism to capture more accurate sequence ordering of input tokens.Our method is the first Swin-Transformer-based backbone architecture for monocular 3D object detection.The experimental results on the KITTI and the challenging Waymo Open datasets show the effectiveness of our proposed method and superior performance over previous counterparts.展开更多
Monocular depth estimation is the basic task in computer vision.Its accuracy has tremendous improvement in the decade with the development of deep learning.However,the blurry boundary in the depth map is a serious pro...Monocular depth estimation is the basic task in computer vision.Its accuracy has tremendous improvement in the decade with the development of deep learning.However,the blurry boundary in the depth map is a serious problem.Researchers find that the blurry boundary is mainly caused by two factors.First,the low-level features,containing boundary and structure information,may be lost in deep networks during the convolution process.Second,themodel ignores the errors introduced by the boundary area due to the few portions of the boundary area in the whole area,during the backpropagation.Focusing on the factors mentioned above.Two countermeasures are proposed to mitigate the boundary blur problem.Firstly,we design a scene understanding module and scale transformmodule to build a lightweight fuse feature pyramid,which can deal with low-level feature loss effectively.Secondly,we propose a boundary-aware depth loss function to pay attention to the effects of the boundary’s depth value.Extensive experiments show that our method can predict the depth maps with clearer boundaries,and the performance of the depth accuracy based on NYU-Depth V2,SUN RGB-D,and iBims-1 are competitive.展开更多
The high bandwidth and low latency of 6G network technology enable the successful application of monocular 3D object detection on vehicle platforms.Monocular 3D-object-detection-based Pseudo-LiDAR is a low-cost,lowpow...The high bandwidth and low latency of 6G network technology enable the successful application of monocular 3D object detection on vehicle platforms.Monocular 3D-object-detection-based Pseudo-LiDAR is a low-cost,lowpower solution compared to LiDAR solutions in the field of autonomous driving.However,this technique has some problems,i.e.,(1)the poor quality of generated Pseudo-LiDAR point clouds resulting from the nonlinear error distribution of monocular depth estimation and(2)the weak representation capability of point cloud features due to the neglected global geometric structure features of point clouds existing in LiDAR-based 3D detection networks.Therefore,we proposed a Pseudo-LiDAR confidence sampling strategy and a hierarchical geometric feature extraction module for monocular 3D object detection.We first designed a point cloud confidence sampling strategy based on a 3D Gaussian distribution to assign small confidence to the points with great error in depth estimation and filter them out according to the confidence.Then,we present a hierarchical geometric feature extraction module by aggregating the local neighborhood features and a dual transformer to capture the global geometric features in the point cloud.Finally,our detection framework is based on Point-Voxel-RCNN(PV-RCNN)with high-quality Pseudo-LiDAR and enriched geometric features as input.From the experimental results,our method achieves satisfactory results in monocular 3D object detection.展开更多
Monocular 6D pose estimation is a functional task in the field of com-puter vision and robotics.In recent years,2D-3D correspondence-based methods have achieved improved performance in multiview and depth data-based s...Monocular 6D pose estimation is a functional task in the field of com-puter vision and robotics.In recent years,2D-3D correspondence-based methods have achieved improved performance in multiview and depth data-based scenes.However,for monocular 6D pose estimation,these methods are affected by the prediction results of the 2D-3D correspondences and the robustness of the per-spective-n-point(PnP)algorithm.There is still a difference in the distance from the expected estimation effect.To obtain a more effective feature representation result,edge enhancement is proposed to increase the shape information of the object by analyzing the influence of inaccurate 2D-3D matching on 6D pose regression and comparing the effectiveness of the intermediate representation.Furthermore,although the transformation matrix is composed of rotation and translation matrices from 3D model points to 2D pixel points,the two variables are essentially different and the same network cannot be used for both variables in the regression process.Therefore,to improve the effectiveness of the PnP algo-rithm,this paper designs a dual-branch PnP network to predict rotation and trans-lation information.Finally,the proposed method is verified on the public LM,LM-O and YCB-Video datasets.The ADD(S)values of the proposed method are 94.2 and 62.84 on the LM and LM-O datasets,respectively.The AUC of ADD(-S)value on YCB-Video is 81.1.These experimental results show that the performance of the proposed method is superior to that of similar methods.展开更多
针对单目3D目标检测在视角变化引起的物体大小变化以及物体遮挡等情况下效果不佳的问题,提出一种融合深度信息和实例分割掩码的新型单目3D目标检测方法。首先,通过深度-掩码注意力融合(DMAF)模块,将深度信息与实例分割掩码结合,以提供...针对单目3D目标检测在视角变化引起的物体大小变化以及物体遮挡等情况下效果不佳的问题,提出一种融合深度信息和实例分割掩码的新型单目3D目标检测方法。首先,通过深度-掩码注意力融合(DMAF)模块,将深度信息与实例分割掩码结合,以提供更准确的物体边界;其次,引入动态卷积,并利用DMAF模块得到的融合特征引导动态卷积核的生成,以处理不同尺度的物体;再次,在损失函数中引入2D-3D边界框一致性损失函数,调整预测的3D边界框与对应的2D检测框高度一致,以提高实例分割和3D目标检测任务的效果;最后,通过消融实验验证该方法的有效性,并在KITTI测试集上对该方法进行验证。实验结果表明,与仅使用深度估计图和实例分割掩码的方法相比,在中等难度下对车辆类别检测的平均精度提高了6.36个百分点,且3D目标检测和鸟瞰图目标检测任务的效果均优于D4LCN(Depth-guided Dynamic-Depthwise-Dilated Local Convolutional Network)、M3D-RPN(Monocular 3D Region Proposal Network)等对比方法。展开更多
单目三维视觉测量在视觉测量领域具有低成本、简便性、结构紧凑等优势,是以智能化、网络化制造为特征的先进制造典型技术之一。经过不断发展,单目三维视觉测量技术已成功应用于无人机导航、智能机器人、工业检测、医疗健康等领域,如今...单目三维视觉测量在视觉测量领域具有低成本、简便性、结构紧凑等优势,是以智能化、网络化制造为特征的先进制造典型技术之一。经过不断发展,单目三维视觉测量技术已成功应用于无人机导航、智能机器人、工业检测、医疗健康等领域,如今呈现出精准化、快捷化、微型化、自动化、动态化等发展趋势。以孔径数量为标准,将单目三维视觉测量技术分为单孔径及多孔径两大类,分别综述两类方法的研究现状和发展历程,重点论述了应用较广的运动恢复结构法(Structure From Motion,SFM)和光场三维测量方法,并对单目三维视觉测量技术的未来方向进行了展望。展开更多
基金supported in part by the Major Project for New Generation of AI (2018AAA0100400)the National Natural Science Foundation of China (61836014,U21B2042,62072457,62006231)the InnoHK Program。
文摘Monocular 3D object detection is challenging due to the lack of accurate depth information.Some methods estimate the pixel-wise depth maps from off-the-shelf depth estimators and then use them as an additional input to augment the RGB images.Depth-based methods attempt to convert estimated depth maps to pseudo-LiDAR and then use LiDAR-based object detectors or focus on the perspective of image and depth fusion learning.However,they demonstrate limited performance and efficiency as a result of depth inaccuracy and complex fusion mode with convolutions.Different from these approaches,our proposed depth-guided vision transformer with a normalizing flows(NF-DVT)network uses normalizing flows to build priors in depth maps to achieve more accurate depth information.Then we develop a novel Swin-Transformer-based backbone with a fusion module to process RGB image patches and depth map patches with two separate branches and fuse them using cross-attention to exchange information with each other.Furthermore,with the help of pixel-wise relative depth values in depth maps,we develop new relative position embeddings in the cross-attention mechanism to capture more accurate sequence ordering of input tokens.Our method is the first Swin-Transformer-based backbone architecture for monocular 3D object detection.The experimental results on the KITTI and the challenging Waymo Open datasets show the effectiveness of our proposed method and superior performance over previous counterparts.
基金supported in part by School Research Projects of Wuyi University (No.5041700175).
文摘Monocular depth estimation is the basic task in computer vision.Its accuracy has tremendous improvement in the decade with the development of deep learning.However,the blurry boundary in the depth map is a serious problem.Researchers find that the blurry boundary is mainly caused by two factors.First,the low-level features,containing boundary and structure information,may be lost in deep networks during the convolution process.Second,themodel ignores the errors introduced by the boundary area due to the few portions of the boundary area in the whole area,during the backpropagation.Focusing on the factors mentioned above.Two countermeasures are proposed to mitigate the boundary blur problem.Firstly,we design a scene understanding module and scale transformmodule to build a lightweight fuse feature pyramid,which can deal with low-level feature loss effectively.Secondly,we propose a boundary-aware depth loss function to pay attention to the effects of the boundary’s depth value.Extensive experiments show that our method can predict the depth maps with clearer boundaries,and the performance of the depth accuracy based on NYU-Depth V2,SUN RGB-D,and iBims-1 are competitive.
基金supported by the National Key Research and Development Program of China(2020YFB1807500)the National Natural Science Foundation of China(62072360,62001357,62172438,61901367)+4 种基金the key research and development plan of Shaanxi province(2021ZDLGY02-09,2023-GHZD-44,2023-ZDLGY-54)the Natural Science Foundation of Guangdong Province of China(2022A1515010988)Key Project on Artificial Intelligence of Xi'an Science and Technology Plan(2022JH-RGZN-0003,2022JH-RGZN-0103,2022JH-CLCJ-0053)Xi'an Science and Technology Plan(20RGZN0005)the Proof-ofconcept fund from Hangzhou Research Institute of Xidian University(GNYZ2023QC0201).
文摘The high bandwidth and low latency of 6G network technology enable the successful application of monocular 3D object detection on vehicle platforms.Monocular 3D-object-detection-based Pseudo-LiDAR is a low-cost,lowpower solution compared to LiDAR solutions in the field of autonomous driving.However,this technique has some problems,i.e.,(1)the poor quality of generated Pseudo-LiDAR point clouds resulting from the nonlinear error distribution of monocular depth estimation and(2)the weak representation capability of point cloud features due to the neglected global geometric structure features of point clouds existing in LiDAR-based 3D detection networks.Therefore,we proposed a Pseudo-LiDAR confidence sampling strategy and a hierarchical geometric feature extraction module for monocular 3D object detection.We first designed a point cloud confidence sampling strategy based on a 3D Gaussian distribution to assign small confidence to the points with great error in depth estimation and filter them out according to the confidence.Then,we present a hierarchical geometric feature extraction module by aggregating the local neighborhood features and a dual transformer to capture the global geometric features in the point cloud.Finally,our detection framework is based on Point-Voxel-RCNN(PV-RCNN)with high-quality Pseudo-LiDAR and enriched geometric features as input.From the experimental results,our method achieves satisfactory results in monocular 3D object detection.
基金This work was supported by the National Natural Science Foundation of China(No.61871196 and 62001176)the Natural Science Foundation of Fujian Province of China(No.2019J01082 and 2020J01085)the Promotion Program for Young and Middle-aged Teachers in Science and Technology Research of Huaqiao University(ZQN-YX601).
文摘Monocular 6D pose estimation is a functional task in the field of com-puter vision and robotics.In recent years,2D-3D correspondence-based methods have achieved improved performance in multiview and depth data-based scenes.However,for monocular 6D pose estimation,these methods are affected by the prediction results of the 2D-3D correspondences and the robustness of the per-spective-n-point(PnP)algorithm.There is still a difference in the distance from the expected estimation effect.To obtain a more effective feature representation result,edge enhancement is proposed to increase the shape information of the object by analyzing the influence of inaccurate 2D-3D matching on 6D pose regression and comparing the effectiveness of the intermediate representation.Furthermore,although the transformation matrix is composed of rotation and translation matrices from 3D model points to 2D pixel points,the two variables are essentially different and the same network cannot be used for both variables in the regression process.Therefore,to improve the effectiveness of the PnP algo-rithm,this paper designs a dual-branch PnP network to predict rotation and trans-lation information.Finally,the proposed method is verified on the public LM,LM-O and YCB-Video datasets.The ADD(S)values of the proposed method are 94.2 and 62.84 on the LM and LM-O datasets,respectively.The AUC of ADD(-S)value on YCB-Video is 81.1.These experimental results show that the performance of the proposed method is superior to that of similar methods.
文摘针对单目3D目标检测在视角变化引起的物体大小变化以及物体遮挡等情况下效果不佳的问题,提出一种融合深度信息和实例分割掩码的新型单目3D目标检测方法。首先,通过深度-掩码注意力融合(DMAF)模块,将深度信息与实例分割掩码结合,以提供更准确的物体边界;其次,引入动态卷积,并利用DMAF模块得到的融合特征引导动态卷积核的生成,以处理不同尺度的物体;再次,在损失函数中引入2D-3D边界框一致性损失函数,调整预测的3D边界框与对应的2D检测框高度一致,以提高实例分割和3D目标检测任务的效果;最后,通过消融实验验证该方法的有效性,并在KITTI测试集上对该方法进行验证。实验结果表明,与仅使用深度估计图和实例分割掩码的方法相比,在中等难度下对车辆类别检测的平均精度提高了6.36个百分点,且3D目标检测和鸟瞰图目标检测任务的效果均优于D4LCN(Depth-guided Dynamic-Depthwise-Dilated Local Convolutional Network)、M3D-RPN(Monocular 3D Region Proposal Network)等对比方法。
文摘单目三维视觉测量在视觉测量领域具有低成本、简便性、结构紧凑等优势,是以智能化、网络化制造为特征的先进制造典型技术之一。经过不断发展,单目三维视觉测量技术已成功应用于无人机导航、智能机器人、工业检测、医疗健康等领域,如今呈现出精准化、快捷化、微型化、自动化、动态化等发展趋势。以孔径数量为标准,将单目三维视觉测量技术分为单孔径及多孔径两大类,分别综述两类方法的研究现状和发展历程,重点论述了应用较广的运动恢复结构法(Structure From Motion,SFM)和光场三维测量方法,并对单目三维视觉测量技术的未来方向进行了展望。