Funding: the National Key Research and Development Program of China (2016YFB1001501), NSF of China (61672457), the Fundamental Research Funds for the Central Universities (2018FZA5011), and the Zhejiang University-SenseTime Joint Lab of 3D Vision.
Abstract: Although VSLAM/VISLAM has achieved great success, it remains difficult to quantitatively evaluate the localization results of different kinds of SLAM systems from the perspective of augmented reality, due to the lack of an appropriate benchmark. In practical AR applications, a variety of challenging situations (e.g., fast motion, strong rotation, severe motion blur, dynamic interference) are easily encountered, since a home user may not move the AR device carefully and the real environment may be quite complex. In addition, for a good AR experience, the frequency of camera tracking loss should be minimized, and recovery from the failure state should be fast and accurate. Existing SLAM datasets/benchmarks generally only evaluate pose accuracy, and their camera motions are relatively simple and do not fit the common cases in mobile AR applications well. With the above motivation, we build a new visual-inertial dataset, as well as a series of evaluation criteria, for AR. We also review existing monocular VSLAM/VISLAM approaches with detailed analyses and comparisons. In particular, we select eight representative monocular VSLAM/VISLAM approaches/systems and quantitatively evaluate them on our benchmark. Our dataset, sample code and corresponding evaluation tools are available at the benchmark website http://www.zjucvg.net/eval-vislam/.
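As an illustration of the pose-accuracy side of such an evaluation, here is a minimal, hypothetical sketch of an absolute-trajectory-error (ATE) RMSE computation. The benchmark's actual criteria are richer (tracking robustness, relocalization behavior, etc.), and full pipelines align rotation and scale as well (e.g., via Umeyama alignment) rather than only the mean offset as done here.

```python
import numpy as np

def ate_rmse(est, gt):
    """Simplified absolute trajectory error (RMSE) after removing the
    mean translation offset between the estimated and ground-truth
    trajectories; each input is an (N, 3) array of positions."""
    est = np.asarray(est, float)
    gt = np.asarray(gt, float)
    # Align only the centroids; real benchmarks also solve for rotation/scale.
    est_aligned = est - est.mean(axis=0) + gt.mean(axis=0)
    err = np.linalg.norm(est_aligned - gt, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))
```

A trajectory that differs from the ground truth only by a constant offset scores zero under this simplified alignment.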
Funding: This work was supported by the International Partnership Program of the Chinese Academy of Sciences (173321KYSB20180020, 173321KYSB20200002), the National Natural Science Foundation of China (61903357, 62022088), the Liaoning Provincial Natural Science Foundation of China (2020-MS-032, 2019-YQ-09, 2020JH2/10500002, 2021JH6/10500114), the LiaoNing Revitalization Talents Program (XLYC1902110), the China Postdoctoral Science Foundation (2020M672600), and the Swedish Foundation for Strategic Research (APR20-0023).
Abstract: Reliable and accurate calibration of the camera, inertial measurement unit (IMU) and robot is a critical prerequisite for visual-inertial robot pose estimation and perception of the surrounding environment. However, traditional calibrations suffer from inaccuracy and inconsistency. To address these problems, this paper proposes a monocular visual-inertial and robotic-arm calibration in a unified framework. In our method, the spatial relationship between the sensing units and the robotic arm is geometrically correlated. Decoupled estimation of rotation and translation reduces coupled errors during the optimization. Additionally, the robotic calibration trajectory is designed in a spiral pattern that excites all 6 DOF motions fully, repeatably and consistently. The calibration has been evaluated on our developed platform. In the experiments, the calibration achieves rotation and translation RMSEs of less than 0.7° and 0.01 m, respectively. Comparisons with state-of-the-art results demonstrate the consistency, accuracy and effectiveness of our calibration.
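As a toy illustration of the spiral excitation-trajectory idea, the sketch below generates waypoints on a rising spiral; sweeping such a path moves the sensor through all translational axes while the accompanying orientation change (e.g., keeping the tool pointed at the spiral axis) excites rotation. The function name, parameters and defaults are invented for the sketch and are not the paper's actual trajectory.

```python
import numpy as np

def spiral_waypoints(n=200, turns=3.0, radius=0.15, height=0.2):
    """Hypothetical spiral end-effector path for calibration data capture:
    n waypoints on `turns` revolutions of radius `radius`, rising to
    `height` meters, so that repeated sweeps are consistent."""
    t = np.linspace(0.0, 1.0, n)
    theta = 2.0 * np.pi * turns * t
    return np.stack([radius * np.cos(theta),   # x
                     radius * np.sin(theta),   # y
                     height * t], axis=1)      # z rises linearly
```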
Abstract: This paper proposes a Visual-Inertial Odometry (VIO) algorithm that relies solely on a monocular camera and an Inertial Measurement Unit (IMU), and is capable of real-time self-position estimation for robots during movement. By integrating the optical flow method, the algorithm tracks both point and line features in images simultaneously, significantly reducing computational complexity and the matching time for line feature descriptors. Additionally, this paper advances the triangulation method for line features, using depth information from line segment endpoints to determine their Plücker coordinates in three-dimensional space. Tests on the EuRoC datasets show that the proposed algorithm outperforms PL-VIO in terms of processing speed per frame, with an approximately 5% to 10% improvement in both relative pose error (RPE) and absolute trajectory error (ATE). These results demonstrate that the proposed VIO algorithm is an efficient solution for low-compute platforms requiring real-time localization and navigation.
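The line-triangulation step described above (endpoint depths → 3D endpoints → Plücker coordinates) can be sketched as follows. This is a minimal illustration of the standard construction, not the paper's implementation; the helper names and pinhole intrinsics are assumed.

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift a pixel (u, v) with known depth to a 3D camera-frame point
    under a simple pinhole model."""
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

def plucker_from_endpoints(p1, p2):
    """Plücker coordinates (d, m) of the 3D line through p1 and p2:
    d is the unit direction, m = p1 x d is the moment.  Every point q
    on the line satisfies q x d = m."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    d = p2 - p1
    d /= np.linalg.norm(d)
    m = np.cross(p1, d)
    return d, m
```

For example, both segment endpoints are first back-projected with their depths, then passed to `plucker_from_endpoints` to obtain the line's 6D representation.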
Funding: supported in part by the Major Project for New Generation of AI (2018AAA0100400), the National Natural Science Foundation of China (61836014, U21B2042, 62072457, 62006231), and the InnoHK Program.
Abstract: Monocular 3D object detection is challenging due to the lack of accurate depth information. Some methods estimate pixel-wise depth maps with off-the-shelf depth estimators and then use them as an additional input to augment the RGB images. Depth-based methods attempt to convert estimated depth maps to pseudo-LiDAR and then apply LiDAR-based object detectors, or focus on image-depth fusion learning. However, they demonstrate limited performance and efficiency as a result of depth inaccuracy and complex convolutional fusion modes. Different from these approaches, our proposed depth-guided vision transformer with normalizing flows (NF-DVT) network uses normalizing flows to build priors in depth maps and thereby obtain more accurate depth information. We then develop a novel Swin-Transformer-based backbone with a fusion module that processes RGB image patches and depth map patches in two separate branches and fuses them using cross-attention to exchange information. Furthermore, with the help of pixel-wise relative depth values in the depth maps, we develop new relative position embeddings in the cross-attention mechanism to capture more accurate sequence ordering of input tokens. Our method is the first Swin-Transformer-based backbone architecture for monocular 3D object detection. Experimental results on the KITTI and the challenging Waymo Open datasets show the effectiveness of our proposed method and its superior performance over previous counterparts.
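A minimal NumPy sketch of the cross-attention fusion idea (queries from the RGB branch, keys/values from the depth branch). The projection matrices here are plain arrays standing in for learned weights, and the paper's actual module (multi-head attention, depth-modulated relative position embeddings) is more elaborate.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(rgb_tokens, depth_tokens, Wq, Wk, Wv):
    """Single-head cross-attention: RGB tokens (N, d) query depth
    tokens (M, d), so depth information is routed into the RGB branch."""
    Q = rgb_tokens @ Wq                      # queries from one branch
    K = depth_tokens @ Wk                    # keys from the other branch
    V = depth_tokens @ Wv                    # values from the other branch
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # scaled dot-product scores
    return softmax(scores, axis=-1) @ V      # (N, d) fused output
```

Swapping the roles of the two token sets gives the symmetric depth-queries-RGB direction, so the two branches can exchange information in both directions.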
Funding: support from the National Natural Science Foundation of China (No. 61375086), the Key Project (No. KZ201610005010) of the S&T Plan of the Beijing Municipal Commission of Education, and the Beijing Natural Science Foundation (4174083).
Abstract: Feature detection and tracking, which relies heavily on the gray-value information of images, is a very important procedure in Visual-Inertial Odometry (VIO), and the tracking results significantly affect the accuracy of the estimation and the robustness of VIO. In environments with high-contrast lighting, images captured by an auto-exposure camera change frequently with the exposure time. As a result, the gray value of the same feature varies from frame to frame, which poses a large challenge to the feature detection and tracking procedure. This problem is further aggravated by the nonlinear camera response function and lens attenuation. However, very few VIO methods take full advantage of photometric camera calibration or discuss its influence on VIO. In this paper, we propose a robust monocular visual-inertial odometry, PC-VINS-Mono, which can be understood as an extension of the open-source VIO pipeline VINS-Mono with the capability of photometric calibration. We evaluate the proposed algorithm on a public dataset. Experimental results show that, with photometric calibration, our algorithm achieves better performance compared to VINS-Mono.
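The photometric correction described above can be illustrated with a minimal function following the standard image-formation model from the photometric-calibration literature: invert the (nonlinear) camera response via a lookup table, then divide out vignetting and exposure. This is a generic sketch under that model, not PC-VINS-Mono's actual code.

```python
import numpy as np

def photometric_correct(img, inv_response, vignette, exposure):
    """Map raw 8-bit intensities to (relative) scene irradiance.

    inv_response: 256-entry LUT inverting the camera response G, so that
        inv_response[I] ~ exposure * vignette * irradiance.
    vignette: per-pixel lens attenuation map in (0, 1].
    exposure: frame exposure time (any consistent unit).
    """
    radiant = inv_response[img.astype(np.uint8)]   # undo nonlinear response
    return radiant / (vignette * exposure)         # undo attenuation, exposure
```

After this correction, the value of a tracked feature becomes stable across frames with different exposure times, which is what makes gray-value-based tracking reliable again.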
Funding: supported in part by School Research Projects of Wuyi University (No. 5041700175).
Abstract: Monocular depth estimation is a basic task in computer vision, and its accuracy has improved tremendously over the past decade with the development of deep learning. However, blurry boundaries in the depth map remain a serious problem. Researchers have found that blurry boundaries are mainly caused by two factors. First, low-level features, which contain boundary and structure information, may be lost in deep networks during the convolution process. Second, during backpropagation the model ignores the errors introduced by the boundary area, because the boundary occupies only a small portion of the whole image. Focusing on these factors, two countermeasures are proposed to mitigate the boundary blur problem. First, we design a scene understanding module and a scale transform module to build a lightweight fused feature pyramid, which deals effectively with low-level feature loss. Second, we propose a boundary-aware depth loss function that attends to the depth values in boundary regions. Extensive experiments show that our method predicts depth maps with clearer boundaries, and its depth accuracy on NYU-Depth V2, SUN RGB-D and iBims-1 is competitive.
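A hypothetical minimal version of a boundary-aware depth loss: up-weight the per-pixel L1 error where the ground-truth depth gradient is strong, i.e., near object boundaries. The weighting scheme, threshold and default values here are invented for illustration; the paper's loss may differ.

```python
import numpy as np

def boundary_aware_l1(pred, gt, w_boundary=5.0, thresh=0.1):
    """Per-pixel L1 depth loss, up-weighted where the ground-truth depth
    has strong gradients (a crude proxy for object boundaries)."""
    gy, gx = np.gradient(gt)                            # depth gradients
    boundary = (np.hypot(gx, gy) > thresh).astype(float)
    weights = 1.0 + (w_boundary - 1.0) * boundary       # 1 inside, w_boundary at edges
    return float(np.mean(weights * np.abs(pred - gt)))
```

Because boundary pixels are few, the extra weight keeps their errors from being drowned out by the large interior regions during training.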
Funding: supported by the National Key Research and Development Program of China (2020YFB1807500), the National Natural Science Foundation of China (62072360, 62001357, 62172438, 61901367), the Key Research and Development Plan of Shaanxi Province (2021ZDLGY02-09, 2023-GHZD-44, 2023-ZDLGY-54), the Natural Science Foundation of Guangdong Province of China (2022A1515010988), the Key Project on Artificial Intelligence of the Xi'an Science and Technology Plan (2022JH-RGZN-0003, 2022JH-RGZN-0103, 2022JH-CLCJ-0053), the Xi'an Science and Technology Plan (20RGZN0005), and the Proof-of-concept Fund from the Hangzhou Research Institute of Xidian University (GNYZ2023QC0201).
Abstract: The high bandwidth and low latency of 6G network technology enable the successful application of monocular 3D object detection on vehicle platforms. Pseudo-LiDAR based on monocular 3D object detection is a low-cost, low-power alternative to LiDAR solutions in the field of autonomous driving. However, this technique has two problems: (1) the poor quality of the generated pseudo-LiDAR point clouds, resulting from the nonlinear error distribution of monocular depth estimation, and (2) the weak representation capability of point cloud features, because existing LiDAR-based 3D detection networks neglect the global geometric structure of point clouds. Therefore, we propose a pseudo-LiDAR confidence sampling strategy and a hierarchical geometric feature extraction module for monocular 3D object detection. We first design a point cloud confidence sampling strategy based on a 3D Gaussian distribution, which assigns low confidence to points with large depth-estimation error and filters them out accordingly. We then present a hierarchical geometric feature extraction module that aggregates local neighborhood features and uses a dual transformer to capture global geometric features in the point cloud. Finally, our detection framework is based on Point-Voxel R-CNN (PV-RCNN), with high-quality pseudo-LiDAR and enriched geometric features as input. Experimental results show that our method achieves satisfactory results in monocular 3D object detection.
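A toy sketch of confidence-based pseudo-LiDAR filtering: map an estimated per-point depth residual through a Gaussian falloff and drop low-confidence points. The residual-based confidence here is a simplification of the paper's 3D-Gaussian formulation, and all names and thresholds are assumptions.

```python
import numpy as np

def confidence_filter(points, sigma=0.5, keep_thresh=0.3):
    """Filter pseudo-LiDAR points by a Gaussian confidence score.

    points: (N, 4) array of x, y, z and an estimated depth residual r
    (e.g., |predicted - reference| depth along the ray).  Confidence
    falls off as a Gaussian in the residual; low-confidence points,
    which tend to carry large depth errors, are dropped.
    """
    r = points[:, 3]
    conf = np.exp(-0.5 * (r / sigma) ** 2)      # 1 at r=0, -> 0 for large r
    return points[conf >= keep_thresh, :3], conf
```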
Funding: This work was supported by the National Natural Science Foundation of China (Nos. 61871196 and 62001176), the Natural Science Foundation of Fujian Province of China (Nos. 2019J01082 and 2020J01085), and the Promotion Program for Young and Middle-aged Teachers in Science and Technology Research of Huaqiao University (ZQN-YX601).
Abstract: Monocular 6D pose estimation is a fundamental task in the fields of computer vision and robotics. In recent years, 2D-3D correspondence-based methods have achieved improved performance in multiview and depth-data-based scenes. However, for monocular 6D pose estimation, these methods are limited by the prediction quality of the 2D-3D correspondences and the robustness of the perspective-n-point (PnP) algorithm, and there is still a gap to the expected estimation quality. To obtain a more effective feature representation, edge enhancement is proposed to increase the shape information of the object, motivated by an analysis of the influence of inaccurate 2D-3D matching on 6D pose regression and a comparison of the effectiveness of intermediate representations. Furthermore, although the transformation from 3D model points to 2D pixel points is composed of rotation and translation matrices, the two variables are essentially different, and the same network cannot be used for both in the regression process. Therefore, to improve the effectiveness of the PnP algorithm, this paper designs a dual-branch PnP network that predicts rotation and translation separately. Finally, the proposed method is verified on the public LM, LM-O and YCB-Video datasets. The ADD(S) values of the proposed method are 94.2 and 62.84 on the LM and LM-O datasets, respectively, and the AUC of ADD(-S) on YCB-Video is 81.1. These experimental results show that the performance of the proposed method is superior to that of similar methods.
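The ADD and ADD-S metrics reported above are standard in 6D pose evaluation and can be sketched directly: the mean per-point distance between the object model transformed by the ground-truth pose and by the predicted pose, where ADD-S uses the closest-point distance to handle symmetric objects.

```python
import numpy as np

def add_metric(model_pts, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between same-index model points under the
    ground-truth and predicted poses (model_pts is (N, 3))."""
    gt = model_pts @ R_gt.T + t_gt
    pred = model_pts @ R_pred.T + t_pred
    return float(np.mean(np.linalg.norm(gt - pred, axis=1)))

def add_s_metric(model_pts, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: for symmetric objects, match each ground-truth point to the
    closest predicted point instead of the same-index point."""
    gt = model_pts @ R_gt.T + t_gt
    pred = model_pts @ R_pred.T + t_pred
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return float(np.mean(d.min(axis=1)))
```

A pose is typically counted as correct when ADD(-S) falls below a fraction (commonly 10%) of the object's diameter; the reported ADD(S) percentages are the fraction of test poses passing that test.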
Abstract: To address the poor performance of monocular 3D object detection under object-size changes caused by viewpoint variation and under object occlusion, a new monocular 3D object detection method fusing depth information and instance segmentation masks is proposed. First, a Depth-Mask Attention Fusion (DMAF) module combines depth information with instance segmentation masks to provide more accurate object boundaries. Second, dynamic convolution is introduced, with the fused features from the DMAF module guiding the generation of the dynamic convolution kernels, so as to handle objects at different scales. Third, a 2D-3D bounding-box consistency loss is added to the loss function, adjusting the predicted 3D bounding box to agree closely with the corresponding 2D detection box, which improves both the instance segmentation and 3D object detection tasks. Finally, ablation studies verify the effectiveness of the method, which is further validated on the KITTI test set. Experimental results show that, compared with methods using only depth estimation maps and instance segmentation masks, the average precision for car detection at moderate difficulty improves by 6.36 percentage points, and both 3D object detection and bird's-eye-view detection outperform comparison methods such as D4LCN (Depth-guided Dynamic-Depthwise-Dilated Local Convolutional Network) and M3D-RPN (Monocular 3D Region Proposal Network).
Abstract: Monocular 3D vision measurement offers low cost, simplicity and compact structure within the field of visual measurement, and is one of the typical technologies of advanced manufacturing characterized by intelligent, networked production. Through continuous development, monocular 3D vision measurement has been successfully applied in UAV navigation, intelligent robotics, industrial inspection, healthcare and other fields, and now shows trends toward higher precision, higher speed, miniaturization, automation and dynamic operation. Taking the number of apertures as the criterion, monocular 3D vision measurement techniques are divided into single-aperture and multi-aperture categories; the research status and development history of each category are reviewed, with emphasis on the widely used Structure From Motion (SFM) and light-field 3D measurement methods, and future directions of monocular 3D vision measurement are discussed.