This paper presents a complete system for scanning the geometry and texture of a large 3D object, then the automatic registration is performed to obtain a whole realistic 3D model. This system is composed of one line ...This paper presents a complete system for scanning the geometry and texture of a large 3D object, then the automatic registration is performed to obtain a whole realistic 3D model. This system is composed of one line strip laser and one color CCD camera. The scanned object is pictured twice by a color CCD camera. First, the texture of the scanned object is taken by a color CCD camera. Then the 3D information of the scanned object is obtained from laser plane equations. This paper presents a practical way to implement the three dimensional measuring method and the automatic registration of a large 3D object and a pretty good result is obtained after experiment verification.展开更多
Virtual globes(VGs)allow Internet users to view geographic data of heterogeneous quality created by other users.This article presents a new approach for collecting and visualizing information about the perceived quali...Virtual globes(VGs)allow Internet users to view geographic data of heterogeneous quality created by other users.This article presents a new approach for collecting and visualizing information about the perceived quality of 3D data in VGs.It aims atimproving users’awareness of the qualityof 3D objects.Instead of relying onthe existing metadata or on formal accuracy assessments that are often impossible in practice,we propose a crowd-sourced quality recommender system based on the five-star visualization method successful in other types of Web applications.Four alternative five-star visualizations were implemented in a Google Earth-based prototype and tested through a formal user evaluation.These tests helped identifying the most effective method for a 3D environment.Results indicate that while most websites use a visualization approach that shows a‘number of stars’,this method was the least preferred by participants.Instead,participants ranked the‘number within a star’method highest as it allowed reducing the visual clutter in urban settings,suggesting that 3D environments such as VGs require different designapproachesthan2Dornon-geographicapplications.Resultsalsoconfirmed that expert and non-expert users in geographic data share similar preferences for the most and least preferred visualization methods.展开更多
Aiming at the limitations of the existing railway foreign object detection methods based on two-dimensional(2D)images,such as short detection distance,strong influence of environment and lack of distance information,w...Aiming at the limitations of the existing railway foreign object detection methods based on two-dimensional(2D)images,such as short detection distance,strong influence of environment and lack of distance information,we propose Rail-PillarNet,a three-dimensional(3D)LIDAR(Light Detection and Ranging)railway foreign object detection method based on the improvement of PointPillars.Firstly,the parallel attention pillar encoder(PAPE)is designed to fully extract the features of the pillars and alleviate the problem of local fine-grained information loss in PointPillars pillars encoder.Secondly,a fine backbone network is designed to improve the feature extraction capability of the network by combining the coding characteristics of LIDAR point cloud feature and residual structure.Finally,the initial weight parameters of the model were optimised by the transfer learning training method to further improve accuracy.The experimental results on the OSDaR23 dataset show that the average accuracy of Rail-PillarNet reaches 58.51%,which is higher than most mainstream models,and the number of parameters is 5.49 M.Compared with PointPillars,the accuracy of each target is improved by 10.94%,3.53%,16.96%and 19.90%,respectively,and the number of parameters only increases by 0.64M,which achieves a balance between the number of parameters and accuracy.展开更多
Monocular 3D object detection is challenging due to the lack of accurate depth information.Some methods estimate the pixel-wise depth maps from off-the-shelf depth estimators and then use them as an additional input t...Monocular 3D object detection is challenging due to the lack of accurate depth information.Some methods estimate the pixel-wise depth maps from off-the-shelf depth estimators and then use them as an additional input to augment the RGB images.Depth-based methods attempt to convert estimated depth maps to pseudo-LiDAR and then use LiDAR-based object detectors or focus on the perspective of image and depth fusion learning.However,they demonstrate limited performance and efficiency as a result of depth inaccuracy and complex fusion mode with convolutions.Different from these approaches,our proposed depth-guided vision transformer with a normalizing flows(NF-DVT)network uses normalizing flows to build priors in depth maps to achieve more accurate depth information.Then we develop a novel Swin-Transformer-based backbone with a fusion module to process RGB image patches and depth map patches with two separate branches and fuse them using cross-attention to exchange information with each other.Furthermore,with the help of pixel-wise relative depth values in depth maps,we develop new relative position embeddings in the cross-attention mechanism to capture more accurate sequence ordering of input tokens.Our method is the first Swin-Transformer-based backbone architecture for monocular 3D object detection.The experimental results on the KITTI and the challenging Waymo Open datasets show the effectiveness of our proposed method and superior performance over previous counterparts.展开更多
The self-attention networks and Transformer have dominated machine translation and natural language processing fields,and shown great potential in image vision tasks such as image classification and object detection.I...The self-attention networks and Transformer have dominated machine translation and natural language processing fields,and shown great potential in image vision tasks such as image classification and object detection.Inspired by the great progress of Transformer,we propose a novel general and robust voxel feature encoder for 3D object detection based on the traditional Transformer.We first investigate the permutation invariance of sequence data of the self-attention and apply it to point cloud processing.Then we construct a voxel feature layer based on the self-attention to adaptively learn local and robust context of a voxel according to the spatial relationship and context information exchanging between all points within the voxel.Lastly,we construct a general voxel feature learning framework with the voxel feature layer as the core for 3D object detection.The voxel feature with Transformer(VFT)can be plugged into any other voxel-based 3D object detection framework easily,and serves as the backbone for voxel feature extractor.Experiments results on the KITTI dataset demonstrate that our method achieves the state-of-the-art performance on 3D object detection.展开更多
In order to find better simplicity measurements for 3D object recognition, a new set of local regularities is developed and tested in a stepwise 3D reconstruction method, including localized minimizing standard deviat...In order to find better simplicity measurements for 3D object recognition, a new set of local regularities is developed and tested in a stepwise 3D reconstruction method, including localized minimizing standard deviation of angles(L-MSDA), localized minimizing standard deviation of segment magnitudes(L-MSDSM), localized minimum standard deviation of areas of child faces (L-MSDAF), localized minimum sum of segment magnitudes of common edges (L-MSSM), and localized minimum sum of areas of child face (L-MSAF). Based on their effectiveness measurements in terms of form and size distortions, it is found that when two local regularities: L-MSDA and L-MSDSM are combined together, they can produce better performance. In addition, the best weightings for them to work together are identified as 10% for L-MSDSM and 90% for L-MSDA. The test results show that the combined usage of L-MSDA and L-MSDSM with identified weightings has a potential to be applied in other optimization based 3D recognition methods to improve their efficacy and robustness.展开更多
In complex traffic environment scenarios,it is very important for autonomous vehicles to accurately perceive the dynamic information of other vehicles around the vehicle in advance.The accuracy of 3D object detection ...In complex traffic environment scenarios,it is very important for autonomous vehicles to accurately perceive the dynamic information of other vehicles around the vehicle in advance.The accuracy of 3D object detection will be affected by problems such as illumination changes,object occlusion,and object detection distance.To this purpose,we face these challenges by proposing a multimodal feature fusion network for 3D object detection(MFF-Net).In this research,this paper first uses the spatial transformation projection algorithm to map the image features into the feature space,so that the image features are in the same spatial dimension when fused with the point cloud features.Then,feature channel weighting is performed using an adaptive expression augmentation fusion network to enhance important network features,suppress useless features,and increase the directionality of the network to features.Finally,this paper increases the probability of false detection and missed detection in the non-maximum suppression algo-rithm by increasing the one-dimensional threshold.So far,this paper has constructed a complete 3D target detection network based on multimodal feature fusion.The experimental results show that the proposed achieves an average accuracy of 82.60%on the Karlsruhe Institute of Technology and Toyota Technological Institute(KITTI)dataset,outperforming previous state-of-the-art multimodal fusion networks.In Easy,Moderate,and hard evaluation indicators,the accuracy rate of this paper reaches 90.96%,81.46%,and 75.39%.This shows that the MFF-Net network has good performance in 3D object detection.展开更多
The high bandwidth and low latency of 6G network technology enable the successful application of monocular 3D object detection on vehicle platforms.Monocular 3D-object-detection-based Pseudo-LiDAR is a low-cost,lowpow...The high bandwidth and low latency of 6G network technology enable the successful application of monocular 3D object detection on vehicle platforms.Monocular 3D-object-detection-based Pseudo-LiDAR is a low-cost,lowpower solution compared to LiDAR solutions in the field of autonomous driving.However,this technique has some problems,i.e.,(1)the poor quality of generated Pseudo-LiDAR point clouds resulting from the nonlinear error distribution of monocular depth estimation and(2)the weak representation capability of point cloud features due to the neglected global geometric structure features of point clouds existing in LiDAR-based 3D detection networks.Therefore,we proposed a Pseudo-LiDAR confidence sampling strategy and a hierarchical geometric feature extraction module for monocular 3D object detection.We first designed a point cloud confidence sampling strategy based on a 3D Gaussian distribution to assign small confidence to the points with great error in depth estimation and filter them out according to the confidence.Then,we present a hierarchical geometric feature extraction module by aggregating the local neighborhood features and a dual transformer to capture the global geometric features in the point cloud.Finally,our detection framework is based on Point-Voxel-RCNN(PV-RCNN)with high-quality Pseudo-LiDAR and enriched geometric features as input.From the experimental results,our method achieves satisfactory results in monocular 3D object detection.展开更多
3D object recognition is a challenging task for intelligent and robot systems in industrial and home indoor environments.It is critical for such systems to recognize and segment the 3D object instances that they encou...3D object recognition is a challenging task for intelligent and robot systems in industrial and home indoor environments.It is critical for such systems to recognize and segment the 3D object instances that they encounter on a frequent basis.The computer vision,graphics,and machine learning fields have all given it a lot of attention.Traditionally,3D segmentation was done with hand-crafted features and designed approaches that didn’t achieve acceptable performance and couldn’t be generalized to large-scale data.Deep learning approaches have lately become the preferred method for 3D segmentation challenges by their great success in 2D computer vision.However,the task of instance segmentation is currently less explored.In this paper,we propose a novel approach for efficient 3D instance segmentation using red green blue and depth(RGB-D)data based on deep learning.The 2D region based convolutional neural networks(Mask R-CNN)deep learning model with point based rending module is adapted to integrate with depth information to recognize and segment 3D instances of objects.In order to generate 3D point cloud coordinates(x,y,z),segmented 2D pixels(u,v)of recognized object regions in the RGB image are merged into(u,v)points of the depth image.Moreover,we conducted an experiment and analysis to compare our proposed method from various points of view and distances.The experimentation shows the proposed 3D object recognition and instance segmentation are sufficiently beneficial to support object handling in robotic and intelligent systems.展开更多
LIDAR point cloud-based 3D object detection aims to sense the surrounding environment by anchoring objects with the Bounding Box(BBox).However,under the three-dimensional space of autonomous driving scenes,the previou...LIDAR point cloud-based 3D object detection aims to sense the surrounding environment by anchoring objects with the Bounding Box(BBox).However,under the three-dimensional space of autonomous driving scenes,the previous object detection methods,due to the pre-processing of the original LIDAR point cloud into voxels or pillars,lose the coordinate information of the original point cloud,slow detection speed,and gain inaccurate bounding box positioning.To address the issues above,this study proposes a new two-stage network structure to extract point cloud features directly by PointNet++,which effectively preserves the original point cloud coordinate information.To improve the detection accuracy,a shell-based modeling method is proposed.It roughly determines which spherical shell the coordinates belong to.Then,the results are refined to ground truth,thereby narrowing the localization range and improving the detection accuracy.To improve the recall of 3D object detection with bounding boxes,this paper designs a self-attention module for 3D object detection with a skip connection structure.Some of these features are highlighted by weighting them on the feature dimensions.After training,it makes the feature weights that are favorable for object detection get larger.Thus,the extracted features are more adapted to the object detection task.Extensive comparison experiments and ablation experiments conducted on the KITTI dataset verify the effectiveness of our proposed method in improving recall and precision.展开更多
Light detection and ranging(LiDAR)sensors play a vital role in acquiring 3D point cloud data and extracting valuable information about objects for tasks such as autonomous driving,robotics,and virtual reality(VR).Howe...Light detection and ranging(LiDAR)sensors play a vital role in acquiring 3D point cloud data and extracting valuable information about objects for tasks such as autonomous driving,robotics,and virtual reality(VR).However,the sparse and disordered nature of the 3D point cloud poses significant challenges to feature extraction.Overcoming limitations is critical for 3D point cloud processing.3D point cloud object detection is a very challenging and crucial task,in which point cloud processing and feature extraction methods play a crucial role and have a significant impact on subsequent object detection performance.In this overview of outstanding work in object detection from the 3D point cloud,we specifically focus on summarizing methods employed in 3D point cloud processing.We introduce the way point clouds are processed in classical 3D object detection algorithms,and their improvements to solve the problems existing in point cloud processing.Different voxelization methods and point cloud sampling strategies will influence the extracted features,thereby impacting the final detection performance.展开更多
Background In this work,we focus on the label layout problem:specifying the positions of overlaid virtual annotations in Virtual/Augmented Reality scenarios.Methods Designing a layout of labels that does not violate d...Background In this work,we focus on the label layout problem:specifying the positions of overlaid virtual annotations in Virtual/Augmented Reality scenarios.Methods Designing a layout of labels that does not violate domain-specific design requirements,while at the same time satisfying aesthetic and functional principles of good design,can be a daunting task even for skilled visual designers.Presenting the annotations in 3D object space instead of projection space,allows for the preservation of spatial and depth cues.This results in stable layouts in dynamic environments,since the annotations are anchored in 3D space.Results In this paper we make two major contributions.First,we propose a technique for managing the layout and rendering of annotations in Virtual/Augmented Reality scenarios by manipulating the annotations directly in 3D space.For this,we make use of Artificial Potential Fields and use 3D geometric constraints to adapt them in 3D space.Second,we introduce PartLabeling:an open source platform in the form of a web application that acts as a much-needed generic framework allowing to easily add labeling algorithms and 3D models.This serves as a catalyst for researchers in this field to make their algorithms and implementations publicly available,as well as ensure research reproducibility.The PartLabeling framework relies on a dataset that we generate as a subset of the original PartNet dataset consisting of models suitable for the label management task.The dataset consists of 10003D models with part annotations.展开更多
In recent years,autonomous driving technology has made good progress,but the noncooperative intelligence of vehicle for autonomous driving still has many technical bottlenecks when facing urban road autonomous driving...In recent years,autonomous driving technology has made good progress,but the noncooperative intelligence of vehicle for autonomous driving still has many technical bottlenecks when facing urban road autonomous driving challenges.V2I(Vehicle-to-Infrastructure)communication is a potential solution to enable cooperative intelligence of vehicles and roads.In this paper,the RGB-PVRCNN,an environment perception framework,is proposed to improve the environmental awareness of autonomous vehicles at intersections by leveraging V2I communication technology.This framework integrates vision feature based on PVRCNN.The normal distributions transform(NDT)point cloud registration algorithm is deployed both on onboard and roadside to obtain the position of the autonomous vehicles and to build the local map objects detected by roadside multi-sensor system are sent back to autonomous vehicles to enhance the perception ability of autonomous vehicles for benefiting path planning and traffic efficiency at the intersection.The field-testing results show that our method can effectively extend the environmental perception ability and range of autonomous vehicles at the intersection and outperform the PointPillar algorithm and the VoxelRCNN algorithm in detection accuracy.展开更多
A method for deformation of 3D point clouds models was proposed with multi-constraints including arc-length constraints and multi-points position constraints. The energy function was built for the polyline which had b...A method for deformation of 3D point clouds models was proposed with multi-constraints including arc-length constraints and multi-points position constraints. The energy function was built for the polyline which had been converted from the curve. Based on the minimum energy curve method, the curve on the mesh was deformed. The test results show that the proposed method has good performance. Compared with the other method,shape preserving of the curve is better. Finally,this method is used for the deformation of the 3D mannequin model. Circumference changes of the mannequin model can be reflected by the arc-length change in the size of the cross section.展开更多
Road accident detection plays an important role in abnormal scene reconstruction for Intelligent Transportation Systems and abnormal events warning for autonomous driving.This paper presents a novel 3D object detector...Road accident detection plays an important role in abnormal scene reconstruction for Intelligent Transportation Systems and abnormal events warning for autonomous driving.This paper presents a novel 3D object detector and adaptive space partitioning algorithm to infer traffic accidents quantitatively.Using 2D region proposals in an RGB image,this method generates deformable frustums based on point cloud for each 2D region proposal and then frustum-wisely extracts features based on the farthest point sampling network(FPS-Net)and feature extraction network(FE-Net).Subsequently,the encoder-decoder network(ED-Net)implements 3D-oriented bounding box(OBB)regression.Meanwhile,the adaptive least square regression(ALSR)method is proposed to split 3D OBB.Finally,the reduced OBB intersection test is carried out to detect traffic accidents via separating surface theorem(SST).In the experiments of KITTI benchmark,our proposed 3D object detector outperforms other state-of-theartmethods.Meanwhile,collision detection algorithm achieves the satisfactory performance of 91.8%accuracy on our SHTA dataset.展开更多
In order to solve difficult detection of far and hard objects due to the sparseness and insufficient semantic information of LiDAR point cloud,a 3D object detection network with multi-modal data adaptive fusion is pro...In order to solve difficult detection of far and hard objects due to the sparseness and insufficient semantic information of LiDAR point cloud,a 3D object detection network with multi-modal data adaptive fusion is proposed,which makes use of multi-neighborhood information of voxel and image information.Firstly,design an improved ResNet that maintains the structure information of far and hard objects in low-resolution feature maps,which is more suitable for detection task.Meanwhile,semantema of each image feature map is enhanced by semantic information from all subsequent feature maps.Secondly,extract multi-neighborhood context information with different receptive field sizes to make up for the defect of sparseness of point cloud which improves the ability of voxel features to represent the spatial structure and semantic information of objects.Finally,propose a multi-modal feature adaptive fusion strategy which uses learnable weights to express the contribution of different modal features to the detection task,and voxel attention further enhances the fused feature expression of effective target objects.The experimental results on the KITTI benchmark show that this method outperforms VoxelNet with remarkable margins,i.e.increasing the AP by 8.78%and 5.49%on medium and hard difficulty levels.Meanwhile,our method achieves greater detection performance compared with many mainstream multi-modal methods,i.e.outperforming the AP by 1%compared with that of MVX-Net on medium and hard difficulty levels.展开更多
Augmented Reality(AR)applications can be used to improve tasks and mitigate errors during facilities operation and maintenance.This article presents an AR system for facility management using a three-dimensional(3D)ob...Augmented Reality(AR)applications can be used to improve tasks and mitigate errors during facilities operation and maintenance.This article presents an AR system for facility management using a three-dimensional(3D)object tracking method.Through spatial mapping,the object of interest,a pipe trap underneath a sink,is tracked and mixed onto the AR visualization.From that,the maintenance steps are transformed into visible and animated instructions.Although some tracking issues related to the component parts were observed,the designed AR application results demonstrated the potential to improve facility management tasks.展开更多
Point clouds and RGB images are both critical data for 3D object detection. While recent multi-modal methods combine them directly and show remarkable performances, they ignore the distinct forms of these two types of...Point clouds and RGB images are both critical data for 3D object detection. While recent multi-modal methods combine them directly and show remarkable performances, they ignore the distinct forms of these two types of data. For mitigating the influence of this intrinsic difference on performance, we propose a novel but effective fusion model named LI-Attention model, which takes both RGB features and point cloud features into consideration and assigns a weight to each RGB feature by attention mechanism.Furthermore, based on the LI-Attention model, we propose a 3D object detection method called image attention transformer network(IAT-Net) specialized for indoor RGB-D scene. Compared with previous work on multi-modal detection, IAT-Net fuses elaborate RGB features from 2D detection results with point cloud features in attention mechanism, meanwhile generates and refines 3D detection results with transformer model. Extensive experiments demonstrate that our approach outperforms stateof-the-art performance on two widely used benchmarks of indoor 3D object detection, SUN RGB-D and NYU Depth V2, while ablation studies have been provided to analyze the effect of each module. And the source code for the proposed IAT-Net is publicly available at https://github.com/wisper181/IAT-Net.展开更多
Relation contexts have been proved to be useful for many challenging vision tasks.In the field of3D object detection,previous methods have been taking the advantage of context encoding,graph embedding,or explicit rela...Relation contexts have been proved to be useful for many challenging vision tasks.In the field of3D object detection,previous methods have been taking the advantage of context encoding,graph embedding,or explicit relation reasoning to extract relation contexts.However,there exist inevitably redundant relation contexts due to noisy or low-quality proposals.In fact,invalid relation contexts usually indicate underlying scene misunderstanding and ambiguity,which may,on the contrary,reduce the performance in complex scenes.Inspired by recent attention mechanism like Transformer,we propose a novel 3D attention-based relation module(ARM3D).It encompasses objectaware relation reasoning to extract pair-wise relation contexts among qualified proposals and an attention module to distribute attention weights towards different relation contexts.In this way,ARM3D can take full advantage of the useful relation contexts and filter those less relevant or even confusing contexts,which mitigates the ambiguity in detection.We have evaluated the effectiveness of ARM3D by plugging it into several state-of-the-art 3D object detectors and showing more accurate and robust detection results.Extensive experiments show the capability and generalization of ARM3D on 3D object detection.Our source code is available at https://github.com/lanlan96/ARM3D.展开更多
In recent years,camera-and lidar-based 3D object detection has achieved great progress.However,the related researches mainly focus on normal illumination conditions;the performance of their 3D detection algorithms wil...In recent years,camera-and lidar-based 3D object detection has achieved great progress.However,the related researches mainly focus on normal illumination conditions;the performance of their 3D detection algorithms will decrease under low lighting scenarios such as in the night.This work attempts to improve the fusion strategies on 3D vehicle detection accuracy in multiple lighting conditions.First,distance and uncertainty information is incorporated to guide the“painting”of semantic information onto point cloud during the data preprocessing.Moreover,a multitask framework is designed,which incorpo-rates uncertainty learning to improve detection accuracy under low-illumination scenarios.In the validation on KITTI and Dark-KITTI benchmark,the proposed method increases the vehicle detection accuracy on the KITTI benchmark by 1.35%and the generality of the model is validated on the proposed Dark-KITTI dataset,with a gain of 0.64%for vehicle detection.展开更多
文摘This paper presents a complete system for scanning the geometry and texture of a large 3D object, then the automatic registration is performed to obtain a whole realistic 3D model. This system is composed of one line strip laser and one color CCD camera. The scanned object is pictured twice by a color CCD camera. First, the texture of the scanned object is taken by a color CCD camera. Then the 3D information of the scanned object is obtained from laser plane equations. This paper presents a practical way to implement the three dimensional measuring method and the automatic registration of a large 3D object and a pretty good result is obtained after experiment verification.
文摘Virtual globes(VGs)allow Internet users to view geographic data of heterogeneous quality created by other users.This article presents a new approach for collecting and visualizing information about the perceived quality of 3D data in VGs.It aims atimproving users’awareness of the qualityof 3D objects.Instead of relying onthe existing metadata or on formal accuracy assessments that are often impossible in practice,we propose a crowd-sourced quality recommender system based on the five-star visualization method successful in other types of Web applications.Four alternative five-star visualizations were implemented in a Google Earth-based prototype and tested through a formal user evaluation.These tests helped identifying the most effective method for a 3D environment.Results indicate that while most websites use a visualization approach that shows a‘number of stars’,this method was the least preferred by participants.Instead,participants ranked the‘number within a star’method highest as it allowed reducing the visual clutter in urban settings,suggesting that 3D environments such as VGs require different designapproachesthan2Dornon-geographicapplications.Resultsalsoconfirmed that expert and non-expert users in geographic data share similar preferences for the most and least preferred visualization methods.
基金supported by a grant from the National Key Research and Development Project(2023YFB4302100)Key Research and Development Project of Jiangxi Province(No.20232ACE01011)Independent Deployment Project of Ganjiang Innovation Research Institute,Chinese Academy of Sciences(E255J001).
文摘Aiming at the limitations of the existing railway foreign object detection methods based on two-dimensional(2D)images,such as short detection distance,strong influence of environment and lack of distance information,we propose Rail-PillarNet,a three-dimensional(3D)LIDAR(Light Detection and Ranging)railway foreign object detection method based on the improvement of PointPillars.Firstly,the parallel attention pillar encoder(PAPE)is designed to fully extract the features of the pillars and alleviate the problem of local fine-grained information loss in PointPillars pillars encoder.Secondly,a fine backbone network is designed to improve the feature extraction capability of the network by combining the coding characteristics of LIDAR point cloud feature and residual structure.Finally,the initial weight parameters of the model were optimised by the transfer learning training method to further improve accuracy.The experimental results on the OSDaR23 dataset show that the average accuracy of Rail-PillarNet reaches 58.51%,which is higher than most mainstream models,and the number of parameters is 5.49 M.Compared with PointPillars,the accuracy of each target is improved by 10.94%,3.53%,16.96%and 19.90%,respectively,and the number of parameters only increases by 0.64M,which achieves a balance between the number of parameters and accuracy.
基金supported in part by the Major Project for New Generation of AI (2018AAA0100400)the National Natural Science Foundation of China (61836014,U21B2042,62072457,62006231)the InnoHK Program。
文摘Monocular 3D object detection is challenging due to the lack of accurate depth information.Some methods estimate the pixel-wise depth maps from off-the-shelf depth estimators and then use them as an additional input to augment the RGB images.Depth-based methods attempt to convert estimated depth maps to pseudo-LiDAR and then use LiDAR-based object detectors or focus on the perspective of image and depth fusion learning.However,they demonstrate limited performance and efficiency as a result of depth inaccuracy and complex fusion mode with convolutions.Different from these approaches,our proposed depth-guided vision transformer with a normalizing flows(NF-DVT)network uses normalizing flows to build priors in depth maps to achieve more accurate depth information.Then we develop a novel Swin-Transformer-based backbone with a fusion module to process RGB image patches and depth map patches with two separate branches and fuse them using cross-attention to exchange information with each other.Furthermore,with the help of pixel-wise relative depth values in depth maps,we develop new relative position embeddings in the cross-attention mechanism to capture more accurate sequence ordering of input tokens.Our method is the first Swin-Transformer-based backbone architecture for monocular 3D object detection.The experimental results on the KITTI and the challenging Waymo Open datasets show the effectiveness of our proposed method and superior performance over previous counterparts.
基金National Natural Science Foundation of China(No.61806006)Innovation Program for Graduate of Jiangsu Province(No.KYLX160-781)University Superior Discipline Construction Project of Jiangsu Province。
文摘The self-attention networks and Transformer have dominated machine translation and natural language processing fields,and shown great potential in image vision tasks such as image classification and object detection.Inspired by the great progress of Transformer,we propose a novel general and robust voxel feature encoder for 3D object detection based on the traditional Transformer.We first investigate the permutation invariance of sequence data of the self-attention and apply it to point cloud processing.Then we construct a voxel feature layer based on the self-attention to adaptively learn local and robust context of a voxel according to the spatial relationship and context information exchanging between all points within the voxel.Lastly,we construct a general voxel feature learning framework with the voxel feature layer as the core for 3D object detection.The voxel feature with Transformer(VFT)can be plugged into any other voxel-based 3D object detection framework easily,and serves as the backbone for voxel feature extractor.Experiments results on the KITTI dataset demonstrate that our method achieves the state-of-the-art performance on 3D object detection.
文摘In order to find better simplicity measurements for 3D object recognition, a new set of local regularities is developed and tested in a stepwise 3D reconstruction method, including localized minimizing standard deviation of angles(L-MSDA), localized minimizing standard deviation of segment magnitudes(L-MSDSM), localized minimum standard deviation of areas of child faces (L-MSDAF), localized minimum sum of segment magnitudes of common edges (L-MSSM), and localized minimum sum of areas of child face (L-MSAF). Based on their effectiveness measurements in terms of form and size distortions, it is found that when two local regularities: L-MSDA and L-MSDSM are combined together, they can produce better performance. In addition, the best weightings for them to work together are identified as 10% for L-MSDSM and 90% for L-MSDA. The test results show that the combined usage of L-MSDA and L-MSDSM with identified weightings has a potential to be applied in other optimization based 3D recognition methods to improve their efficacy and robustness.
基金The authors would like to thank the financial support of Natural Science Foundation of Anhui Province(No.2208085MF173)the key research and development projects of Anhui(202104a05020003)+2 种基金the anhui development and reform commission supports R&D and innovation project([2020]479)the national natural science foundation of China(51575001)Anhui university scientific research platform innovation team building project(2016-2018).
文摘In complex traffic environment scenarios,it is very important for autonomous vehicles to accurately perceive the dynamic information of other vehicles around the vehicle in advance.The accuracy of 3D object detection will be affected by problems such as illumination changes,object occlusion,and object detection distance.To this purpose,we face these challenges by proposing a multimodal feature fusion network for 3D object detection(MFF-Net).In this research,this paper first uses the spatial transformation projection algorithm to map the image features into the feature space,so that the image features are in the same spatial dimension when fused with the point cloud features.Then,feature channel weighting is performed using an adaptive expression augmentation fusion network to enhance important network features,suppress useless features,and increase the directionality of the network to features.Finally,this paper increases the probability of false detection and missed detection in the non-maximum suppression algo-rithm by increasing the one-dimensional threshold.So far,this paper has constructed a complete 3D target detection network based on multimodal feature fusion.The experimental results show that the proposed achieves an average accuracy of 82.60%on the Karlsruhe Institute of Technology and Toyota Technological Institute(KITTI)dataset,outperforming previous state-of-the-art multimodal fusion networks.In Easy,Moderate,and hard evaluation indicators,the accuracy rate of this paper reaches 90.96%,81.46%,and 75.39%.This shows that the MFF-Net network has good performance in 3D object detection.
基金supported by the National Key Research and Development Program of China(2020YFB1807500)the National Natural Science Foundation of China(62072360,62001357,62172438,61901367)+4 种基金the key research and development plan of Shaanxi province(2021ZDLGY02-09,2023-GHZD-44,2023-ZDLGY-54)the Natural Science Foundation of Guangdong Province of China(2022A1515010988)Key Project on Artificial Intelligence of Xi'an Science and Technology Plan(2022JH-RGZN-0003,2022JH-RGZN-0103,2022JH-CLCJ-0053)Xi'an Science and Technology Plan(20RGZN0005)the Proof-ofconcept fund from Hangzhou Research Institute of Xidian University(GNYZ2023QC0201).
文摘The high bandwidth and low latency of 6G network technology enable the successful application of monocular 3D object detection on vehicle platforms.Monocular 3D-object-detection-based Pseudo-LiDAR is a low-cost,lowpower solution compared to LiDAR solutions in the field of autonomous driving.However,this technique has some problems,i.e.,(1)the poor quality of generated Pseudo-LiDAR point clouds resulting from the nonlinear error distribution of monocular depth estimation and(2)the weak representation capability of point cloud features due to the neglected global geometric structure features of point clouds existing in LiDAR-based 3D detection networks.Therefore,we proposed a Pseudo-LiDAR confidence sampling strategy and a hierarchical geometric feature extraction module for monocular 3D object detection.We first designed a point cloud confidence sampling strategy based on a 3D Gaussian distribution to assign small confidence to the points with great error in depth estimation and filter them out according to the confidence.Then,we present a hierarchical geometric feature extraction module by aggregating the local neighborhood features and a dual transformer to capture the global geometric features in the point cloud.Finally,our detection framework is based on Point-Voxel-RCNN(PV-RCNN)with high-quality Pseudo-LiDAR and enriched geometric features as input.From the experimental results,our method achieves satisfactory results in monocular 3D object detection.
文摘3D object recognition is a challenging task for intelligent and robot systems in industrial and home indoor environments.It is critical for such systems to recognize and segment the 3D object instances that they encounter on a frequent basis.The computer vision,graphics,and machine learning fields have all given it a lot of attention.Traditionally,3D segmentation was done with hand-crafted features and designed approaches that didn’t achieve acceptable performance and couldn’t be generalized to large-scale data.Deep learning approaches have lately become the preferred method for 3D segmentation challenges by their great success in 2D computer vision.However,the task of instance segmentation is currently less explored.In this paper,we propose a novel approach for efficient 3D instance segmentation using red green blue and depth(RGB-D)data based on deep learning.The 2D region based convolutional neural networks(Mask R-CNN)deep learning model with point based rending module is adapted to integrate with depth information to recognize and segment 3D instances of objects.In order to generate 3D point cloud coordinates(x,y,z),segmented 2D pixels(u,v)of recognized object regions in the RGB image are merged into(u,v)points of the depth image.Moreover,we conducted an experiment and analysis to compare our proposed method from various points of view and distances.The experimentation shows the proposed 3D object recognition and instance segmentation are sufficiently beneficial to support object handling in robotic and intelligent systems.
基金This work was supported,in part,by the National Nature Science Foundation of China under grant numbers 62272236in part,by the Natural Science Foundation of Jiangsu Province under grant numbers BK20201136,BK20191401in part,by the Priority Academic Program Development of Jiangsu Higher Education Institutions(PAPD)fund.
文摘LIDAR point cloud-based 3D object detection aims to sense the surrounding environment by anchoring objects with the Bounding Box(BBox).However,under the three-dimensional space of autonomous driving scenes,the previous object detection methods,due to the pre-processing of the original LIDAR point cloud into voxels or pillars,lose the coordinate information of the original point cloud,slow detection speed,and gain inaccurate bounding box positioning.To address the issues above,this study proposes a new two-stage network structure to extract point cloud features directly by PointNet++,which effectively preserves the original point cloud coordinate information.To improve the detection accuracy,a shell-based modeling method is proposed.It roughly determines which spherical shell the coordinates belong to.Then,the results are refined to ground truth,thereby narrowing the localization range and improving the detection accuracy.To improve the recall of 3D object detection with bounding boxes,this paper designs a self-attention module for 3D object detection with a skip connection structure.Some of these features are highlighted by weighting them on the feature dimensions.After training,it makes the feature weights that are favorable for object detection get larger.Thus,the extracted features are more adapted to the object detection task.Extensive comparison experiments and ablation experiments conducted on the KITTI dataset verify the effectiveness of our proposed method in improving recall and precision.
文摘Light detection and ranging(LiDAR)sensors play a vital role in acquiring 3D point cloud data and extracting valuable information about objects for tasks such as autonomous driving,robotics,and virtual reality(VR).However,the sparse and disordered nature of the 3D point cloud poses significant challenges to feature extraction.Overcoming limitations is critical for 3D point cloud processing.3D point cloud object detection is a very challenging and crucial task,in which point cloud processing and feature extraction methods play a crucial role and have a significant impact on subsequent object detection performance.In this overview of outstanding work in object detection from the 3D point cloud,we specifically focus on summarizing methods employed in 3D point cloud processing.We introduce the way point clouds are processed in classical 3D object detection algorithms,and their improvements to solve the problems existing in point cloud processing.Different voxelization methods and point cloud sampling strategies will influence the extracted features,thereby impacting the final detection performance.
文摘Background In this work,we focus on the label layout problem:specifying the positions of overlaid virtual annotations in Virtual/Augmented Reality scenarios.Methods Designing a layout of labels that does not violate domain-specific design requirements,while at the same time satisfying aesthetic and functional principles of good design,can be a daunting task even for skilled visual designers.Presenting the annotations in 3D object space instead of projection space,allows for the preservation of spatial and depth cues.This results in stable layouts in dynamic environments,since the annotations are anchored in 3D space.Results In this paper we make two major contributions.First,we propose a technique for managing the layout and rendering of annotations in Virtual/Augmented Reality scenarios by manipulating the annotations directly in 3D space.For this,we make use of Artificial Potential Fields and use 3D geometric constraints to adapt them in 3D space.Second,we introduce PartLabeling:an open source platform in the form of a web application that acts as a much-needed generic framework allowing to easily add labeling algorithms and 3D models.This serves as a catalyst for researchers in this field to make their algorithms and implementations publicly available,as well as ensure research reproducibility.The PartLabeling framework relies on a dataset that we generate as a subset of the original PartNet dataset consisting of models suitable for the label management task.The dataset consists of 10003D models with part annotations.
基金This research was supported by the National Key Research and Development Program of China under Grant No.2017YFB0102502the Beijing Municipal Natural Science Foundation No.L191001+2 种基金the National Natural Science Foundation of China under Grant No.61672082 and 61822101the Newton Advanced Fellowship under Grant No.62061130221the Young Elite Scientists Sponsorship Program by Hunan Provincial Department of Education under Grant No.18B142.
文摘In recent years,autonomous driving technology has made good progress,but the noncooperative intelligence of vehicle for autonomous driving still has many technical bottlenecks when facing urban road autonomous driving challenges.V2I(Vehicle-to-Infrastructure)communication is a potential solution to enable cooperative intelligence of vehicles and roads.In this paper,the RGB-PVRCNN,an environment perception framework,is proposed to improve the environmental awareness of autonomous vehicles at intersections by leveraging V2I communication technology.This framework integrates vision feature based on PVRCNN.The normal distributions transform(NDT)point cloud registration algorithm is deployed both on onboard and roadside to obtain the position of the autonomous vehicles and to build the local map objects detected by roadside multi-sensor system are sent back to autonomous vehicles to enhance the perception ability of autonomous vehicles for benefiting path planning and traffic efficiency at the intersection.The field-testing results show that our method can effectively extend the environmental perception ability and range of autonomous vehicles at the intersection and outperform the PointPillar algorithm and the VoxelRCNN algorithm in detection accuracy.
基金the Key Project of the National Nature Science Foundation of China(No.61134009)Program for Changjiang Scholars and Innovation Research Team in University from the Ministry of Education,China(No.IRT1220)+2 种基金Specialized Research Funds for Shanghai Leading Talents,Project of the Shanghai Committee of Science and Technology,China(Nos.13JC1400200,11JC1400200)Innovation Program of Shanghai Municipal Education Commission,China(No.14ZZ067)the Fundamental Research Funds for the Central Universities,China(No.2232012A3-04)
文摘A method for deformation of 3D point clouds models was proposed with multi-constraints including arc-length constraints and multi-points position constraints. The energy function was built for the polyline which had been converted from the curve. Based on the minimum energy curve method, the curve on the mesh was deformed. The test results show that the proposed method has good performance. Compared with the other method,shape preserving of the curve is better. Finally,this method is used for the deformation of the 3D mannequin model. Circumference changes of the mannequin model can be reflected by the arc-length change in the size of the cross section.
基金National Natural Science Foundation of China(No.51805312)in part by Shanghai Sailing Program(No.18YF1409400)+4 种基金in part by Training and Funding Program of Shanghai College young teachers(No.ZZGCD15102)in part by Scientific Research Project of Shanghai University of Engineering Science(No.2016-19)in part by Science and Technology Commission of Shanghai Municipality(No.19030501100)in part by the Shanghai University of Engineering Science Innovation Fund for Graduate Students(No.18KY0613)in part by National Key R&D Program of China(No.2016YFC0802900).
文摘Road accident detection plays an important role in abnormal scene reconstruction for Intelligent Transportation Systems and abnormal events warning for autonomous driving.This paper presents a novel 3D object detector and adaptive space partitioning algorithm to infer traffic accidents quantitatively.Using 2D region proposals in an RGB image,this method generates deformable frustums based on point cloud for each 2D region proposal and then frustum-wisely extracts features based on the farthest point sampling network(FPS-Net)and feature extraction network(FE-Net).Subsequently,the encoder-decoder network(ED-Net)implements 3D-oriented bounding box(OBB)regression.Meanwhile,the adaptive least square regression(ALSR)method is proposed to split 3D OBB.Finally,the reduced OBB intersection test is carried out to detect traffic accidents via separating surface theorem(SST).In the experiments of KITTI benchmark,our proposed 3D object detector outperforms other state-of-theartmethods.Meanwhile,collision detection algorithm achieves the satisfactory performance of 91.8%accuracy on our SHTA dataset.
基金National Youth Natural Science Foundation of China(No.61806006)Innovation Program for Graduate of Jiangsu Province(No.KYLX160-781)Jiangsu University Superior Discipline Construction Project。
文摘In order to solve difficult detection of far and hard objects due to the sparseness and insufficient semantic information of LiDAR point cloud,a 3D object detection network with multi-modal data adaptive fusion is proposed,which makes use of multi-neighborhood information of voxel and image information.Firstly,design an improved ResNet that maintains the structure information of far and hard objects in low-resolution feature maps,which is more suitable for detection task.Meanwhile,semantema of each image feature map is enhanced by semantic information from all subsequent feature maps.Secondly,extract multi-neighborhood context information with different receptive field sizes to make up for the defect of sparseness of point cloud which improves the ability of voxel features to represent the spatial structure and semantic information of objects.Finally,propose a multi-modal feature adaptive fusion strategy which uses learnable weights to express the contribution of different modal features to the detection task,and voxel attention further enhances the fused feature expression of effective target objects.The experimental results on the KITTI benchmark show that this method outperforms VoxelNet with remarkable margins,i.e.increasing the AP by 8.78%and 5.49%on medium and hard difficulty levels.Meanwhile,our method achieves greater detection performance compared with many mainstream multi-modal methods,i.e.outperforming the AP by 1%compared with that of MVX-Net on medium and hard difficulty levels.
文摘Augmented Reality(AR)applications can be used to improve tasks and mitigate errors during facilities operation and maintenance.This article presents an AR system for facility management using a three-dimensional(3D)object tracking method.Through spatial mapping,the object of interest,a pipe trap underneath a sink,is tracked and mixed onto the AR visualization.From that,the maintenance steps are transformed into visible and animated instructions.Although some tracking issues related to the component parts were observed,the designed AR application results demonstrated the potential to improve facility management tasks.
基金supported by the National Natural Science Foundation of China (Grant No. 61803004)Aeronautical Science Foundation of China (Grant No. 20161375002)。
文摘Point clouds and RGB images are both critical data for 3D object detection. While recent multi-modal methods combine them directly and show remarkable performances, they ignore the distinct forms of these two types of data. For mitigating the influence of this intrinsic difference on performance, we propose a novel but effective fusion model named LI-Attention model, which takes both RGB features and point cloud features into consideration and assigns a weight to each RGB feature by attention mechanism.Furthermore, based on the LI-Attention model, we propose a 3D object detection method called image attention transformer network(IAT-Net) specialized for indoor RGB-D scene. Compared with previous work on multi-modal detection, IAT-Net fuses elaborate RGB features from 2D detection results with point cloud features in attention mechanism, meanwhile generates and refines 3D detection results with transformer model. Extensive experiments demonstrate that our approach outperforms stateof-the-art performance on two widely used benchmarks of indoor 3D object detection, SUN RGB-D and NYU Depth V2, while ablation studies have been provided to analyze the effect of each module. And the source code for the proposed IAT-Net is publicly available at https://github.com/wisper181/IAT-Net.
基金National Nature Science Foundation of China(62132021,62102435,62002375,62002376)National Key R&D Program of China(2018AAA0102200)NUDT Research Grants(ZK19-30)。
文摘Relation contexts have been proved to be useful for many challenging vision tasks.In the field of3D object detection,previous methods have been taking the advantage of context encoding,graph embedding,or explicit relation reasoning to extract relation contexts.However,there exist inevitably redundant relation contexts due to noisy or low-quality proposals.In fact,invalid relation contexts usually indicate underlying scene misunderstanding and ambiguity,which may,on the contrary,reduce the performance in complex scenes.Inspired by recent attention mechanism like Transformer,we propose a novel 3D attention-based relation module(ARM3D).It encompasses objectaware relation reasoning to extract pair-wise relation contexts among qualified proposals and an attention module to distribute attention weights towards different relation contexts.In this way,ARM3D can take full advantage of the useful relation contexts and filter those less relevant or even confusing contexts,which mitigates the ambiguity in detection.We have evaluated the effectiveness of ARM3D by plugging it into several state-of-the-art 3D object detectors and showing more accurate and robust detection results.Extensive experiments show the capability and generalization of ARM3D on 3D object detection.Our source code is available at https://github.com/lanlan96/ARM3D.
基金supported by the National Natural Science Foundation of China(No.52002285)the Shanghai Pujiang Program(No.2020PJD075)the Natural Science Foundation of Shanghai(No.21ZR1467400).
文摘In recent years,camera-and lidar-based 3D object detection has achieved great progress.However,the related researches mainly focus on normal illumination conditions;the performance of their 3D detection algorithms will decrease under low lighting scenarios such as in the night.This work attempts to improve the fusion strategies on 3D vehicle detection accuracy in multiple lighting conditions.First,distance and uncertainty information is incorporated to guide the“painting”of semantic information onto point cloud during the data preprocessing.Moreover,a multitask framework is designed,which incorpo-rates uncertainty learning to improve detection accuracy under low-illumination scenarios.In the validation on KITTI and Dark-KITTI benchmark,the proposed method increases the vehicle detection accuracy on the KITTI benchmark by 1.35%and the generality of the model is validated on the proposed Dark-KITTI dataset,with a gain of 0.64%for vehicle detection.