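Abstract
In applications of digital simulation technology, and especially in the development of autonomous driving, object detection is a crucial component: it concerns the perception of objects in the surrounding environment and provides key information for the decision-making and planning of intelligent equipment. In recent years, with advances in sensor technology, images and point clouds have become the two main sources of perception data, each with distinct advantages for research on deep learning-based object detection. To study existing point cloud-based and image-based object detection methods more comprehensively, this paper systematically reviews and summarizes three classes of object detection algorithms: those based on images, those based on point clouds, and those combining the two. It aims to explore how the two data sources can be fused to improve the accuracy, stability, and robustness of object detection, and it offers an outlook on future directions for environmental object detection that fuses point clouds and images.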
In the field of digital simulation technology applications, especially in the development of autonomous driving, object detection is a crucial component. It involves the perception of objects in the surrounding environment, which provides essential information for the decision-making and planning of intelligent systems. Traditional object detection methods typically involve steps such as feature extraction, object classification, and position regression on images. However, these methods are limited by manually designed features and the performance of classifiers, which restricts their effectiveness in complex scenes and for objects with significant variations. The advent of deep learning technology has led to the widespread adoption of object detection methods based on deep neural networks. Notably, the convolutional neural network (CNN) has emerged as one of the most prominent approaches in this field. By leveraging multiple layers of convolution and pooling operations, CNNs are capable of automatically extracting meaningful feature representations from image data. In addition to image data, light detection and ranging (LiDAR) data play a crucial role in object detection tasks, particularly for 3D object detection. LiDAR data represent objects through a set of unordered and discrete points on their surfaces. Accurately detecting the point cloud clusters that represent objects and estimating their pose from these unordered points is a challenging task. LiDAR data, with their unique characteristics, offer high-precision obstacle detection and distance measurement, which contributes to the perception of surrounding roadways, vehicles, and pedestrians. In real-world autonomous driving and related environmental perception scenarios, relying on a single modality often presents numerous challenges. For instance, while image data can provide a wide variety of high-resolution visual information such as color, texture, and shape, they are susceptible to lighting conditions. In addition, models may struggle to handle occlusions, in which objects obstruct the view, because of the inherent limitations of camera perspectives. LiDAR, by contrast, performs well in challenging lighting conditions and excels at accurately localizing objects in space in diverse and harsh weather scenarios. However, it has certain limitations. Specifically, the low resolution of LiDAR input data results in sparse point clouds when detecting distant targets. Extracting semantic information from LiDAR data is also more challenging than extracting it from image data. Thus, an increasing number of researchers are emphasizing multimodal environmental object detection. A robust multimodal perception algorithm can offer richer feature information, enhanced adaptability to diverse environments, and improved detection accuracy. Such capabilities empower the perception system to deliver reliable results across various environmental conditions. Multimodal object detection algorithms nevertheless face limitations and pressing challenges. One challenge is the difficulty of data annotation. Annotating point cloud and image data is relatively complex and time consuming, particularly for large-scale datasets. Moreover, accurately labeling point cloud data is challenging because of their sparsity and the presence of noisy points. Addressing these issues is crucial for further advances in multimodal object detection. In addition, the data structure and feature representation of point cloud and image data, as two distinct perception modalities, differ significantly. Current research focuses on effectively integrating the information from the two modalities and extracting accurate and comprehensive features that can be put to use. Furthermore, processing large-scale point cloud data is equally challenging. Point cloud data typically encompass a substantial number of 3D coordinates, which places greater demands on computing resources and algorithmic efficiency than pure image data do. This study aims to summarize and refine existing approaches to help researchers gain a deeper and more efficient understanding of object detection algorithms that integrate images and point clouds. It classifies object detection algorithms into three categories: those based on images, those based on point clouds, and those based on the fusion of both. Furthermore, we analyze the strengths and weaknesses of various methods while discussing potential solutions. Moreover, we provide a comprehensive review of the development of object detection algorithms that fuse point clouds and images, considering aspects such as data collection, representation, and model design. Finally, we give a perspective on the future development direction of environmental object detection, with the goal of enhancing the overall capabilities of autonomous systems.
Authors
贾明达
杨金明
孟维亮
郭建伟
张吉光
张晓鹏
Jia Mingda; Yang Jinming; Meng Weiliang; Guo Jianwei; Zhang Jiguang; Zhang Xiaopeng (State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China)
Source
《中国图象图形学报》
CSCD
Peking University Core Journal (北大核心)
2024, Issue 6, pp. 1765-1784 (20 pages)
Journal of Image and Graphics
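Funding
Beijing Natural Science Foundation-Fengtai Rail Transit Frontier Research Joint Fund (L231013)
National Natural Science Foundation of China (U21A20515, 62376271, 62172416, 52175493, U22B2034, 62365014)
Open Project of the State Key Laboratory of Virtual Reality Technology and Systems, Beihang University (VRLAB2023B01)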
Keywords
point cloud
autonomous driving
multimodal
object detection
fusion