Abstract
Objective To address the problems of scarce paired data samples, imprecise translation results, and unstable model training in simulation-to-reality driving scene translation, a conditional diffusion model with multimodal data fusion is proposed.

Method First, to overcome the mode collapse and unstable training that affect mainstream image translation methods based on generative adversarial networks, the image translation model is built on the diffusion model, which offers strong generative diversity and good training stability. Second, to address the problem that conventional diffusion models cannot incorporate prior information and therefore cannot control image generation, a multimodal feature fusion method based on a multihead self-attention mechanism is proposed; it injects multimodal information into the denoising process of the diffusion model and thereby provides conditional control. Finally, exploiting the fact that semantic segmentation maps and depth maps represent object contour and depth information, respectively, these maps are fused with the noise image and fed into the denoising network, yielding a conditional diffusion model with multimodal data fusion that achieves more precise driving scene image translation.

Result The proposed model is trained on the Cityscapes dataset and compared with state-of-the-art methods. The results show that the proposed method produces driving scene translations with finer contour details and more consistent near-far distances, and it achieves better scores on the Fréchet inception distance (FID) and learned perceptual image patch similarity (LPIPS) metrics, namely 44.20 and 0.377, respectively.

Conclusion The proposed method effectively alleviates the problems of scarce data samples, imprecise translation results, and unstable model training in existing image translation methods, improves the translation precision of driving scenes, and provides theoretical support and a data basis for realizing safe and practical autonomous driving.
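To make the fusion mechanism above concrete, the following is a minimal PyTorch sketch of multimodal feature fusion with a multihead self-attention mechanism: the noisy image, semantic segmentation map, and depth map are fused early by concatenation and convolution, then attended over jointly. The module name, channel sizes, and concatenation scheme are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch (not the authors' code): multimodal feature fusion with
# multihead self-attention. The noisy image, semantic segmentation map, and depth
# map are concatenated, projected by a convolution, flattened into tokens, and
# fused with nn.MultiheadAttention. All channel sizes are assumptions.
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, img_ch=3, seg_ch=3, depth_ch=1, embed_dim=128, num_heads=8):
        super().__init__()
        # Early fusion: channel-wise concatenation followed by a conv projection.
        self.proj = nn.Conv2d(img_ch + seg_ch + depth_ch, embed_dim,
                              kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x_t, seg, depth):
        # x_t: noisy image at step t; seg / depth: conditioning maps (same H x W).
        h = self.proj(torch.cat([x_t, seg, depth], dim=1))   # (B, C, H, W)
        b, c, hh, ww = h.shape
        tokens = h.flatten(2).transpose(1, 2)                # (B, H*W, C)
        fused, _ = self.attn(tokens, tokens, tokens)         # self-attention fusion
        fused = self.norm(fused + tokens)                    # residual + norm
        return fused.transpose(1, 2).reshape(b, c, hh, ww)   # back to feature map

if __name__ == "__main__":
    fusion = MultimodalFusion()
    x = torch.randn(2, 3, 32, 32)   # noisy image
    s = torch.randn(2, 3, 32, 32)   # semantic segmentation map (color-encoded)
    d = torch.randn(2, 1, 32, 32)   # depth map
    print(fusion(x, s, d).shape)    # torch.Size([2, 128, 32, 32])
```

In the model described by the abstract, fused features of this kind are passed on to the next sublayer of the denoising network, so the segmentation and depth information can steer every denoising step.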
Objective Safety is the most important consideration for autonomous driving vehicles. New autonomous driving methods need numerous training and testing processes before their application in real vehicles. However, training and testing autonomous driving methods directly in real-world scenarios is a costly and risky task. Many researchers therefore first train and test their methods in simulated-world scenarios and then transfer the trained knowledge to real-world scenarios. However, many differences in scene modeling, lighting, and vehicle dynamics exist between the simulated and real-world scenarios. Therefore, an autonomous driving model trained in simulated-world scenarios cannot be effectively generalized to real-world scenarios. With the development of deep learning technologies, image translation, which aims to transform the content of an image from one presentation form to another, has made considerable achievements in many fields, such as image beautification, style transfer, scene design, and video special effects. If image translation technology is applied to the translation of simulated driving scenarios into real ones, it can not only solve the problem of the poor generalization capability of autonomous driving models but can also effectively reduce the cost and risk of training in real scenarios. Unfortunately, existing image translation methods applied in autonomous driving lack datasets of paired simulated and real scenarios, and most mainstream image translation methods are based on generative adversarial networks (GANs), which suffer from mode collapse and unstable training. The generated images also exhibit numerous detail problems, such as distorted object contours and unnatural small objects in the scene. These problems not only degrade the perception module of autonomous driving, which in turn affects driving decisions, but also lower the evaluation metrics of image translation. In this paper, a multimodal conditional diffusion model based on the denoising diffusion probabilistic model (DDPM), which has achieved remarkable success in various image generation tasks, is proposed to address the problems of insufficient paired simulation-real data, mode collapse, unstable training, and inadequate diversity of generated data in existing image translation.

Method First, an image translation method based on the diffusion model, which offers good training stability and generative diversity, is proposed to solve the problems of mode collapse and unstable training in existing mainstream image translation methods based on GANs. Second, a multimodal feature fusion method based on a multihead self-attention mechanism is developed to address the limitation of traditional diffusion models, which cannot incorporate prior information and thus cannot control the image generation process. The proposed method sends the early-fused data to convolutional layers, extracts high-level features, and then obtains high-level fused feature vectors through the multihead self-attention mechanism. Finally, considering that semantic segmentation maps and depth maps precisely represent contour and depth information, respectively, the conditional diffusion model (CDM) is designed by fusing the semantic segmentation and depth maps with the noise image before sending them to the denoising network. In this model, the semantic segmentation map, depth map, and noise image can perceive one another through the proposed multimodal feature fusion method. The output fusion features are then fed to the next sublayer in the network. After the iterative denoising process, the final output of the denoising network contains semantic and depth information; thus, the semantic segmentation and depth maps play a conditional guiding role in the diffusion model. Following the settings in the DDPM, the U-Net network is utilized as the denoising network. Compared with the U-Net in the DDPM, its self-attention layers are modified to match the improved self-attention proposed in this paper so that the fusion features can be learned effectively. After the denoising network in the CDM is trained, the proposed model can be applied to the image translation of simulated-to-real scenarios. Noise is first added to the simulated images collected from the Carla simulator, and the paired semantic segmentation and depth maps are then sent to the denoising network to perform a step-by-step denoising process. Finally, real driving scene images are obtained, realizing image translation with highly precise contour details and consistent distances between the simulated and real images.

Result The model is trained on the Cityscapes dataset and compared with state-of-the-art (SOTA) methods from recent years. Experimental results indicate that the proposed approach achieves superior translation results with improved semantic precision and richer contour details. The evaluation metrics include the Fréchet inception distance (FID) and the learned perceptual image patch similarity (LPIPS), which indicate the similarity between the generated and original images and the difference among the generated images, respectively. A lower FID score represents better generation quality with a smaller gap between the generated and real image distributions, while a higher LPIPS value indicates better generation diversity. Compared with the SOTA methods, the proposed method achieves better results on the FID and LPIPS indicators, with scores of 44.20 and 0.377, respectively.

Conclusion In this paper, a novel image-to-image translation method based on a conditional diffusion model, together with a multimodal fusion method built on a multihead attention mechanism, is proposed for autonomous driving scenarios. Experimental results show that the proposed method can effectively solve the problems of insufficient paired datasets, imprecise translation results, unstable training, and insufficient generation diversity in existing image translation methods. Thus, this method improves the image translation precision of driving scenarios and provides theoretical support and a data basis for realizing safe and practical autonomous driving systems.
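The simulated-to-real inference procedure described in the Method section above (noising a Carla frame and then denoising it step by step under semantic and depth guidance) can be outlined as follows. This is a simplified sketch using the standard DDPM update equations; the linear noise schedule, the starting step T0, and the denoiser interface eps_model(x_t, t, seg, depth) are assumptions for illustration rather than details taken from the paper.

```python
# Illustrative sketch (not the authors' code): simulated-to-real translation with a
# conditional DDPM. A simulated frame (e.g., from the Carla simulator) is noised and
# then denoised step by step while the denoiser is conditioned on the paired semantic
# segmentation and depth maps.
import torch

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    # Standard linear beta schedule from the DDPM paper (an assumption here).
    betas = torch.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    return betas, alphas, alpha_bars

@torch.no_grad()
def translate_sim_to_real(eps_model, sim_img, seg, depth, T0=600, T=1000):
    betas, alphas, alpha_bars = (s.to(sim_img.device) for s in make_schedule(T))
    # Forward process: add Gaussian noise to the simulated frame up to step T0.
    noise = torch.randn_like(sim_img)
    x = alpha_bars[T0 - 1].sqrt() * sim_img + (1 - alpha_bars[T0 - 1]).sqrt() * noise
    # Reverse process: iterative denoising conditioned on segmentation and depth.
    for t in reversed(range(T0)):
        t_batch = torch.full((x.size(0),), t, dtype=torch.long, device=x.device)
        eps = eps_model(x, t_batch, seg, depth)            # predicted noise
        coef = betas[t] / (1.0 - alpha_bars[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt()            # DDPM posterior mean
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # add sampling noise
    return x  # translated, real-style driving scene image
```

Here eps_model stands for the trained conditional U-Net denoiser with the fusion layers. Whether the simulated frame is noised fully or only partially is not specified in the abstract; the sketch assumes partial noising (T0 < T) so that coarse scene structure survives, while the segmentation and depth maps provide the conditional guidance for contours and distances.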
Authors
徐映芬
胡学敏
黄婷玉
李燊
陈龙
Xu Yingfen; Hu Xuemin; Huang Tingyu; Li Shen; Chen Long (School of Artificial Intelligence, Hubei University, Wuhan 430062, China; Key Laboratory of Intelligent Sensing System and Security, Ministry of Education, Wuhan 430062, China; Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China)
Source
《中国图象图形学报》
CSCD
Peking University Core Journals (北大核心)
2024, No. 11, pp. 3305-3318 (14 pages)
Journal of Image and Graphics
Funding
National Natural Science Foundation of China (62273135)
Hubei Provincial College Students' Innovation and Entrepreneurship Training Program (S202310512042, S202310512025)
Postgraduate Education and Teaching Reform Research Project of Hubei University (1190017755)
Original Exploration Seed Program of Hubei University (202416403000001)
Keywords
simulation to reality
image translation
diffusion model
multi-modal fusion
driving scenario