Abstract
Objective To address the problems of scarce paired data samples, imprecise translation results, and unstable model training in simulation-to-reality driving scene translation, a conditional diffusion model with multimodal data fusion is proposed.

Method First, to overcome the mode collapse and unstable training that affect mainstream image translation methods based on generative adversarial networks, the image translation model is built on the diffusion model, which offers strong generative diversity and good training stability. Second, to address the problem that conventional diffusion models cannot incorporate prior information and therefore cannot control image generation, a multimodal feature fusion method based on a multihead self-attention mechanism is proposed; it injects multimodal information into the denoising process of the diffusion model and thereby provides conditional control. Finally, exploiting the fact that semantic segmentation maps and depth maps represent object contour and depth information, respectively, these maps are fused with the noise image and fed into the denoising network, yielding a conditional diffusion model with multimodal data fusion that achieves more precise driving scene image translation.

Result The proposed model is trained on the Cityscapes dataset and compared with state-of-the-art methods. The results show that the proposed method produces driving scene translations with finer contour details and more consistent near-far distances, and it achieves better scores on the Fréchet inception distance (FID) and learned perceptual image patch similarity (LPIPS) metrics, namely 44.20 and 0.377, respectively.

Conclusion The proposed method effectively alleviates the problems of scarce data samples, imprecise translation results, and unstable model training in existing image translation methods, improves the translation precision of driving scenes, and provides theoretical support and a data basis for realizing safe and practical autonomous driving.
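To make the fusion mechanism above concrete, the following is a minimal PyTorch sketch of multimodal feature fusion with a multihead self-attention mechanism: the noisy image, semantic segmentation map, and depth map are fused early by concatenation and convolution, then attended over jointly. The module name, channel sizes, and concatenation scheme are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch (not the authors' code): multimodal feature fusion with
# multihead self-attention. The noisy image, semantic segmentation map, and depth
# map are concatenated, projected by a convolution, flattened into tokens, and
# fused with nn.MultiheadAttention. All channel sizes are assumptions.
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, img_ch=3, seg_ch=3, depth_ch=1, embed_dim=128, num_heads=8):
        super().__init__()
        # Early fusion: channel-wise concatenation followed by a conv projection.
        self.proj = nn.Conv2d(img_ch + seg_ch + depth_ch, embed_dim,
                              kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x_t, seg, depth):
        # x_t: noisy image at step t; seg / depth: conditioning maps (same H x W).
        h = self.proj(torch.cat([x_t, seg, depth], dim=1))   # (B, C, H, W)
        b, c, hh, ww = h.shape
        tokens = h.flatten(2).transpose(1, 2)                # (B, H*W, C)
        fused, _ = self.attn(tokens, tokens, tokens)         # self-attention fusion
        fused = self.norm(fused + tokens)                    # residual + norm
        return fused.transpose(1, 2).reshape(b, c, hh, ww)   # back to feature map

if __name__ == "__main__":
    fusion = MultimodalFusion()
    x = torch.randn(2, 3, 32, 32)   # noisy image
    s = torch.randn(2, 3, 32, 32)   # semantic segmentation map (color-encoded)
    d = torch.randn(2, 1, 32, 32)   # depth map
    print(fusion(x, s, d).shape)    # torch.Size([2, 128, 32, 32])
```

In the model described by the abstract, fused features of this kind are passed on to the next sublayer of the denoising network, so the segmentation and depth information can steer every denoising step.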
Objective Safety is the most important consideration for autonomous driving vehicles. New autonomous driving methods need numerous training and testing processes before their application in real vehicles. However, training and testing autonomous driving methods directly in real-world scenarios is a costly and risky task. Many researchers therefore first train and test their methods in simulated-world scenarios and then transfer the trained knowledge to real-world scenarios. However, many differences in scene modeling, lighting, and vehicle dynamics exist between the simulated and real-world scenarios. Therefore, an autonomous driving model trained in simulated-world scenarios cannot be effectively generalized to real-world scenarios. With the development of deep learning technologies, image translation, which aims to transform the content of an image from one presentation form to another, has made considerable achievements in many fields, such as image beautification, style transfer, scene design, and video special effects. If image translation technology is applied to the translation of simulated driving scenarios into real ones, it can not only solve the problem of the poor generalization capability of autonomous driving models but can also effectively reduce the cost and risk of training in real scenarios. Unfortunately, existing image translation methods applied in autonomous driving lack datasets of paired simulated and real scenarios, and most mainstream image translation methods are based on generative adversarial networks (GANs), which suffer from mode collapse and unstable training. The generated images also exhibit numerous detail problems, such as distorted object contours and unnatural small objects in the scene. These problems not only degrade the perception module of autonomous driving, which in turn affects driving decisions, but also lower the evaluation metrics of image translation. In this paper, a multimodal conditional diffusion model based on the denoising diffusion probabilistic model (DDPM), which has achieved remarkable success in various image generation tasks, is proposed to address the problems of insufficient paired simulation-real data, mode collapse, unstable training, and inadequate diversity of generated data in existing image translation.

Method First, an image translation method based on the diffusion model, which offers good training stability and generative diversity, is proposed to solve the problems of mode collapse and unstable training in existing mainstream image translation methods based on GANs. Second, a multimodal feature fusion method based on a multihead self-attention mechanism is developed to address the limitation of traditional diffusion models, which cannot incorporate prior information and thus cannot control the image generation process. The proposed method sends the early-fused data to convolutional layers, extracts high-level features, and then obtains high-level fused feature vectors through the multihead self-attention mechanism. Finally, considering that semantic segmentation maps and depth maps precisely represent contour and depth information, respectively, the conditional diffusion model (CDM) is designed by fusing the semantic segmentation and depth maps with the noise image before sending them to the denoising network. In this model, the semantic segmentation map, depth map, and noise image can perceive one another through the proposed multimodal feature fusion method. The output fusion features are then fed to the next sublayer in the network. After the iterative denoising process, the final output of the denoising network contains semantic and depth information; thus, the semantic segmentation and depth maps play a conditional guiding role in the diffusion model. Following the settings in the DDPM, the U-Net network is utilized as the denoising network. Compared with the U-Net in the DDPM, its self-attention layers are modified to match the improved self-attention proposed in this paper so that the fusion features can be learned effectively. After the denoising network in the CDM is trained, the proposed model can be applied to the image translation of simulated-to-real scenarios. Noise is first added to the simulated images collected from the Carla simulator, and the paired semantic segmentation and depth maps are then sent to the denoising network to perform a step-by-step denoising process. Finally, real driving scene images are obtained, realizing image translation with highly precise contour details and consistent distances between the simulated and real images.

Result The model is trained on the Cityscapes dataset and compared with state-of-the-art (SOTA) methods from recent years. Experimental results indicate that the proposed approach achieves superior translation results with improved semantic precision and richer contour details. The evaluation metrics include the Fréchet inception distance (FID) and the learned perceptual image patch similarity (LPIPS), which indicate the similarity between the generated and original images and the difference among the generated images, respectively. A lower FID score represents better generation quality with a smaller gap between the generated and real image distributions, while a higher LPIPS value indicates better generation diversity. Compared with the SOTA methods, the proposed method achieves better results on the FID and LPIPS indicators, with scores of 44.20 and 0.377, respectively.

Conclusion In this paper, a novel image-to-image translation method based on a conditional diffusion model, together with a multimodal fusion method built on a multihead attention mechanism, is proposed for autonomous driving scenarios. Experimental results show that the proposed method can effectively solve the problems of insufficient paired datasets, imprecise translation results, unstable training, and insufficient generation diversity in existing image translation methods. Thus, this method improves the image translation precision of driving scenarios and provides theoretical support and a data basis for realizing safe and practical autonomous driving systems.
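The simulated-to-real inference procedure described in the Method section above (noising a Carla frame and then denoising it step by step under semantic and depth guidance) can be outlined as follows. This is a simplified sketch using the standard DDPM update equations; the linear noise schedule, the starting step T0, and the denoiser interface eps_model(x_t, t, seg, depth) are assumptions for illustration rather than details taken from the paper.

```python
# Illustrative sketch (not the authors' code): simulated-to-real translation with a
# conditional DDPM. A simulated frame (e.g., from the Carla simulator) is noised and
# then denoised step by step while the denoiser is conditioned on the paired semantic
# segmentation and depth maps.
import torch

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    # Standard linear beta schedule from the DDPM paper (an assumption here).
    betas = torch.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    return betas, alphas, alpha_bars

@torch.no_grad()
def translate_sim_to_real(eps_model, sim_img, seg, depth, T0=600, T=1000):
    betas, alphas, alpha_bars = (s.to(sim_img.device) for s in make_schedule(T))
    # Forward process: add Gaussian noise to the simulated frame up to step T0.
    noise = torch.randn_like(sim_img)
    x = alpha_bars[T0 - 1].sqrt() * sim_img + (1 - alpha_bars[T0 - 1]).sqrt() * noise
    # Reverse process: iterative denoising conditioned on segmentation and depth.
    for t in reversed(range(T0)):
        t_batch = torch.full((x.size(0),), t, dtype=torch.long, device=x.device)
        eps = eps_model(x, t_batch, seg, depth)            # predicted noise
        coef = betas[t] / (1.0 - alpha_bars[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt()            # DDPM posterior mean
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # add sampling noise
    return x  # translated, real-style driving scene image
```

Here eps_model stands for the trained conditional U-Net denoiser with the fusion layers. Whether the simulated frame is noised fully or only partially is not specified in the abstract; the sketch assumes partial noising (T0 < T) so that coarse scene structure survives, while the segmentation and depth maps provide the conditional guidance for contours and distances.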
Authors
徐映芬
胡学敏
黄婷玉
李燊
陈龙
Xu Yingfen; Hu Xuemin; Huang Tingyu; Li Shen; Chen Long (School of Artificial Intelligence, Hubei University, Wuhan 430062, China; Key Laboratory of Intelligent Sensing System and Security, Ministry of Education, Wuhan 430062, China; Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China)
Source
《中国图象图形学报》
CSCD
Peking University Core Journals (北大核心)
2024, No. 11, pp. 3305-3318 (14 pages)
Journal of Image and Graphics
Funding
National Natural Science Foundation of China (62273135)
Hubei Provincial College Students' Innovation and Entrepreneurship Training Program (S202310512042, S202310512025)
Postgraduate Education and Teaching Reform Research Project of Hubei University (1190017755)
Original Exploration Seed Program of Hubei University (202416403000001)
Keywords
simulation to reality
image translation
diffusion model
multi-modal fusion
driving scenario