Abstract
Although generative adversarial networks (GANs) have achieved great success in face image generation and manipulation, discovering meaningful directions in their latent space for manipulating the semantic attributes of faces remains a major challenge in computer vision. Existing solutions typically require large amounts of labeled data and repeated network fine-tuning, yet collecting and annotating such data involves many difficulties, such as high technical barriers and substantial labor costs. Recent studies have attempted to overcome the shortage of labeled data by exploiting pre-trained models. Although such efforts have been shown to accomplish the above task, neither the accuracy of the manipulation nor the realism of the results meets the needs of real face editing scenarios. To address these problems, this study leverages the joint image-text representation capability of contrastive language-image pre-training (CLIP) to encode images and text descriptions into a shared latent space. With carefully designed network structures and loss functions, the proposed framework accurately identifies the relevant facial attributes and learns a multi-level residual mapping network, which predicts latent code residuals from the image and text embeddings; the pre-trained generative model StyleGAN2 then performs high-quality face image generation and editing. Extensive experiments demonstrate the superiority of the proposed approach in terms of manipulation accuracy, visual realism, and preservation of irrelevant attributes.
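The pipeline described in the abstract can be illustrated with a minimal sketch. The code below is not the authors' released implementation; it only assumes the general structure stated above: CLIP encodes the source image and the text description into a shared embedding space, a residual mapping network predicts offsets for the StyleGAN2 W+ latent code, and a pre-trained StyleGAN2 synthesis network renders the edited face. The helpers load_stylegan2_synthesis and invert_face, the mapper width, the example file name and prompt, and the coarse/medium/fine split of the 18 W+ layers are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn
import clip                      # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

class ResidualMapper(nn.Module):
    # Sketch of one level of the residual mapping network: it takes one group of
    # W+ latent codes together with the concatenated CLIP image/text embeddings
    # and predicts a residual for that group.
    def __init__(self, clip_dim=512, w_dim=512, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * clip_dim + w_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, w_dim),
        )

    def forward(self, w_group, img_emb, txt_emb):
        # w_group: (B, L, 512); the CLIP embeddings are broadcast to every layer in the group.
        cond = torch.cat([img_emb, txt_emb], dim=-1)                 # (B, 1024)
        cond = cond.unsqueeze(1).expand(-1, w_group.size(1), -1)     # (B, L, 1024)
        return self.net(torch.cat([w_group, cond], dim=-1))          # residual for this group

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical helpers (not real APIs): a pre-trained StyleGAN2 synthesis network
# and a GAN inversion encoder that maps a real face to its W+ latent code.
synthesis = load_stylegan2_synthesis().to(device)                   # assumed helper
source_image = Image.open("face.jpg")                               # example input
w_plus = invert_face(source_image).to(device)                        # assumed helper, (1, 18, 512)

# Encode the source image and the text description into the shared CLIP space.
img_emb = clip_model.encode_image(preprocess(source_image).unsqueeze(0).to(device)).float()
txt_emb = clip_model.encode_text(clip.tokenize(["an old face with glasses"]).to(device)).float()

# Three mappers for coarse/medium/fine layer groups (illustrative split: 0-3 / 4-7 / 8-17).
groups = [(0, 4), (4, 8), (8, 18)]
mappers = [ResidualMapper().to(device) for _ in groups]
residuals = [m(w_plus[:, s:e], img_emb, txt_emb) for m, (s, e) in zip(mappers, groups)]
w_edited = w_plus + torch.cat(residuals, dim=1)

edited_image = synthesis(w_edited)    # edited face rendered by the pre-trained generator (signature assumed)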
Authors
LI Zong-Lin, ZHANG Sheng-Ping, LIU Yang, ZHANG Zhao-Xin, ZHANG Wei-Gang, HUANG Qing-Ming (School of Computer Science and Technology, Harbin Institute of Technology, Weihai 264209, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China)
Source
Journal of Software (《软件学报》), 2023, No. 5, pp. 2101-2115 (15 pages). Indexed in EI, CSCD, and Peking University Core Journals.
Funding
National Natural Science Foundation of China (61872112, 61976069).
Keywords
multimodal learning
pre-trained model
face image generation
face image manipulation
generative adversarial network (GAN)