Abstract
To address the problems of inconsistency between generated images and text descriptions and of low image resolution in text-to-face generation methods, this paper proposes a cross-modal text-to-face image generation network framework. First, the pre-trained CLIP model is adopted to extract features from the text, and a conditioning augmentation module enhances the text semantic features to produce a hidden vector. Then, the hidden vector is projected by a mapping network into the latent space of the pre-trained StyleGAN model to obtain a disentangled latent vector, which is fed into the StyleGAN generator to produce high-resolution face images. Finally, a text reconstruction module regenerates text from the face image, and the semantic alignment loss between the reconstructed text and the input text is computed and used as semantic supervision to guide network training. Training and testing on the Multi-Modal CelebA-HQ and CelebAText-HQ datasets show that, compared with other methods, the proposed method generates high-resolution face images that are more consistent with the text description.
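The generation path described above (text feature → conditioning augmentation → mapping network → per-layer StyleGAN latent code) can be sketched numerically. This is a minimal NumPy illustration of the data flow only: all dimensions (a 512-d CLIP text embedding, a 128-d hidden vector, an 18×512 w+ code) are assumptions, and the random untrained weights stand in for learned layers.

```python
import numpy as np

rng = np.random.default_rng(0)

TEXT_DIM = 512    # CLIP text-embedding size (assumed)
LATENT_DIM = 512  # StyleGAN per-layer w size (assumed)
N_LAYERS = 18     # StyleGAN layer count at 1024x1024 (assumed)

def conditioning_augmentation(text_feat, out_dim=128):
    """Sample a hidden vector from a Gaussian whose mean and
    log-variance are (here: random, untrained) linear projections
    of the text feature -- the reparameterization trick."""
    w_mu = rng.standard_normal((out_dim, text_feat.size)) * 0.01
    w_logvar = rng.standard_normal((out_dim, text_feat.size)) * 0.01
    mu = w_mu @ text_feat
    logvar = w_logvar @ text_feat
    eps = rng.standard_normal(out_dim)
    return mu + np.exp(0.5 * logvar) * eps

def mapping_network(hidden, n_layers=N_LAYERS, w_dim=LATENT_DIM):
    """Project the hidden vector into a per-layer w+ code that the
    StyleGAN generator would consume."""
    w = rng.standard_normal((n_layers * w_dim, hidden.size)) * 0.01
    return (w @ hidden).reshape(n_layers, w_dim)

text_feat = rng.standard_normal(TEXT_DIM)  # stand-in for a CLIP embedding
hidden = conditioning_augmentation(text_feat)
w_plus = mapping_network(hidden)
print(hidden.shape, w_plus.shape)  # (128,) (18, 512)
```

The reparameterized sampling is what lets the conditioning augmentation stay differentiable while smoothing the text-feature manifold; the w+ code then modulates each generator layer independently, which is why the abstract calls the projected vector "disentangled".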
Authors
LI Yuanfan; ZHANG Lihong (College of Physics and Electronic Engineering, Shanxi University, Taiyuan 030006, China)
Source
Journal of Test and Measurement Technology, 2024, No. 2, pp. 154-160 (7 pages)
Funding
Higher Education Teaching Reform and Innovation Project of Shanxi Province (J2021086); Graduate Innovation Project of Shanxi Province (2021Y154).