Abstract
With the development of generative adversarial networks (GANs), synthesizing images from textual descriptions has recently become an active research area. However, the textual descriptions used for image generation are usually in English, and the generated objects are mostly faces, flowers, birds, etc.; few studies have addressed the generation of Chinese paintings from Chinese descriptions. Moreover, text-to-image generation often requires a large number of labeled image-text pairs, which makes dataset construction expensive. With the advance of multimodal pre-training, the GAN generation process can be guided in an optimized way, greatly reducing the demand for datasets and computational resources. In this study, a multi-domain vector quantization generative adversarial network (VQGAN) model is proposed to simultaneously generate Chinese paintings in multiple domains. Furthermore, the multimodal pre-trained model WenLan is used to compute a distance loss between the generated images and the textual description, and semantic consistency between image and text is achieved by optimizing the latent-space variables fed into the multi-domain VQGAN. Finally, ablation experiments are conducted to compare different variants of the multi-domain VQGAN in terms of the FID and R-precision metrics, together with a user study. The results demonstrate that the complete multi-domain VQGAN model outperforms the original VQGAN model in both image quality and text-image semantic consistency.
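The optimization scheme the abstract describes — decoding a latent variable into an image and descending the distance between its embedding and the text embedding — can be sketched as below. This is a minimal illustration, not the paper's implementation: `decode`, `embed_image`, and `embed_text` are hypothetical stand-ins for the multi-domain VQGAN decoder and the WenLan image/text encoders, and a finite-difference gradient stands in for backpropagation through the pre-trained networks purely to keep the sketch self-contained.

```python
import numpy as np

def decode(z):
    """Stand-in for the VQGAN decoder: latent vector -> 'image' features."""
    return np.tanh(z)

def embed_image(x):
    """Stand-in for the WenLan image encoder: L2-normalized embedding."""
    return x / (np.linalg.norm(x) + 1e-8)

def embed_text(seed, dim):
    """Stand-in for the WenLan text encoder: a fixed random unit vector."""
    t = np.random.default_rng(seed).normal(size=dim)
    return t / np.linalg.norm(t)

def distance_loss(z, t):
    """Cosine distance between the decoded image and the text embedding."""
    return 1.0 - float(embed_image(decode(z)) @ t)

def optimize_latent(t, dim=8, steps=100, lr=0.1, eps=1e-4, seed=1):
    """Gradient-descend the latent z to minimize the image-text distance."""
    z = 0.1 * np.random.default_rng(seed).normal(size=dim)
    for _ in range(steps):
        base = distance_loss(z, t)
        grad = np.zeros(dim)
        for i in range(dim):  # finite-difference gradient (sketch only)
            zp = z.copy()
            zp[i] += eps
            grad[i] = (distance_loss(zp, t) - base) / eps
        z -= lr * grad
    return z

t = embed_text(0, 8)                                   # target text embedding
z0 = 0.1 * np.random.default_rng(1).normal(size=8)     # random initial latent
z = optimize_latent(t)
# After optimization, the decoded "image" embedding is closer to the
# text embedding than the random initialization's was.
```

In the actual method, the loss gradient flows through the frozen WenLan encoders and VQGAN decoder, so only the latent variables are updated; neither pre-trained model is fine-tuned, which is what keeps the data and compute requirements low.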
Authors
SUN Ze-Long, YANG Guo-Xing, WEN Jing-Yuan, FEI Nan-Yi, LU Zhi-Wu, WEN Ji-Rong
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing 100872, China; School of Information, Renmin University of China, Beijing 100872, China
Source
Journal of Software (《软件学报》), 2023, No. 5, pp. 2116-2133 (18 pages)
Indexed in: EI, CSCD, Peking University Core Journals
Funding
National Natural Science Foundation of China (61976220, 61832017); Beijing Outstanding Young Scientist Program (BJJWZYJH012019100020098).
Keywords
text-to-image generation
multi-domain generation
Chinese painting generation