Abstract
Existing hierarchical text-to-image generation methods rely solely on up-sampling for feature extraction in the initial image generation stage. Because up-sampling is essentially a convolution operation, it inherits the limitations of convolution: global information is ignored and long-range semantics cannot interact. Some methods have added self-attention mechanisms to the model, but problems such as missing image details and structural errors remain. To address these problems, a generative adversarial network model based on self-supervised attention and image feature fusion, SAF-GAN, is proposed. A self-supervised module based on ContNet is added to the initial feature generation stage, where an attention mechanism learns mappings among image features autonomously. The contextual relationships among features guide the dynamic attention matrix, tightly integrating context mining with self-attention learning and improving the quality of the generated low-resolution image features; high-resolution images are then refined through alternating training of the networks at different stages. In addition, a feature fusion enhancement module is introduced. By fusing the low-resolution features from the previous stage with the features of the current stage, the generator can make full use of the high-level semantic information of the low-level features and the high-resolution information of the high-level features, which better preserves semantic consistency across feature maps of different resolutions and yields realistic high-resolution images. Experimental results show that, compared with the baseline model (AttnGAN), SAF-GAN improves both the IS and FID metrics: on the CUB dataset the IS score increases by 0.31 and the FID decreases by 3.45, and on the COCO dataset the IS score increases by 2.68 and the FID decreases by 5.18. These results indicate that SAF-GAN generates more realistic images and support the effectiveness of the proposed method.
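The feature fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the fusion operator, so everything here is an assumption made for illustration: nearest-neighbour up-sampling, a weighted element-wise sum, the blending weight `alpha`, and the function names `upsample_nearest` and `fuse`. It is not the SAF-GAN implementation, and a single channel stored as nested lists stands in for a real feature tensor.

```python
from typing import List

Grid = List[List[float]]  # one channel of a feature map, rows x cols


def upsample_nearest(feat: Grid, scale: int = 2) -> Grid:
    """Nearest-neighbour up-sampling: repeat every value `scale` times per axis."""
    out: Grid = []
    for row in feat:
        wide = [v for v in row for _ in range(scale)]
        out.extend([list(wide) for _ in range(scale)])
    return out


def fuse(prev_low: Grid, current: Grid, alpha: float = 0.5) -> Grid:
    """Blend up-sampled previous-stage features with current-stage features.

    Assumed operator (not taken from the paper):
        fused = alpha * upsample(prev_low) + (1 - alpha) * current
    """
    scale = len(current) // len(prev_low)
    up = upsample_nearest(prev_low, scale)
    if len(up) != len(current) or len(up[0]) != len(current[0]):
        raise ValueError("current must be an integer multiple of prev_low in size")
    return [
        [alpha * u + (1.0 - alpha) * c for u, c in zip(up_row, cur_row)]
        for up_row, cur_row in zip(up, current)
    ]


# Hypothetical 2x2 previous-stage map and 4x4 current-stage map.
prev_low = [[1.0, 2.0],
            [3.0, 4.0]]
current = [[0.0] * 4 for _ in range(4)]
fused = fuse(prev_low, current, alpha=0.5)
```

A weighted sum keeps the fused map at the resolution of the current stage while carrying over the coarse layout of the previous stage; a learned operator such as concatenation followed by convolution would be an equally plausible choice.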
Authors
廖涌卉
张海涛
金海波
LIAO Yonghui; ZHANG Haitao; JIN Haibo (School of Software, Liaoning Technical University, Huludao 125105, China; Computer Department, Shantou Polytechnic, Shantou 515071, China)
Source
《液晶与显示》
CAS
CSCD
PKU Core (Peking University Core Journals)
2024, Issue 2, pp. 180-191 (12 pages)
Chinese Journal of Liquid Crystals and Displays
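Funding
National Natural Science Foundation of China (No. 62173171)
General Program of the Department of Science and Technology of Liaoning Province (No. 2022-MS-397).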