
A Stylized Image Caption Approach Based on Cross-Media Disentangled Representation Learning
(基于跨媒体解纠缠表示学习的风格化图像描述生成)
Cited by: 1
Abstract: The text produced by stylized image captioning is required not only to be semantically consistent with the given image, but also to conform to a given linguistic style. With the development of neural network techniques in computer vision and natural language generation, recent research on this topic has made remarkable progress. However, neural network models are black-box systems, and it remains difficult for humans to understand the styles and facts represented by the parameters in their latent space, or the relationship between them. To improve the understanding of the factual content and linguistic style attributes contained in the latent space, to strengthen control over both, and to enhance the controllability and interpretability of neural networks, this paper proposes a novel stylized image caption model built on disentanglement techniques, Disentangled Stylized Image Caption (DSIC). The model learns disentangled representations from images and caption texts in a non-parallel manner: two disentangled representation learning modules, D-Images and D-Captions, learn the disentangled factual and stylistic information in images and in image captions, respectively. At inference time, DSIC uses a caption decoder together with a specially designed capsule-network-based information aggregation method to fully exploit the previously learned cross-media representations, and generates captions in the target style by directly controlling the latent vectors. Experiments are conducted on the SentiCap and FlickrStyle10K datasets. The disentangled representation learning experiments demonstrate the effectiveness of the model's disentanglement, while the stylized image caption experiments show that the aggregated cross-media disentangled representations lead to better stylized captioning performance: compared with the baseline stylized image caption models, our method improves multiple metrics by 17% to 86%.

The task of stylized image caption aims to generate a natural language description that is semantically related to a given image and consistent with a given linguistic style. Both requirements make this task significantly more difficult than the traditional image caption task. However, with the availability of large-scale image-text corpora and advances in deep learning techniques for computer vision and natural language processing, stylized image caption research has made significant advances in recent years. Widely adopted neural networks have demonstrated their powerful ability to handle the complexities and challenges of the stylized image caption task. A typical stylized image caption model is usually an encoder-decoder architecture. The model inputs go through many layers of non-linear transformations, e.g. the ReLU layers in Convolutional Neural Networks (CNNs), to yield latent representations. This makes the latent representations and parameters of the model lack interpretability and controllability, which restricts the understanding of this task and its further improvement. In this paper, we focus on the problem of understanding and controlling the latent representations of linguistic style and factual content in stylized image caption models by learning disentangled representations. Existing disentanglement methods mainly work on single-modal data, such as images or text alone. In stylized image caption, however, two types of media, images and texts, are involved in learning a representation that is faithful to the underlying data structure, and how to disentangle the latent space of such cross-media data still needs to be explored. Inspired by the successful applications of disentangled representation learning in Computer Vision and Natural Language Processing, we propose a novel approach, Disentangled Stylized Image Caption (DSIC), to learn disentangled representations on non-parallel cross-media data. Within the VAE framework, two latent space filter modules, a style filter and a fact filter, are designed to enhance the disentangling performance. These filters slice the latent representation into different segments. Each filter retains the style-specific or fact-specific information by minimizing the proposed auxiliary classifier loss, and screens out other, irrelevant information through an additional auxiliary discriminator loss. Concretely, we use two modules, D-Images and D-Captions, to disentangle the stylistic and factual latent information in the images and captions, respectively. To fully utilize the cross-media disentangled latent information obtained from both images and captions, we adopt an aggregation method using a capsule network with routing-by-agreement. This makes it possible for the LSTM-based caption generator to generate stylized captions with target linguistic styles by directly controlling the learnt latent vectors. To validate the effectiveness of our approach, we conduct two groups of experiments, a disentanglement performance test and a stylized image caption test, on two popular public image caption datasets, SentiCap and FlickrStyle10K. The experimental results on disentanglement show that our model can successfully disentangle the stylistic and factual information, and reveal that style information exists both in human experience and in the images themselves. The experimental results on the stylized image caption datasets show that our model significantly outperforms the competitive baseline models and prove that the aggregated cross-media disentangled representations lead to around 17% to 86% improvements in terms of multiple performance metrics for stylized image caption.
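The abstract describes two latent-space "filter" modules that slice a VAE latent vector into a style segment and a fact segment, keeping target information with an auxiliary classifier loss and screening out the rest with an auxiliary discriminator loss. The following is a minimal PyTorch sketch of that slice-plus-auxiliary-losses idea; the module names, dimensions, and the entropy-maximization surrogate for the discriminator objective are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentFilters(nn.Module):
    """Sketch: slice a latent vector into style/fact segments with auxiliary losses."""

    def __init__(self, latent_dim=512, style_dim=128, num_styles=2):
        super().__init__()
        self.style_dim = style_dim
        # Auxiliary classifier: the style slice should predict the style label.
        self.style_classifier = nn.Linear(style_dim, num_styles)
        # Auxiliary discriminator: the fact slice should give no style clue.
        self.style_discriminator = nn.Linear(latent_dim - style_dim, num_styles)

    def split(self, z):
        # Style-specific segment first, fact-specific segment second (assumed layout).
        return z[:, : self.style_dim], z[:, self.style_dim :]

    def losses(self, z, style_labels):
        style_z, fact_z = self.split(z)
        # Classifier loss: retain style information inside the style segment.
        cls_loss = F.cross_entropy(self.style_classifier(style_z), style_labels)
        # Simplified surrogate for the adversarial "screening" objective: push the
        # fact segment's style predictions toward maximum entropy. A full adversarial
        # setup would alternate discriminator/encoder updates or use gradient reversal.
        log_probs = F.log_softmax(self.style_discriminator(fact_z), dim=-1)
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
        screen_loss = -entropy
        return cls_loss, screen_loss


if __name__ == "__main__":
    z = torch.randn(4, 512)              # latent vectors from a VAE encoder
    labels = torch.tensor([0, 1, 0, 1])  # e.g. positive / negative sentiment style
    filters = LatentFilters()
    cls_loss, screen_loss = filters.losses(z, labels)
    print(cls_loss.item(), screen_loss.item())
```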
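The abstract also states that the cross-media disentangled vectors are aggregated by a capsule network with routing-by-agreement before caption decoding. Below is a small, self-contained sketch of dynamic routing-by-agreement over a handful of input vectors, assuming illustrative capsule sizes and iteration count; how the routed output conditions the LSTM decoder is not shown and would follow the paper's design.

```python
import torch
import torch.nn as nn


def squash(s, dim=-1, eps=1e-8):
    # Capsule non-linearity: short vectors shrink toward 0, long ones toward unit length.
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / (norm_sq.sqrt() + eps)


class RoutingAggregator(nn.Module):
    """Sketch: aggregate input capsules into output capsules via routing-by-agreement."""

    def __init__(self, num_in=4, in_dim=128, num_out=1, out_dim=256, iters=3):
        super().__init__()
        self.iters = iters
        # One transformation matrix per (input capsule, output capsule) pair.
        self.W = nn.Parameter(0.01 * torch.randn(num_in, num_out, out_dim, in_dim))

    def forward(self, u):  # u: (batch, num_in, in_dim)
        # Prediction vectors u_hat[b, i, j] = W[i, j] @ u[b, i]
        u_hat = torch.einsum('ijdk,bik->bijd', self.W, u)
        # Routing logits, refined by agreement between predictions and outputs.
        b = torch.zeros(u.size(0), self.W.size(0), self.W.size(1), device=u.device)
        for _ in range(self.iters):
            c = b.softmax(dim=2)                       # coupling coefficients
            s = (c.unsqueeze(-1) * u_hat).sum(dim=1)   # weighted sum over inputs
            v = squash(s)                              # (batch, num_out, out_dim)
            b = b + torch.einsum('bijd,bjd->bij', u_hat, v)  # agreement update
        return v


if __name__ == "__main__":
    # e.g. style/fact vectors from images and captions as four input capsules
    inputs = torch.randn(2, 4, 128)
    agg = RoutingAggregator()
    print(agg(inputs).shape)  # torch.Size([2, 1, 256])
```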
Authors: LIN Ze-Hao (蔺泽浩), LI Guo-Dun (李国趸), ZENG Xiang-Ji (曾祥极), DENG Yue (邓悦), ZHANG Yin (张寅), ZHUANG Yue-Ting (庄越挺), Department of Computer Science and Technology, Zhejiang University, Hangzhou 310027
Source: Chinese Journal of Computers (《计算机学报》; indexed in EI, CAS, CSCD, and the Peking University Core list), 2022, No. 12, pp. 2510-2527 (18 pages)
Funding: Supported by the National Natural Science Foundation of China (62072399, 61402403, U19B2042), the China Knowledge Centre for Engineering Sciences and Technology, the Engineering Research Center of Digital Library (Ministry of Education), the China Engineering Science and Technology Data and Knowledge Technology Research Center, the Fundamental Research Funds for the Central Universities, and the Baidu Artificial Intelligence project fund.
Keywords: cross-media; machine learning; disentangled representation learning; stylized image caption; natural language generation