
Combine Visual Features and Scene Semantics for Image Captioning (结合视觉特征和场景语义的图像描述生成)

Cited by: 23
Abstract: Most existing image captioning methods use only the visual information of an image to guide caption generation and lack the guidance of effective scene semantic information; moreover, the current visual attention mechanism cannot adjust its focus intensity on the image. To address these problems, this paper first proposes an improved visual attention model that introduces a focus intensity coefficient to adjust the attention intensity automatically. The focus intensity coefficient is a learnable scaling factor computed from the image information and the model's context information at each time step of decoding. When the attention mechanism computes the attention weight distribution over the image, it adaptively scales the input of the softmax function by this coefficient, automatically adjusting the "soft" or "hard" intensity of attention so that visual attention can be either concentrated or dispersed. The proposed attention model thus extracts more accurate visual information from the image.

Furthermore, we combine unsupervised and supervised learning to extract a series of topic words related to the image scene; these words represent the scene semantic information of the image and are added to the language model to guide caption generation. We assume that each image contains several scene topic concepts and that each topic concept can be represented by a set of topic words. Specifically, we use the latent Dirichlet allocation (LDA) model to cluster all caption texts in the dataset, taking the topic category of a caption text as the scene category of the corresponding image, and we train a multi-layer perceptron (MLP) to classify images into these topic concepts. Each topic category is represented by a series of topic words obtained from the clustering, so the scene semantic information of each image can be represented by topic words that are highly relevant to its scene. Adding these topic words to the language model gives it more prior knowledge; because the scene topic information is obtained by analyzing the captions, it carries global information about the captions to be generated, and the model can therefore predict important words that suit the image scene.

Finally, we use the attention mechanism to determine the visual information and the scene semantic information that the model attends to at each decoding time step, and a gating mechanism to control the proportion in which the two are input; the combined information then guides the model to generate more accurate, scene-specific captions. We evaluate the model on two standard datasets, MSCOCO and Flickr30k. The experimental results show that our approach generates more accurate captions than many state-of-the-art approaches and achieves an improvement of about 3% over the baseline on the overall evaluation metrics.
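The focus intensity coefficient described above acts as a learned, state-dependent temperature on the attention softmax: scaling the softmax inputs up sharpens ("hardens") the weight distribution, while scaling them down flattens ("softens") it. The paper's exact equations are not reproduced on this page, so the following PyTorch sketch is only illustrative; the additive scoring form, the softplus parameterization of the coefficient, and all layer names and sizes are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FocusIntensityAttention(nn.Module):
    # Additive visual attention whose softmax input is scaled by a focus
    # intensity coefficient beta computed from the current decoder state and
    # a global image summary. Larger beta -> sharper ("harder") attention;
    # smaller beta -> smoother ("softer") attention. Layer names and the
    # softplus parameterization are assumptions, not the authors' equations.
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)
        self.beta_net = nn.Linear(hidden_dim + feat_dim, 1)  # beta > 0 via softplus

    def forward(self, feats, hidden):
        # feats: (B, R, feat_dim) regional image features; hidden: (B, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                                      # raw scores (B, R)
        mean_feat = feats.mean(dim=1)                       # global image summary
        beta = F.softplus(self.beta_net(torch.cat([hidden, mean_feat], dim=-1)))
        alpha = F.softmax(beta * e, dim=-1)                 # temperature-scaled weights
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # attended context (B, feat_dim)
        return context, alpha, beta

For instance, FocusIntensityAttention(2048, 512, 512) applied to feats = torch.randn(2, 36, 2048) and hidden = torch.randn(2, 512) returns a (2, 2048) context vector together with the attention weights and the per-example focus coefficient.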
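The scene-semantic pipeline is a two-stage procedure: LDA clusters the training captions into topics (unsupervised), each image inherits the dominant topic of its captions, and an MLP is trained to predict that topic from image features so topic words are available for unseen images at test time. Below is a minimal scikit-learn sketch on a toy corpus; the real system operates on the MSCOCO/Flickr30k caption sets, and the topic count, top-word count, and random stand-in image features here are placeholders.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.neural_network import MLPClassifier

# Toy captions standing in for the full MSCOCO/Flickr30k caption corpus.
captions = [
    "a man riding a wave on a surfboard in the ocean",
    "a surfer rides a large wave near the sandy beach",
    "a plate of food with vegetables on a wooden table",
    "a bowl of salad and fresh bread served for dinner",
]

# Stage 1 (unsupervised): cluster caption texts into scene topics with LDA.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(captions)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Each topic is represented by its highest-probability words; these are the
# "topic words" that later serve as scene semantic input to the decoder.
vocab = vec.get_feature_names_out()
topic_words = [[vocab[i] for i in comp.argsort()[::-1][:5]]
               for comp in lda.components_]

# Each image is labeled with the dominant topic of its caption(s).
labels = lda.transform(X).argmax(axis=1)

# Stage 2 (supervised): an MLP maps image features (random stand-ins for CNN
# features here) to the topic label, so unseen images can be assigned topic
# words at test time.
img_feats = np.random.default_rng(0).normal(size=(len(captions), 16))
if len(set(labels)) > 1:  # guard: the tiny toy corpus may collapse to one topic
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    clf.fit(img_feats, labels)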
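At each decoding step the model attends over both sources and mixes the resulting visual context and topic-word context through a gate before feeding the word predictor. The precise gating formulation is likewise not given on this page; the element-wise sigmoid gate below is one plausible reading, assuming both contexts are projected to a common dimension, and every name in it is hypothetical.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    # Mixes the attended visual context and the attended topic-word (scene
    # semantic) context at one decoding step. A sigmoid gate, conditioned on
    # both contexts and the decoder state, decides per dimension how much
    # each source contributes. The paper's exact formulation may differ.
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim + hidden_dim, dim)

    def forward(self, v_ctx, s_ctx, hidden):
        # v_ctx, s_ctx: (B, dim) contexts; hidden: (B, hidden_dim) decoder state
        g = torch.sigmoid(self.gate(torch.cat([v_ctx, s_ctx, hidden], dim=-1)))
        return g * v_ctx + (1.0 - g) * s_ctx  # fused input for word prediction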
Authors: LI Zhi-Xin (李志欣); WEI Hai-Yang (魏海洋); HUANG Fei-Cheng (黄飞成); ZHANG Can-Long (张灿龙); MA Hui-Fang (马慧芳); SHI Zhong-Zhi (史忠植) (Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin 541004; College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070; Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190)
Source: Chinese Journal of Computers (《计算机学报》; EI, CSCD, Peking University Core Journal), 2020, No. 9, pp. 1624-1640 (17 pages).
Funding: National Natural Science Foundation of China (61966004, 61663004, 61866004, 61762078); Natural Science Foundation of Guangxi (2019GXNSFDA245018, 2018GXNSFDA281009, 2017GXNSFAA198365); Guangxi Key Lab of Multi-source Information Mining & Security (16-A-03-02, MIMS18-08, MIMS19-02).
Keywords: image captioning; attention mechanism; scene semantics; encoder-decoder framework; reinforcement learning