Abstract
The image captioning task aims to generate a natural-language description for a given image. To address the incomplete understanding of semantic information in existing algorithms, a multimodal Transformer model for image captioning is proposed. Its attention module captures intra-modal and inter-modal interactions simultaneously; furthermore, ELMo is used to obtain context-aware word embeddings, giving the model richer semantic input. The model can better understand and reason over complex multimodal information and generate more accurate natural-language descriptions. Extensive experiments on the Microsoft COCO dataset show a clear improvement over a baseline that uses bottom-up attention with an LSTM decoder: gains of 0.7, 0.4, 0.9, 1.3, 0.6, and 4.9 percentage points on BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, and CIDEr-D, respectively.
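The abstract's core mechanism — an attention module that attends both within a modality (e.g. image regions to image regions) and across modalities (e.g. words to image regions) — can be illustrated with a minimal numpy sketch. This is not the paper's implementation; the feature dimensions, the 36 bottom-up region features, and the 12 projected ELMo word vectors are illustrative assumptions, and the paper's full model adds multi-head attention, projections, and feed-forward layers on top of this primitive.

```python
import numpy as np

def scaled_dot_attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key axis, numerically stabilized.
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
img = rng.normal(size=(36, 64))  # hypothetical: 36 bottom-up region features
txt = rng.normal(size=(12, 64))  # hypothetical: 12 projected ELMo word embeddings

# Intra-modal: image regions attend to each other (self-attention).
intra = scaled_dot_attention(img, img, img)
# Inter-modal: word embeddings attend to image regions (cross-attention).
inter = scaled_dot_attention(txt, img, img)
print(intra.shape, inter.shape)  # (36, 64) (12, 64)
```

Using the same primitive for both directions is what lets a single attention module model intra- and inter-modal interactions simultaneously: only the choice of query/key/value sources changes.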
Authors
YANG Wenrui, SHEN Tao, ZHU Yan, ZENG Kai, LIU Yingli
(Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; Yunnan Key Laboratory of Computer Technologies Application, Kunming University of Science and Technology, Kunming 650500, China)
Source
Computer Engineering and Applications (《计算机工程与应用》)
CSCD; Peking University Core Journal (北大核心)
2022, No. 21, pp. 223-231 (9 pages)
Funding
National Natural Science Foundation of China (61971208, 61671225, 52061020, 61702128)
Key Project of the Applied Basic Research Program of Yunnan Province (2018FA034)
Talent Training Project of Kunming University of Science and Technology (KKSY201703016)