Abstract
The image captioning task requires a machine to automatically generate natural language text describing the semantic content of an image, thereby transforming visual information into textual descriptions that facilitate image management, retrieval, classification, and other tasks. Image difference captioning extends the image captioning task: it requires generating natural language sentences that describe the differences between two similar images. The difficulty of this task lies in determining the visual semantic difference between the two images and converting that visual difference information into the corresponding textual description. Previous studies do not make full use of textual information in the training stage to model cross-modal semantic associations between visual difference information and text. To address this, a model framework named TA-IDC is proposed that uses textual information to assist training. It adopts a multi-task learning approach, adding a text encoder to the conventional encoder-decoder structure and introducing textual information during the training stage through two methods, text-assisted decoding and mixed decoding. This models the semantic associations between the visual and text modalities and yields higher-quality image difference captions. Experiments show that TA-IDC outperforms the best results of existing models on the main metrics by 12%, 2%, and 3% on three image difference captioning datasets, respectively.
Authors
陈玮婧
王维莹
金琴
CHEN Weijing; WANG Weiying; JIN Qin (School of Information, Renmin University of China, Beijing 100872, China)
Source
《北京航空航天大学学报》
EI
CAS
CSCD
PKU Core (Peking University Core Journals)
2022, No. 8, pp. 1436-1444 (9 pages)
Journal of Beijing University of Aeronautics and Astronautics
Funding
National Natural Science Foundation of China (61772535, 62072462)
Beijing Natural Science Foundation (4192028)
Keywords
image difference captioning
modal fusion
image captioning
computer vision
natural language processing