Abstract
Object features extracted by object detection algorithms play an important role in image caption generation. However, using only object detection features as the input of an image captioning task loses information beyond the key objects, and the generated captions lack an accurate expression of the relationships between objects in the image. To address these shortcomings, an object Transformer encoder for encoding object features within an image and a shifted-window Transformer encoder for encoding relational features within an image are proposed, jointly encoding different aspects of the image from different perspectives. The object features produced by the object Transformer encoder are fused with the relational features produced by the shifted-window Transformer encoder by concatenation, so that the image's internal relational features and local object features are combined. Finally, a Transformer decoder decodes the fused features to generate the corresponding image caption. Experiments on the MS-COCO (Microsoft Common Objects in COntext) dataset show that the proposed model significantly outperforms the baseline: its BLEU-4 (BiLingual Evaluation Understudy 4-gram), METEOR (Metric for Evaluation of Translation with Explicit ORdering), ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation-Longest common subsequence), and CIDEr (Consensus-based Image Description Evaluation) scores reach 38.6%, 28.7%, 58.2%, and 127.4%, respectively, surpassing traditional image captioning models and generating more detailed and accurate captions.
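To make the fusion step concrete, the following is a minimal sketch of the dual-encoder design the abstract describes: two Transformer encoders whose outputs are concatenated along the sequence dimension into a single memory that a Transformer decoder cross-attends to. All module names, dimensions, and the use of PyTorch's built-in Transformer layers are illustrative assumptions, not the authors' implementation; in particular, a plain Transformer encoder stands in for the shifted-window encoder, and the causal target mask needed for training is omitted for brevity.

```python
# Minimal sketch (not the authors' code) of concatenation-based fusion of
# object features and relational features, followed by a Transformer decoder.
import torch
import torch.nn as nn


class DualEncoderCaptioner(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_layers=3, vocab_size=10000):
        super().__init__()
        # "Object Transformer encoder": attends over detected-object features.
        self.object_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        # Stand-in for the shifted-window relational encoder; a plain
        # Transformer encoder is used here purely for illustration.
        self.relation_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, object_feats, relation_feats, caption_tokens):
        # object_feats:   (B, N_obj, d_model) from an object detector
        # relation_feats: (B, N_win, d_model) from window-level image features
        obj = self.object_encoder(object_feats)
        rel = self.relation_encoder(relation_feats)
        # Fusion by concatenation along the sequence dimension, so the
        # decoder cross-attends to both encoded feature sets at once.
        memory = torch.cat([obj, rel], dim=1)
        tgt = self.token_embed(caption_tokens)
        out = self.decoder(tgt, memory)
        return self.lm_head(out)


if __name__ == "__main__":
    model = DualEncoderCaptioner()
    obj = torch.randn(2, 36, 512)       # e.g. 36 detected regions per image
    rel = torch.randn(2, 49, 512)       # e.g. a 7x7 grid of window features
    tokens = torch.randint(0, 10000, (2, 20))
    logits = model(obj, rel, tokens)
    print(logits.shape)                 # torch.Size([2, 20, 10000])
```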
Authors
HENG Hongjun (衡红军); FAN Yuchen (范昱辰); WANG Jialiang (王家亮)
School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
Source
Computer Engineering (《计算机工程》), 2023, No. 2, pp. 199-205 (7 pages)
Indexed in: CAS, CSCD, Peking University Core Journals (北大核心)
Funding
National Natural Science Foundation of China (U1333109).