Abstract
With the growth of Internet multimedia data, text-image retrieval has become a research hotspot. In image-text retrieval, a mutual attention mechanism is commonly used to achieve better matching results by letting image and text features interact. However, this approach cannot produce independent image and text features: the features of every candidate pair must interact at query time, which consumes a great deal of time and makes fast large-scale retrieval impossible. In contrast, Transformer-based cross-modal image-text feature learning has achieved good results and is attracting increasing attention from researchers. This paper designs a novel Transformer-based text-image retrieval network (HAS-Net) with the following improvements: 1) a hierarchical Transformer encoding structure is designed to better exploit low-level syntactic information and high-level semantic information; 2) the conventional global feature aggregation method is improved by designing a new aggregation scheme based on self-attention; 3) by sharing the Transformer encoding layer, image features and text features are mapped into a common embedding space. Experiments on the MS-COCO and Flickr30k datasets show that cross-modal retrieval performance is improved and leads comparable algorithms, demonstrating the effectiveness of the designed network structure.
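The abstract does not give implementation details of the self-attention-based aggregation in improvement 2). As a rough illustration of the general idea, the following is a minimal sketch of one common attention-pooling scheme, assuming a single learned query vector that scores each token-level feature and produces a weighted global feature (the names `attention_pool` and `w_query` are hypothetical, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(features, w_query):
    """Aggregate token-level features into one global feature.

    features: (n_tokens, d) outputs of a Transformer encoder layer
    w_query:  (d,) learned query that scores each token's importance
    """
    scores = features @ w_query / np.sqrt(features.shape[1])
    weights = softmax(scores)        # (n_tokens,) importance weights, sum to 1
    return weights @ features        # (d,) attention-weighted global feature

# toy usage: 5 token features of dimension 8
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
w_query = rng.normal(size=8)
global_feat = attention_pool(tokens, w_query)
```

Unlike mean or max pooling, the weighting here is input-dependent, which is presumably what lets such a scheme emphasize the tokens most relevant for matching; the actual HAS-Net aggregation may differ in detail.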
Authors
YANG Xiaoyu, LI Chao, CHEN Shunyao, LI Haoliang, YIN Guangqiang (Center for Public Security Technology, University of Electronic Science and Technology of China, Chengdu 611731, China)
Source
Computer Science (《计算机科学》), CSCD-indexed, Peking University core journal
2023, No. 4, pp. 141-148 (8 pages)
Funding
Shenzhen Science and Technology Program (JSGG20220301090405009).