Multi-Task Visual Semantic Embedding Network for Image-Text Retrieval

Abstract: Image-text retrieval aims to capture the semantic correspondence between images and texts, which serves as a foundation and crucial component in multi-modal recommendations, search systems, and online shopping. Existing mainstream methods primarily focus on modeling the association of image-text pairs while neglecting the advantageous impact of multi-task learning on image-text retrieval. To this end, a multi-task visual semantic embedding network (MVSEN) is proposed for image-text retrieval. Specifically, we design two auxiliary tasks, including text-text matching and multi-label classification, for semantic constraints to improve the generalization and robustness of visual semantic embedding from a training perspective. Besides, we present an intra- and inter-modality interaction scheme to learn discriminative visual and textual feature representations by facilitating information flow within and between modalities. Subsequently, we utilize multi-layer graph convolutional networks in a cascading manner to infer the correlation of image-text pairs. Experimental results show that MVSEN outperforms state-of-the-art methods on two publicly available datasets, Flickr30K and MSCOCO, with rSum improvements of 8.2% and 3.0%, respectively.
Authors: Xue-Yang Qin, Li-Shuang Li, Jing-Yao Tang, Fei Hao, Mei-Ling Ge, Guang-Yao Pang (秦雪洋; 李丽双; 唐婧尧; 郝飞; 盖枚岭; 庞光垚) (School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China; School of Computer Science, Shaanxi Normal University, Xi’an 710119, China; School of Computer Engineering, Weifang University, Weifang 261061, China; Guangxi Colleges and Universities Key Laboratory of Intelligent Industry Software, Wuzhou University, Wuzhou 543002, China)
Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition), indexed in SCIE/EI/CSCD, 2024, No. 4, pp. 811-826 (16 pages)
Funding: Supported by the National Natural Science Foundation of China under Grant No. 62076048.
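
To make the multi-task formulation in the abstract concrete, below is a minimal PyTorch sketch (not the authors' released code) of how a main image-text matching loss could be combined with the two auxiliary objectives the paper names. The hardest-negative triplet loss, the linear classification head, and the loss weights `w_tt` and `w_cls` are illustrative assumptions; the paper's intra-/inter-modality interaction scheme and cascaded graph convolutional networks are not reproduced here.

```python
# Hypothetical sketch of a multi-task visual semantic embedding objective:
# main image-text matching plus two auxiliary losses (text-text matching
# and multi-label classification), as described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


def triplet_ranking_loss(emb_a, emb_b, margin=0.2):
    """Hinge-based triplet loss with in-batch hardest negatives,
    a common choice for visual semantic embedding. Inputs are
    assumed to be L2-normalized, so the dot product is cosine similarity."""
    scores = emb_a @ emb_b.t()                       # (B, B) similarity matrix
    pos = scores.diag().view(-1, 1)                  # matched pairs on the diagonal
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    neg = scores.masked_fill(mask, -1e9)             # exclude positives from negatives
    # hardest negative per row (a -> b) and per column (b -> a)
    loss_ab = F.relu(margin + neg.max(dim=1).values.view(-1, 1) - pos)
    loss_ba = F.relu(margin + neg.max(dim=0).values.view(-1, 1) - pos)
    return (loss_ab + loss_ba).mean()


class MultiTaskVSELoss(nn.Module):
    """Hypothetical multi-task head: shared embeddings feed the main
    retrieval loss and two auxiliary objectives."""
    def __init__(self, dim=1024, num_labels=80):
        super().__init__()
        self.classifier = nn.Linear(dim, num_labels)  # multi-label head

    def forward(self, img_emb, txt_emb, para_txt_emb, labels,
                w_tt=0.5, w_cls=0.5):
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        para_txt_emb = F.normalize(para_txt_emb, dim=-1)

        # main task: image-text matching
        l_itm = triplet_ranking_loss(img_emb, txt_emb)
        # auxiliary task 1: text-text matching between paraphrase captions
        l_ttm = triplet_ranking_loss(txt_emb, para_txt_emb)
        # auxiliary task 2: multi-label classification on the image embedding
        l_cls = F.binary_cross_entropy_with_logits(
            self.classifier(img_emb), labels.float())
        return l_itm + w_tt * l_ttm + w_cls * l_cls
```

Under this reading, the two auxiliary losses act purely as training-time semantic constraints on the shared embedding space, so they add no cost at retrieval time; only the similarity scores between image and text embeddings are needed for inference.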