
DI-VTR:Dual inter-modal interaction model for video-text retrieval

Abstract: Video-text retrieval is a challenging task in multimodal information processing due to the semantic gap between modalities. However, most existing methods do not fully mine intra-modal interactions, such as the temporal correlation of video frames, which results in poor matching performance. Additionally, the imbalanced semantic information between videos and texts makes it difficult to align the two modalities. To this end, we propose a dual inter-modal interaction network for video-text retrieval, i.e., DI-VTR. To learn the intra-modal interaction of video frames, we design a contextual-related video encoder that obtains more fine-grained, content-oriented video representations. We also propose a dual inter-modal interaction module that accomplishes accurate multilingual alignment between the video and text modalities by introducing multilingual text to improve the representation ability of text semantic features. Extensive experiments on commonly used video-text retrieval datasets, including MSR-VTT, MSVD and VATEX, show that the proposed method achieves significantly improved performance over state-of-the-art methods.
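The two ideas the abstract highlights, intra-modal temporal interaction among video frames and a dual (bidirectional) inter-modal interaction between video and text, can be illustrated with a short sketch. The paper's implementation is not reproduced here; the following is a minimal PyTorch sketch of one plausible reading, where the class name, feature dimensions, mean pooling, and cosine scoring are all illustrative assumptions rather than the authors' published method.

```python
# A minimal sketch of dual inter-modal interaction for video-text retrieval:
# temporal self-attention over frame features (intra-modal interaction), then
# cross-attention in both directions (video->text and text->video) before
# computing a retrieval similarity matrix. Hypothetical design, not DI-VTR's
# actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualInterModalInteraction(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Intra-modal: models temporal correlation among frame features.
        self.temporal_attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        # Inter-modal: each modality attends to the other (the "dual" part).
        self.v_cross = nn.MultiheadAttention(dim, heads, batch_first=True)  # video queries text
        self.t_cross = nn.MultiheadAttention(dim, heads, batch_first=True)  # text queries video

    def forward(self, frame_feats, token_feats):
        # frame_feats: (B, n_frames, dim) pre-extracted frame embeddings
        # token_feats: (B, n_tokens, dim) text token embeddings
        v = self.temporal_attn(frame_feats)                    # contextualized frames
        v_aligned, _ = self.v_cross(v, token_feats, token_feats)
        t_aligned, _ = self.t_cross(token_feats, v, v)
        # Mean-pool each modality to one vector; score with cosine similarity.
        v_emb = F.normalize(v_aligned.mean(dim=1), dim=-1)
        t_emb = F.normalize(t_aligned.mean(dim=1), dim=-1)
        return v_emb @ t_emb.t()                               # (B, B) similarities


if __name__ == "__main__":
    model = DualInterModalInteraction()
    sims = model(torch.randn(4, 12, 512), torch.randn(4, 20, 512))
    print(sims.shape)  # torch.Size([4, 4]); diagonal entries = matched pairs
```

Under a contrastive objective such as InfoNCE over this similarity matrix, the diagonal entries would serve as positive video-text pairs; the multilingual-text component described in the abstract would feed additional text streams through the same interaction module.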
Source: Journal of Information and Intelligence, 2024, No. 5, pp. 388-403 (16 pages).
Funding: Supported by the Key Research and Development Program of Shaanxi (2023-YBGY-218), the National Natural Science Foundation of China under Grants 62372357 and 62201424, and the Fundamental Research Funds for the Central Universities (QTZX23072); also supported by the ISN State Key Laboratory.