
DI-VTR:Dual inter-modal interaction model for video-text retrieval

Abstract: Video-text retrieval is a challenging task in multimodal information processing due to the semantic gap between modalities. However, most existing methods do not fully mine intra-modal interactions, such as the temporal correlation of video frames, which results in poor matching performance. Additionally, the imbalanced semantic information between videos and texts makes it difficult to align the two modalities. To this end, we propose a dual inter-modal interaction network for video-text retrieval, i.e., DI-VTR. To learn the intra-modal interaction of video frames, we design a contextual-related video encoder that obtains more fine-grained, content-oriented video representations. We also propose a dual inter-modal interaction module that accomplishes accurate multilingual alignment between the video and text modalities by introducing multilingual text to improve the representation ability of text semantic features. Extensive experiments on commonly used video-text retrieval datasets, including MSR-VTT, MSVD and VATEX, show that the proposed method achieves significantly improved performance over state-of-the-art methods.
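The two ideas the abstract highlights, intra-modal temporal interaction among video frames and a dual (bidirectional) inter-modal interaction between video and text, can be illustrated with a short sketch. The paper's implementation is not reproduced here; the following is a minimal PyTorch sketch of one plausible reading, where the class name, feature dimensions, mean pooling, and cosine scoring are all illustrative assumptions rather than the authors' published method.

```python
# A minimal sketch of dual inter-modal interaction for video-text retrieval:
# temporal self-attention over frame features (intra-modal interaction), then
# cross-attention in both directions (video->text and text->video) before
# computing a retrieval similarity matrix. Hypothetical design, not DI-VTR's
# actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualInterModalInteraction(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Intra-modal: models temporal correlation among frame features.
        self.temporal_attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        # Inter-modal: each modality attends to the other (the "dual" part).
        self.v_cross = nn.MultiheadAttention(dim, heads, batch_first=True)  # video queries text
        self.t_cross = nn.MultiheadAttention(dim, heads, batch_first=True)  # text queries video

    def forward(self, frame_feats, token_feats):
        # frame_feats: (B, n_frames, dim) pre-extracted frame embeddings
        # token_feats: (B, n_tokens, dim) text token embeddings
        v = self.temporal_attn(frame_feats)                    # contextualized frames
        v_aligned, _ = self.v_cross(v, token_feats, token_feats)
        t_aligned, _ = self.t_cross(token_feats, v, v)
        # Mean-pool each modality to one vector; score with cosine similarity.
        v_emb = F.normalize(v_aligned.mean(dim=1), dim=-1)
        t_emb = F.normalize(t_aligned.mean(dim=1), dim=-1)
        return v_emb @ t_emb.t()                               # (B, B) similarities


if __name__ == "__main__":
    model = DualInterModalInteraction()
    sims = model(torch.randn(4, 12, 512), torch.randn(4, 20, 512))
    print(sims.shape)  # torch.Size([4, 4]); diagonal entries = matched pairs
```

Under a contrastive objective such as InfoNCE over this similarity matrix, the diagonal entries would serve as positive video-text pairs; the multilingual-text component described in the abstract would feed additional text streams through the same interaction module.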
Source: Journal of Information and Intelligence, 2024, No. 5, pp. 388-403 (16 pages).
Funding: Supported by the Key Research and Development Program of Shaanxi (2023-YBGY-218), the National Natural Science Foundation of China under Grants 62372357 and 62201424, and the Fundamental Research Funds for the Central Universities (QTZX23072); also supported by the ISN State Key Laboratory.