
Adequate alignment and interaction for cross-modal retrieval

Abstract

Background: Cross-modal retrieval has attracted widespread attention in many cross-media similarity search applications, particularly image-text retrieval in the fields of computer vision and natural language processing. Recently, visual and semantic embedding (VSE) learning has shown promising improvements in image-text retrieval tasks. Most existing VSE models employ two unrelated encoders to extract features and then use complex methods to contextualize and aggregate these features into holistic embeddings. Despite recent advances, existing approaches still suffer from two limitations: (1) without considering intermediate interactions and adequate alignment between different modalities, these models cannot guarantee the discriminative ability of representations; and (2) existing feature aggregators are susceptible to certain noisy regions, which may lead to unreasonable pooling coefficients and affect the quality of the final aggregated features.

Methods: To address these challenges, we propose a novel cross-modal retrieval model containing a well-designed alignment module and a novel multimodal fusion encoder that aims to learn the adequate alignment and interaction of aggregated features to effectively bridge the modality gap.

Results: Experiments on the Microsoft COCO and Flickr30k datasets demonstrated the superiority of our model over state-of-the-art methods.
Source: Virtual Reality & Intelligent Hardware (虚拟现实与智能硬件(中英文), EI indexed), 2023, Issue 6, pp. 509-522 (14 pages).
Funding: Supported by the National Natural Science Foundation of China (62172109, 62072118), the National Science Foundation of Guangdong Province (2022A1515010322), the Guangdong Basic and Applied Basic Research Foundation (2021B1515120010), and the Huangpu International Sci&Tech Cooperation Foundation of Guangzhou (2021GH12).
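
The abstract summarizes the generic VSE pipeline at a high level: two modality encoders extract sets of local features (image regions, text tokens), an aggregator pools them into holistic embeddings, and retrieval ranks image-text pairs by embedding similarity. The Python/PyTorch sketch below illustrates that generic pipeline with a simple attention-style pooling stage; all names, dimensions, and design choices here are hypothetical placeholders for illustration, not the alignment module or multimodal fusion encoder proposed in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    """Aggregate a set of local features into one embedding with learned pooling coefficients."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # scores each local feature

    def forward(self, feats):  # feats: (batch, n_regions, dim)
        weights = F.softmax(self.score(feats), dim=1)  # pooling coefficients over regions
        return (weights * feats).sum(dim=1)            # holistic embedding: (batch, dim)

class VSEModel(nn.Module):
    """Toy visual-semantic embedding model: project, pool, and L2-normalize each modality."""
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=1024):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        self.img_pool = AttentionPooling(embed_dim)
        self.txt_pool = AttentionPooling(embed_dim)

    def forward(self, img_feats, txt_feats):
        img = F.normalize(self.img_pool(self.img_proj(img_feats)), dim=-1)
        txt = F.normalize(self.txt_pool(self.txt_proj(txt_feats)), dim=-1)
        return img @ txt.t()  # cosine similarity matrix used for ranking

# Example usage with random placeholder features (4 images x 36 regions, 4 captions x 20 tokens).
model = VSEModel()
sim = model(torch.randn(4, 36, 2048), torch.randn(4, 20, 768))  # (4, 4) image-text similarities

As the abstract notes, a pooling stage like the one sketched above can assign unreasonable coefficients to noisy regions, which is the motivation for the paper's alignment and multimodal fusion components.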