期刊文献+
共找到1篇文章
< 1 >
每页显示 20 50 100
Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval
1
作者 Haoyu Lu Yuqi Huo +2 位作者 Mingyu Ding nanyi fei Zhiwu Lu 《Machine Intelligence Research》 EI CSCD 2023年第4期569-582,共14页
Cross-modal image-text retrieval is a fundamental task in bridging vision and language. It faces two main challenges that are typically not well addressed in previous works. 1) Generalizability: Existing methods often... Cross-modal image-text retrieval is a fundamental task in bridging vision and language. It faces two main challenges that are typically not well addressed in previous works. 1) Generalizability: Existing methods often assume a strong semantic correlation between each text-image pair, which are thus difficult to generalize to real-world scenarios where the weak correlation dominates. 2) Efficiency: Many latest works adopt the single-tower architecture with heavy detectors, which are inefficient during the inference stage because the costly computation needs to be repeated for each text-image pair. In this work, to overcome these two challenges, we propose a two-tower cross-modal contrastive learning (CMCL) framework. Specifically, we first devise a two-tower architecture, which enables a unified feature space for the text and image modalities to be directly compared with each other, alleviating the heavy computation during inference. We further introduce a simple yet effective module named multi-grid split (MGS) to learn fine-grained image features without using detectors. Last but not the least, we deploy a cross-modal contrastive loss on the global image/text features to learn their weak correlation and thus achieve high generalizability. To validate that our CMCL can be readily generalized to real-world scenarios, we construct a large multi-source image-text dataset called weak semantic correlation dataset (WSCD). Extensive experiments show that our CMCL outperforms the state-of-the-arts while being much more efficient. 展开更多
关键词 Image-text retrieval multimodal modeling contrastive learning weak correlation computer vision
原文传递
上一页 1 下一页 到第
使用帮助 返回顶部