Funding: This work was partially supported by the Chongqing Natural Science Foundation of China (Grant No. CSTB2022NSCQ-MSX1417), the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJZD-K202200513), the Chongqing Normal University Fund (Grant No. 22XLB003), the Chongqing Education Science Planning Project (Grant No. 2021-GX-320), and the Humanities and Social Sciences Project of Chongqing Education Commission of China (Grant No. 22SKGH100).
Abstract: In recent years, cross-modal hash retrieval has become a popular research field because of its advantages of high efficiency and low storage. Cross-modal retrieval technology can be applied to search engines, cross-modal medical processing, etc. The existing mainstream method is to use a multi-label matching paradigm to finish retrieval tasks. However, such methods do not use the fine-grained information in multi-modal data, which may lead to suboptimal results. To avoid cross-modal matching degenerating into label matching, this paper proposes an end-to-end fine-grained cross-modal hash retrieval method that focuses on the fine-grained semantic information of multi-modal data. First, the method refines the image features and no longer uses multiple labels to represent text features, instead processing text with BERT. Second, it uses the reasoning capability of the transformer encoder to generate global fine-grained features. Finally, in order to better judge the effect of the fine-grained model, this paper uses datasets from the image-text matching field instead of the traditional label-matching datasets. We experiment on the Microsoft COCO (MS-COCO) and Flickr30K datasets and compare the method with previous classical methods. The experimental results show that this method obtains more advanced results in the cross-modal hash retrieval field.
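To make the pipeline described above concrete, the following is a minimal sketch of the general idea (not the paper's released code): token- or region-level features, such as BERT outputs, are fused by a transformer encoder into a global representation and then relaxed into hash codes with tanh. All layer sizes and names are illustrative assumptions.

```python
# Illustrative sketch only: fuse fine-grained token/region features with a
# transformer encoder, then project to relaxed hash codes.
import torch
import torch.nn as nn

class FineGrainedHashNet(nn.Module):
    def __init__(self, feat_dim=768, hash_bits=64, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)  # global reasoning over all positions
        self.hash_head = nn.Linear(feat_dim, hash_bits)                     # projection into hash space

    def forward(self, tokens):  # tokens: (batch, seq_len, feat_dim), e.g. BERT token or image-region features
        fused = self.encoder(tokens)                    # fine-grained interactions across positions
        global_feat = fused.mean(dim=1)                 # aggregate into one global fine-grained feature
        return torch.tanh(self.hash_head(global_feat))  # relaxed codes in (-1, 1)

codes = FineGrainedHashNet()(torch.randn(4, 32, 768))   # toy batch: 4 sequences of 32 tokens
binary_codes = torch.sign(codes)                        # binarize at indexing/retrieval time
```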
Funding: This work was partially supported by the Science and Technology Project of Chongqing Education Commission of China (KJZD-K202200513), the National Natural Science Foundation of China (61370205), the Chongqing Normal University Fund (22XLB003), and the Chongqing Education Science Planning Project (2021-GX-320).
Abstract: In recent years, the development of deep learning has further improved hash retrieval technology. Most existing hashing methods use Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to process image and text information, respectively. This subjects images or texts to local constraints, and inherent label matching cannot capture fine-grained information, often leading to suboptimal results. Driven by the development of the transformer model, we propose a framework called ViT2CMH, based mainly on the Vision Transformer rather than CNNs or RNNs, to handle deep cross-modal hashing tasks. Specifically, we use a BERT network to extract text features and use the Vision Transformer as the image network of the model. Finally, the features are transformed into hash codes for efficient and fast retrieval. We conduct extensive experiments on Microsoft COCO (MS-COCO) and Flickr30K, comparing with baselines from both hashing methods and image-text matching methods, and show that our method achieves better performance.
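The two-tower design described above can be sketched roughly as follows; this is an illustrative outline, not the ViT2CMH implementation, and the pretrained checkpoint names are only examples.

```python
# Minimal two-tower sketch: a ViT image branch and a BERT text branch, each
# projected into the same hash space (illustrative; not the paper's code).
import torch
import torch.nn as nn
from transformers import BertModel, ViTModel  # weights download on first use

class TwoTowerHasher(nn.Module):
    def __init__(self, hash_bits=64):
        super().__init__()
        self.img_enc = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.txt_enc = BertModel.from_pretrained("bert-base-uncased")
        self.img_head = nn.Linear(self.img_enc.config.hidden_size, hash_bits)
        self.txt_head = nn.Linear(self.txt_enc.config.hidden_size, hash_bits)

    def forward(self, pixel_values, input_ids, attention_mask):
        img_cls = self.img_enc(pixel_values=pixel_values).last_hidden_state[:, 0]  # ViT [CLS] token
        txt_cls = self.txt_enc(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state[:, 0]  # BERT [CLS] token
        # tanh relaxation of binary codes; apply sign() when building the retrieval index
        return torch.tanh(self.img_head(img_cls)), torch.tanh(self.txt_head(txt_cls))
```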
Funding: Supported by the National Natural Science Foundation of China (62172109, 62072118), the National Science Foundation of Guangdong Province (2022A1515010322), the Guangdong Basic and Applied Basic Research Foundation (2021B1515120010), and the Huangpu International Sci & Tech Cooperation Foundation of Guangzhou (2021GH12).
Abstract: Background Cross-modal retrieval has attracted widespread attention in many cross-media similarity search applications, particularly image-text retrieval in the fields of computer vision and natural language processing. Recently, visual and semantic embedding (VSE) learning has shown promising improvements in image-text retrieval tasks. Most existing VSE models employ two unrelated encoders to extract features and then use complex methods to contextualize and aggregate these features into holistic embeddings. Despite recent advances, existing approaches still suffer from two limitations: (1) without considering intermediate interactions and adequate alignment between different modalities, these models cannot guarantee the discriminative ability of representations; and (2) existing feature aggregators are susceptible to certain noisy regions, which may lead to unreasonable pooling coefficients and affect the quality of the final aggregated features. Methods To address these challenges, we propose a novel cross-modal retrieval model containing a well-designed alignment module and a novel multimodal fusion encoder that aims to learn the adequate alignment and interaction of aggregated features to effectively bridge the modality gap. Results Experiments on the Microsoft COCO and Flickr30k datasets demonstrated the superiority of our model over state-of-the-art methods.
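For background, the standard VSE training objective in this line of work (a hinge-based triplet ranking loss with the hardest in-batch negatives, as popularized by VSE++) can be written compactly; this is generic background code, not the proposed model.

```python
# Generic VSE-style objective (background, not this paper's model): hinge triplet
# loss over an in-batch image-text similarity matrix with hardest negatives.
import torch

def vse_triplet_loss(img_emb, txt_emb, margin=0.2):
    img = torch.nn.functional.normalize(img_emb, dim=1)
    txt = torch.nn.functional.normalize(txt_emb, dim=1)
    sim = img @ txt.t()                              # (B, B) cosine similarities; diagonal = matched pairs
    pos = sim.diag().view(-1, 1)
    cost_t = (margin + sim - pos).clamp(min=0)       # image -> text ranking violations
    cost_i = (margin + sim - pos.t()).clamp(min=0)   # text -> image ranking violations
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_t = cost_t.masked_fill(mask, 0)
    cost_i = cost_i.masked_fill(mask, 0)
    return cost_t.max(dim=1)[0].mean() + cost_i.max(dim=0)[0].mean()  # hardest negatives only

loss = vse_triplet_loss(torch.randn(8, 512), torch.randn(8, 512))     # toy usage
```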
Abstract: In the era of big data rich in We Media, single-modality retrieval systems can no longer meet people's demand for information retrieval. This paper proposes a new solution to the problem of feature extraction and unified mapping of different modalities: a Cross-Modal Hashing retrieval algorithm based on a Deep Residual Network (CMHR-DRN). The model is constructed in two stages. The first stage extracts features from the different modalities: a Deep Residual Network (DRN) extracts the image features, TF-IDF combined with a fully connected network extracts the text features, and the resulting image and text features serve as the input of the second stage. In the second stage, hash functions are learned by supervised learning to map the image and text features into a common binary Hamming space. During the mapping, the distance relations of the original feature spaces and of the common feature space are kept as consistent as possible to improve the accuracy of cross-modal retrieval. In training the model, adaptive moment estimation (Adam) is used to compute an adaptive learning rate for each parameter, and stochastic gradient descent (SGD) is used to minimize the loss function. The whole training process is carried out on the Caffe deep learning framework. Experiments show that the proposed CMHR-DRN algorithm achieves better retrieval performance and stronger advantages than other cross-modal algorithms such as CMFH, CMDN, and CMSSH.
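A rough sketch of the two-stage idea follows (written with PyTorch and scikit-learn for brevity, whereas the paper reports training on Caffe); the feature dimensions, code length, and loss are illustrative assumptions rather than the authors' exact settings.

```python
# Stage 1: deep image features (residual network) + TF-IDF text features.
# Stage 2: learned hash projections into a shared binary Hamming space.
import torch
import torch.nn as nn
import torchvision.models as models
from sklearn.feature_extraction.text import TfidfVectorizer

resnet = models.resnet50(weights=None)          # image branch (pretrained weights in practice)
resnet.fc = nn.Identity()                       # expose 2048-d pooled features

texts = ["a dog runs on the grass", "a man rides a bicycle"]
tfidf = TfidfVectorizer(max_features=512)
txt_feat = torch.tensor(tfidf.fit_transform(texts).toarray(), dtype=torch.float32)  # text branch

img_hash = nn.Linear(2048, 64)                  # stage 2: project each modality to 64-bit codes
txt_hash = nn.Linear(txt_feat.size(1), 64)

img_feat = resnet(torch.randn(2, 3, 224, 224))  # dummy images
img_codes = torch.tanh(img_hash(img_feat))      # tanh relaxation; sign() gives binary codes
txt_codes = torch.tanh(txt_hash(txt_feat))

# toy similarity-preserving loss: matched image-text pairs should agree in code space
loss = ((img_codes - txt_codes) ** 2).sum(dim=1).mean()
optimizer = torch.optim.Adam(list(img_hash.parameters()) + list(txt_hash.parameters()), lr=1e-3)
loss.backward()
optimizer.step()
```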
Funding: Supported by the National Natural Science Foundation of China (No. 12172186).
Abstract: Cross-modal retrieval tries to achieve mutual retrieval between modalities by establishing consistent alignment of different modal data. Currently, many cross-modal retrieval methods have been proposed and have achieved excellent results; however, they are trained with clean cross-modal pairs, which are semantically matched but costly to obtain, compared with easily available data with noisy alignment (i.e., paired but mismatched in semantics). When these methods are trained with noise-aligned data, their performance degrades dramatically. Therefore, we propose robust cross-modal retrieval with alignment refurbishment (RCAR), which significantly reduces the impact of noise on the model. Specifically, RCAR first conducts multi-task learning to slow down overfitting to the noise and make the data separable. Then, RCAR uses a two-component beta-mixture model to divide the pairs into clean and noisy alignments and refurbishes the alignment labels according to the posterior probability of the noise-alignment component. In addition, we define partial and complete noise in the noise-alignment paradigm. Experimental results show that, compared with popular cross-modal retrieval methods, RCAR achieves more robust performance under both types of noise.
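The clean/noisy split step can be illustrated with a small beta-mixture fit over per-pair losses; this sketch only shows the general mechanism (EM with method-of-moments updates) and is not the RCAR implementation.

```python
# Fit a two-component beta mixture to per-pair losses in (0, 1), then use the
# posterior of the low-loss component to decide which alignments to trust.
import numpy as np
from scipy.stats import beta

def fit_beta_mixture(losses, n_iter=20, eps=1e-4):
    x = np.clip(losses, eps, 1 - eps)
    # initialize responsibilities from a median split: component 0 = low loss ("clean")
    resp = np.stack([x < np.median(x), x >= np.median(x)], axis=1).astype(float)
    for _ in range(n_iter):
        params = []
        for k in range(2):                              # M-step: weighted method of moments
            w = resp[:, k] / resp[:, k].sum()
            m = np.sum(w * x)
            v = np.sum(w * (x - m) ** 2)
            common = m * (1 - m) / max(v, eps) - 1
            params.append((max(m * common, eps), max((1 - m) * common, eps)))
        pi = resp.mean(axis=0)                          # mixing weights
        like = np.stack([pi[k] * beta.pdf(x, *params[k]) for k in range(2)], axis=1)
        resp = like / (like.sum(axis=1, keepdims=True) + 1e-12)   # E-step: posteriors
    return resp

# toy losses: 900 well-aligned pairs (low loss) and 100 mismatched pairs (high loss)
losses = np.concatenate([np.random.beta(2, 8, 900), np.random.beta(8, 2, 100)])
posterior = fit_beta_mixture(losses)
is_clean = posterior[:, 0] > 0.5        # pairs assigned to the low-loss component are treated as clean
```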
Funding: Supported by the Key Research and Development Program of Shaanxi (2023-YBGY-218), the National Natural Science Foundation of China under Grants 62372357 and 62201424, and the Fundamental Research Funds for the Central Universities (QTZX23072), and also supported by the ISN State Key Laboratory.
Abstract: Video-text retrieval is a challenging task for multimodal information processing due to the semantic gap between modalities. However, most existing methods do not fully mine intra-modal interactions, such as the temporal correlation of video frames, which results in poor matching performance. Additionally, the imbalanced semantic information between videos and texts also makes it difficult to align the two modalities. To this end, we propose a dual inter-modal interaction network for video-text retrieval, i.e., DI-vTR. To learn the intra-modal interaction of video frames, we design a context-related video encoder to obtain more fine-grained, content-oriented video representations. We also propose a dual inter-modal interaction module to accomplish accurate multilingual alignment between the video and text modalities by introducing multilingual text to improve the representation ability of text semantic features. Extensive experimental results on commonly used video-text retrieval datasets, including MSR-VTT, MSVD, and VATEX, show that the proposed method achieves significantly improved performance compared with state-of-the-art methods.
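The intra-modal (temporal) interaction idea can be illustrated by a small temporal transformer over frame features; this is a generic sketch, not the DI-vTR architecture, and all dimensions are illustrative.

```python
# Model temporal correlation among video frames: frame-level features pass through
# a temporal transformer before being pooled into a video-level representation.
import torch
import torch.nn as nn

class TemporalVideoEncoder(nn.Module):
    def __init__(self, dim=512, num_layers=2, max_frames=64):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, max_frames, dim))   # learnable frame-position embeddings
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):            # (batch, num_frames, dim), e.g. from a frozen image backbone
        x = frame_feats + self.pos[:, : frame_feats.size(1)]
        x = self.temporal(x)                   # frames attend to each other (intra-modal interaction)
        return x.mean(dim=1)                   # pooled video embedding for matching against text

video_emb = TemporalVideoEncoder()(torch.randn(2, 16, 512))   # toy clip of 16 frames
```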
Funding: Supported by the National Natural Science Foundation of China under Grant No. 62076048.
Abstract: Image-text retrieval aims to capture the semantic correspondence between images and texts, which serves as a foundation and crucial component in multi-modal recommendation, search systems, and online shopping. Existing mainstream methods primarily focus on modeling the association of image-text pairs while neglecting the advantageous impact of multi-task learning on image-text retrieval. To this end, a multi-task visual semantic embedding network (MVSEN) is proposed for image-text retrieval. Specifically, we design two auxiliary tasks, text-text matching and multi-label classification, as semantic constraints to improve the generalization and robustness of visual semantic embedding from a training perspective. Besides, we present an intra- and inter-modality interaction scheme to learn discriminative visual and textual feature representations by facilitating information flow within and between modalities. Subsequently, we utilize multi-layer graph convolutional networks in a cascading manner to infer the correlation of image-text pairs. Experimental results show that MVSEN outperforms state-of-the-art methods on two publicly available datasets, Flickr30K and MSCOCO, with rSum improvements of 8.2% and 3.0%, respectively.
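For reference, the rSum metric cited above is the sum of Recall@1, Recall@5, and Recall@10 in both retrieval directions; a minimal evaluation sketch (assuming one matching caption per image, which simplifies the usual MS-COCO protocol) is given below.

```python
# rSum = sum of R@{1,5,10} for image->text and text->image retrieval.
import numpy as np

def recall_at_k(sim, k):
    # sim[i, j] = similarity of image i and text j; the ground-truth match is the diagonal
    ranks = (-sim).argsort(axis=1)
    return float(np.mean([int(i in ranks[i, :k]) for i in range(sim.shape[0])])) * 100

def rsum(sim):
    i2t = [recall_at_k(sim, k) for k in (1, 5, 10)]
    t2i = [recall_at_k(sim.T, k) for k in (1, 5, 10)]
    return sum(i2t) + sum(t2i)

sim = np.random.rand(100, 100) + 2 * np.eye(100)   # toy similarities with a strong diagonal
print(rsum(sim))                                   # perfect retrieval on this toy matrix gives 600
```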
Funding: Supported by the National Natural Science Foundation of China (No. 62072462), the National Key R&D Program of China (No. 2020AAA0108600), and the Large-scale Pretraining Program 468 of the Beijing Academy of Artificial Intelligence (BAAI).
Abstract: Multimodal pretraining has achieved convincing results in various downstream tasks in recent years. However, since the majority of existing works construct models based on English, their applications are limited by language. In this work, we address this issue by developing models with multimodal and multilingual capabilities. We explore two types of methods to extend multimodal pretraining models from monolingual to multilingual. Specifically, we propose a pretraining-based model named multilingual multimodal pretraining (MLMM), and two generalization-based models named multilingual CLIP (M-CLIP) and multilingual acquisition (MLA). In addition, we further extend the generalization-based models to incorporate the audio modality and develop the multilingual CLIP for vision, language, and audio (CLIP4VLA). Our models achieve state-of-the-art performance on multilingual vision-text retrieval, visual question answering, and image captioning benchmarks. Based on the experimental results, we discuss the pros and cons of the two types of models and their potential practical applications.
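One common generalization-based recipe of this kind distills a frozen English text encoder into a multilingual student on translated captions; the following sketch illustrates that general idea and is not necessarily the exact MLMM/M-CLIP/MLA training procedure.

```python
# Distill a frozen English CLIP-style text encoder into a multilingual student so
# that translations land near the original English embedding (illustrative sketch).
import torch
import torch.nn as nn

class MultilingualStudent(nn.Module):
    def __init__(self, vocab_size=250_000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)     # stand-in for a multilingual transformer (e.g. XLM-R)
        self.proj = nn.Linear(dim, dim)                # map into the teacher's embedding space

    def forward(self, token_ids):
        return self.proj(self.embed(token_ids).mean(dim=1))

student = MultilingualStudent()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

# one training step on a parallel (English, translation) caption pair:
teacher_emb = torch.randn(8, 512)                      # frozen CLIP text embeddings of the English captions
translated_ids = torch.randint(0, 250_000, (8, 32))    # token ids of the translated captions
loss = nn.functional.mse_loss(student(translated_ids), teacher_emb)
loss.backward()
optimizer.step()
```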