Funding: Supported by the Science and Technology Research Project of Jiangxi Education Department (Grant No. GJJ2203306).
Abstract: Multimodal sentiment analysis is an essential area of research in artificial intelligence that combines multiple modalities, such as text and image, to accurately assess sentiment. However, conventional approaches that rely on unimodal pre-trained models for feature extraction from each modality often overlook the intrinsic connections of semantic information between modalities. This limitation stems from their training on unimodal data and necessitates the use of complex fusion mechanisms for sentiment analysis. In this study, we present a novel approach that combines a vision-language pre-trained model with a proposed multimodal contrastive learning method. Our approach harnesses the power of transfer learning by utilizing a vision-language pre-trained model to extract both visual and textual representations in a unified framework. We employ a Transformer architecture to integrate these representations, thereby enabling the capture of rich semantic information in image-text pairs. To further enhance the representation learning of these pairs, we introduce our proposed multimodal contrastive learning method, which leads to improved performance in sentiment analysis tasks. Our approach is evaluated through extensive experiments on two publicly accessible datasets, where we demonstrate its effectiveness. We achieve a significant improvement in sentiment analysis accuracy, indicating the superiority of our approach over existing techniques. These results highlight the potential of multimodal sentiment analysis and underscore the importance of considering the intrinsic semantic connections between modalities for accurate sentiment assessment.
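To make the described pipeline more concrete, the following is a minimal sketch of how such an architecture could be wired together: visual and textual token features from a vision-language pre-trained backbone are fused by a Transformer encoder for sentiment classification, while an InfoNCE-style contrastive term pulls matched image-text pairs together. The module names, pooling choices, dimensions, and loss weighting here are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch only; the exact model in the paper is not specified here.
# Assumes per-token image and text features already extracted by a frozen
# vision-language pre-trained backbone (e.g. a CLIP-style encoder).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalSentimentModel(nn.Module):
    def __init__(self, feat_dim=512, num_classes=3, num_layers=2, num_heads=8, temperature=0.07):
        super().__init__()
        # Fusion Transformer over the concatenated image/text token sequence.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.temperature = temperature

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, Ni, D), txt_tokens: (B, Nt, D) from the VLP backbone.
        fused = self.fusion(torch.cat([img_tokens, txt_tokens], dim=1))
        pooled = fused.mean(dim=1)                      # simple mean pooling
        logits = self.classifier(pooled)
        # Pooled per-modality embeddings for the contrastive objective.
        img_emb = F.normalize(img_tokens.mean(dim=1), dim=-1)
        txt_emb = F.normalize(txt_tokens.mean(dim=1), dim=-1)
        return logits, img_emb, txt_emb

    def contrastive_loss(self, img_emb, txt_emb):
        # Symmetric InfoNCE: matched image-text pairs are positives,
        # all other pairs in the batch serve as negatives.
        sim = img_emb @ txt_emb.t() / self.temperature  # (B, B) similarity matrix
        targets = torch.arange(sim.size(0), device=sim.device)
        return (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2


# Toy usage with random features standing in for backbone outputs.
model = MultimodalSentimentModel()
img_tokens = torch.randn(4, 49, 512)   # e.g. 7x7 image patches
txt_tokens = torch.randn(4, 32, 512)   # e.g. 32 text tokens
labels = torch.randint(0, 3, (4,))
logits, img_emb, txt_emb = model(img_tokens, txt_tokens)
loss = F.cross_entropy(logits, labels) + 0.5 * model.contrastive_loss(img_emb, txt_emb)
loss.backward()
```

The contrastive term is weighted against the classification loss here purely for illustration; in practice the weighting, the pooling strategy, and whether the backbone is frozen would be tuned on the target datasets.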
Funding: Supported by the Key Research Program of the Chinese Academy of Sciences (No. ZDBSSSW-JSC006) and the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA 27030300).
Abstract: In the past few years, the emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) into a new era. Substantial works have shown that they benefit downstream uni-modal tasks and avoid training a new model from scratch. Can such pre-trained models also be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training. To give readers a better overall grasp of VLP, we first review its recent advances in five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. Then, we summarize specific VLP models in detail. Finally, we discuss the new frontiers in VLP. To the best of our knowledge, this is the first survey focused on VLP. We hope that this survey can shed light on future research in the VLP field.
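Among the pre-training objectives such a survey typically covers, image-text matching (ITM) is a representative example: a binary head predicts whether an image and a caption actually belong together. The sketch below is a minimal, generic formulation of that objective; the head design, feature dimensions, and negative-sampling scheme are illustrative assumptions and are not tied to any specific VLP model discussed in the survey.

```python
# Minimal sketch of an image-text matching (ITM) pre-training objective.
# Negative pairs are formed by rolling the texts within the batch; real VLP
# models often use harder negative mining, which is omitted here for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ITMHead(nn.Module):
    def __init__(self, feat_dim=768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, 2),          # matched vs. mismatched
        )

    def forward(self, img_emb, txt_emb):
        # img_emb, txt_emb: (B, D) pooled features from a vision-language encoder.
        return self.scorer(torch.cat([img_emb, txt_emb], dim=-1))


def itm_loss(head, img_emb, txt_emb):
    batch = img_emb.size(0)
    # Positives: aligned image/text pairs; negatives: texts rolled by one position.
    neg_txt = torch.roll(txt_emb, shifts=1, dims=0)
    logits = torch.cat([head(img_emb, txt_emb), head(img_emb, neg_txt)], dim=0)
    labels = torch.cat([torch.ones(batch, dtype=torch.long),
                        torch.zeros(batch, dtype=torch.long)])
    return F.cross_entropy(logits, labels)


# Toy usage with random pooled features standing in for encoder outputs.
head = ITMHead()
img_emb, txt_emb = torch.randn(8, 768), torch.randn(8, 768)
loss = itm_loss(head, img_emb, txt_emb)
loss.backward()
```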
Funding: Supported in part by the National Natural Science Foundation of China (No. 61831005).
Abstract: With the significant breakthroughs in single-modal deep learning tasks, more and more works have begun to focus on multi-modal tasks. Multi-modal tasks usually involve more than one modality, where a modality represents a type of behavior or state. Common multi-modal information includes vision, hearing, language, touch, and smell. Vision and language are two of the most common modalities in human daily life, and many typical multi-modal tasks focus on these two modalities, such as visual captioning and visual grounding. In this paper, we conduct in-depth research on typical vision-and-language tasks from the perspectives of generation, analysis, and reasoning. First, we analyze and summarize the typical tasks and several classical methods, organizing them according to their different algorithmic concerns, and we further discuss the frequently used datasets and metrics. Then, we briefly summarize several variant and cutting-edge tasks to build a more comprehensive framework of vision-and-language multi-modal tasks. Finally, we discuss the development of pre-training-related research and provide an outlook on future research. We hope this survey can help relevant researchers understand the latest progress, existing problems, and exploration directions of vision-and-language multi-modal tasks, and provide guidance for future research.