Funding: Supported by STI 2030-Major Projects (2021ZD0200400), the National Natural Science Foundation of China (62276233 and 62072405), and the Key Research Project of Zhejiang Province (2023C01048).
Abstract: Multimodal sentiment analysis uses multimodal data such as text, facial expressions, and voice to detect people's attitudes. With the advent of distributed data collection and annotation, such multimodal data can be easily obtained and shared. However, discrepancies in annotator expertise and lax quality control can introduce noisy labels. Recent research shows that deep neural networks (DNNs) overfit noisy labels, which degrades their performance. To address this challenging problem, we present a Multimodal Robust Meta-Learning framework (MRML) for multimodal sentiment analysis that resists noisy labels and correlates distinct modalities simultaneously. Specifically, we propose a two-layer fusion net that deeply fuses the modalities and improves the quality of the multimodal features used for label correction and network training. In addition, a multiple meta-learner (label corrector) strategy is proposed to strengthen label correction and prevent the model from overfitting to noisy labels. Experiments on three popular multimodal datasets, with comparisons against four baselines, verify the superiority of our method.
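For illustration only, the following is a minimal sketch, under our own assumptions, of how a two-layer fusion net and an ensemble of label correctors could be wired together; the class names, dimensions, and corrector design are hypothetical and not the authors' implementation.

```python
# Hedged sketch (not the MRML code): a two-layer fusion of three modality
# features plus several small "meta-learner" networks that each propose a
# corrected soft label; their outputs are averaged.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerFusion(nn.Module):
    def __init__(self, dims, hidden, fused):
        super().__init__()
        # layer 1: project each modality into a shared space
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        # layer 2: fuse the concatenated projections
        self.fuse = nn.Linear(hidden * len(dims), fused)

    def forward(self, feats):                          # list of (B, d_i) tensors
        h = [torch.relu(p(x)) for p, x in zip(self.proj, feats)]
        return torch.relu(self.fuse(torch.cat(h, dim=-1)))

class LabelCorrectorEnsemble(nn.Module):
    """Each corrector maps fused features + noisy one-hot labels to a
    corrected label distribution; the ensemble averages them."""
    def __init__(self, fused, n_classes, n_correctors=3):
        super().__init__()
        self.correctors = nn.ModuleList(
            nn.Sequential(nn.Linear(fused + n_classes, 64), nn.ReLU(),
                          nn.Linear(64, n_classes))
            for _ in range(n_correctors))

    def forward(self, fused_feat, noisy_onehot):
        inp = torch.cat([fused_feat, noisy_onehot], dim=-1)
        probs = [F.softmax(c(inp), dim=-1) for c in self.correctors]
        return torch.stack(probs).mean(dim=0)          # corrected soft labels

# Usage: the main classifier would be trained on the corrected soft labels.
text, audio, vision = torch.randn(8, 300), torch.randn(8, 74), torch.randn(8, 35)
fusion = TwoLayerFusion([300, 74, 35], hidden=128, fused=64)
corrector = LabelCorrectorEnsemble(fused=64, n_classes=3)
fused = fusion([text, audio, vision])
soft_labels = corrector(fused, F.one_hot(torch.randint(0, 3, (8,)), 3).float())
```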
Funding: This work is supported by the National Natural Science Foundation of China under contracts No. 71774084 and 72274096, and by the National Social Science Fund of China under contracts No. 16ZDA224 and 17ZDA291.
Abstract: Purpose: Public opinions during public emergencies nowadays involve not only textual content but also images. However, existing works focus mainly on textual content, do not achieve satisfactory sentiment-analysis accuracy, and lack a combination of multimodal content. In this paper, we propose to combine the texts and images generated on social media to perform sentiment analysis. Design/methodology/approach: We propose a Deep Multimodal Fusion Model (DMFM) that combines textual and visual sentiment analysis. We first train a word2vec model on a large-scale public emergency corpus to obtain semantically rich word vectors as the input of textual sentiment analysis, and a BiLSTM is employed to generate encoded textual embeddings. To fully exploit the visual information in images, a modified pretrained VGG16-based sentiment analysis network is used with the best-performing fine-tuning strategy. A multimodal fusion method then fuses the textual and visual embeddings to produce predicted labels. Findings: We performed extensive experiments on Weibo and Twitter public emergency datasets to evaluate the proposed model. The results demonstrate that the DMFM achieves higher accuracy than the baseline models, and that introducing images boosts sentiment-analysis performance during public emergencies. Research limitations: In the future, we will test our model on larger datasets and consider better ways to learn the fused multimodal information. Practical implications: We build an efficient multimodal sentiment analysis model for social media content during public emergencies. Originality/value: We consider the images posted by online users on social platforms during public emergencies. The proposed method offers a new scope for sentiment analysis during public emergencies and can support government decision-making when formulating policies.
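As a rough illustration of the pipeline described above, the sketch below combines a BiLSTM text branch with a VGG16-based visual branch and concatenates their embeddings for a joint classifier; the class names and dimensions are assumptions, not the DMFM implementation.

```python
# Hedged sketch (assumptions, not the paper's code): BiLSTM text encoder +
# VGG16 visual encoder, fused by concatenation into a sentiment classifier.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class TextBranch(nn.Module):
    def __init__(self, vocab, emb_dim=300, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb_dim)        # would be initialized from word2vec
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, tokens):                          # (B, T) token ids
        out, _ = self.lstm(self.emb(tokens))
        return out.mean(dim=1)                          # (B, 2*hidden)

class VisualBranch(nn.Module):
    def __init__(self, out_dim=256):
        super().__init__()
        backbone = vgg16(weights=None)                  # pretrained weights in practice
        backbone.classifier[6] = nn.Linear(4096, out_dim)
        self.net = backbone

    def forward(self, images):                          # (B, 3, 224, 224)
        return self.net(images)

class DMFMSketch(nn.Module):
    def __init__(self, vocab, n_classes=3):
        super().__init__()
        self.text, self.vis = TextBranch(vocab), VisualBranch()
        self.head = nn.Linear(256 + 256, n_classes)

    def forward(self, tokens, images):
        return self.head(torch.cat([self.text(tokens), self.vis(images)], dim=-1))

model = DMFMSketch(vocab=20000)
logits = model(torch.randint(0, 20000, (2, 30)), torch.randn(2, 3, 224, 224))
```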
Funding: Supported by the Science and Technology Research Project of the Jiangxi Education Department, Grant No. GJJ2203306.
Abstract: Multimodal sentiment analysis is an essential area of research in artificial intelligence that combines multiple modalities, such as text and image, to assess sentiment accurately. However, conventional approaches that rely on unimodal pre-trained models for feature extraction from each modality often overlook the intrinsic semantic connections between modalities. This limitation stems from their training on unimodal data and necessitates complex fusion mechanisms for sentiment analysis. In this study, we present a novel approach that combines a vision-language pre-trained model with a proposed multimodal contrastive learning method. Our approach harnesses transfer learning by using a vision-language pre-trained model to extract visual and textual representations in a unified framework, and employs a Transformer architecture to integrate these representations, capturing the rich semantic information in image-text pairs. To further enhance the representation learning of these pairs, we introduce a multimodal contrastive learning method that improves performance on sentiment analysis tasks. Extensive experiments on two publicly accessible datasets demonstrate the effectiveness of the approach: we achieve a significant improvement in sentiment analysis accuracy over existing techniques. These results highlight the potential of multimodal sentiment analysis and underscore the importance of modeling the intrinsic semantic connections between modalities for accurate sentiment assessment.
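The abstract does not give the exact objective; as a minimal sketch, a standard InfoNCE-style contrastive loss over paired image and text embeddings illustrates the kind of multimodal contrastive learning described, with the function name and temperature chosen here for illustration.

```python
# Hedged sketch (not the authors' exact objective): symmetric InfoNCE-style
# contrastive loss that pulls matched image-text pairs together and pushes
# mismatched pairs apart within a batch.
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings of matched image-text pairs."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # symmetric loss: image-to-text and text-to-image retrieval
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = multimodal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```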
Abstract: Targeted multimodal sentiment classification (TMSC) aims to identify the sentiment polarity of a target mentioned in a multimodal post. Most current studies map the image and the text to a high-dimensional space to obtain and fuse implicit representations, ignoring the rich semantic information contained in images and the contribution of the visual modality to the fused multimodal representation, both of which can influence TMSC results. This paper proposes a general model, Improving Targeted Multimodal Sentiment Classification with Semantic Description of Images (ITMSC), to tackle these issues and improve the accuracy of multimodal sentiment analysis. Specifically, the ITMSC model automatically adjusts the contribution of images in the fused representation by exploiting semantic descriptions of images and text similarity relations. Further, we propose a target-based attention module to capture target-text relevance, an image-based attention module to capture image-text relevance, and a target-image matching module built on the two to properly align the target with the image so that fine-grained semantic information can be extracted. Experimental results demonstrate that our model achieves performance comparable to several state-of-the-art approaches on two multimodal sentiment datasets. Our findings indicate that incorporating semantic descriptions of images can deepen the understanding of multimodal content and improve sentiment analysis performance.
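One plausible reading of these modules, sketched below under our own assumptions, uses cross-attention for target-text and image-description-text relevance and gates the visual contribution by the similarity between the image's semantic description and the text; the module names and gating are hypothetical, not the ITMSC code.

```python
# Hedged sketch (an assumption about the mechanism, not the ITMSC code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelevanceModules(nn.Module):
    def __init__(self, d=256, heads=4, n_classes=3):
        super().__init__()
        self.target_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.head = nn.Linear(2 * d, n_classes)

    def forward(self, target, text, caption):
        # target: (B, 1, d) target embedding; text: (B, T, d) sentence tokens
        # caption: (B, 1, d) embedding of the image's semantic description
        t_ctx, _ = self.target_attn(target, text, text)      # target-text relevance
        i_ctx, _ = self.image_attn(caption, text, text)      # image-text relevance
        # gate the visual contribution by caption/text similarity
        sim = F.cosine_similarity(caption.squeeze(1), text.mean(dim=1), dim=-1)
        i_ctx = i_ctx * sim.sigmoid().unsqueeze(-1).unsqueeze(-1)
        return self.head(torch.cat([t_ctx, i_ctx], dim=-1).squeeze(1))

m = RelevanceModules()
logits = m(torch.randn(2, 1, 256), torch.randn(2, 12, 256), torch.randn(2, 1, 256))
```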
Funding: Supported by the National Key Research and Development Program of China (No. 2018YFB1702601).
Abstract: In existing aspect-category sentiment analysis research, the aspects are mostly given in advance for sentiment extraction; this pipeline approach is prone to error accumulation, and graph convolutional networks used for aspect-category sentiment analysis do not fully exploit the dependency-type information between words, so feature extraction is not enhanced. This paper proposes an end-to-end aspect-category sentiment analysis (ETESA) model based on type graph convolutional networks. The model uses the Bidirectional Encoder Representations from Transformers (BERT) pretrained model to obtain aspect categories and word vectors containing contextual dynamic semantic information, which addresses polysemy. When a graph convolutional network (GCN) is used for feature extraction, fusing the word vectors with initialization tensors for the dependency types yields importance values for the different dependency types and enhances the text feature representation. By transforming aspect category and sentiment pair extraction into multiple single-label classification problems, aspect categories and sentiments can be extracted simultaneously in an end-to-end way, avoiding error accumulation. Experiments on three public datasets show that the ETESA model achieves higher precision, recall, and F1 scores, demonstrating its effectiveness.
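The following is a minimal sketch, under our own assumptions, of a graph convolution layer in which each dependency edge is weighted by a score derived from an embedding of its dependency type; the layer design and dimensions are illustrative, not the ETESA implementation.

```python
# Hedged sketch (assumptions about the mechanism, not the ETESA code): typed
# graph convolution where dependency-type embeddings determine edge importance.
import torch
import torch.nn as nn

class TypeGCNLayer(nn.Module):
    def __init__(self, dim, n_dep_types):
        super().__init__()
        self.type_emb = nn.Embedding(n_dep_types, dim)   # one vector per dependency type
        self.score = nn.Linear(2 * dim, 1)               # edge importance from word + type
        self.lin = nn.Linear(dim, dim)

    def forward(self, h, adj, dep_type):
        # h: (B, N, dim) word vectors; adj: (B, N, N) 0/1 dependency adjacency
        # dep_type: (B, N, N) dependency-type ids (0 where no edge)
        t = self.type_emb(dep_type)                              # (B, N, N, dim)
        pair = torch.cat([h.unsqueeze(2).expand_as(t), t], -1)   # word_i || type_ij
        w = torch.sigmoid(self.score(pair)).squeeze(-1) * adj    # typed edge weights
        msg = torch.bmm(w, h) / (w.sum(-1, keepdim=True) + 1e-6) # weighted neighbor mean
        return torch.relu(self.lin(msg) + h)                     # residual update

layer = TypeGCNLayer(dim=64, n_dep_types=40)
h = torch.randn(2, 10, 64)
adj = (torch.rand(2, 10, 10) > 0.7).float()
out = layer(h, adj, torch.randint(0, 40, (2, 10, 10)))
```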
Funding: This work was supported by the National Natural Science Foundation of China (No. 61976247).
Abstract: Aspect-based sentiment analysis (ABSA) consists of two subtasks: aspect term extraction and aspect sentiment prediction. Most methods handle the subtasks in a pipeline manner, which causes problems in both performance and real-world application. In this study, we propose an end-to-end ABSA model, SSi-LSi, which fuses syntactic structure information and lexical semantic information to address the limitation that existing end-to-end methods do not fully exploit the text. Through two network branches, the model extracts syntactic structure information and lexical semantic information, integrating part of speech, sememes, and context, respectively. Then, on the basis of an attention mechanism, the model fuses the syntactic structure information with the lexical semantic information to obtain higher-quality ABSA results, making full use of the text information. Experiments demonstrate that the SSi-LSi model has clear advantages in exploiting different kinds of text information.
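As one plausible reading of the fusion step, sketched under our own assumptions, the two branch outputs can be combined token by token with an attention-derived gate; the class name and gating form are hypothetical, not the SSi-LSi implementation.

```python
# Hedged sketch (not the SSi-LSi code): gated fusion of a syntactic-branch and
# a lexical-semantic-branch representation of the same sentence.
import torch
import torch.nn as nn

class BranchFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, syn, lex):
        # syn, lex: (B, T, dim) token-level outputs of the two branches
        g = torch.sigmoid(self.gate(torch.cat([syn, lex], dim=-1)))  # per-token gate
        return g * syn + (1 - g) * lex                               # fused representations

fused = BranchFusion(128)(torch.randn(2, 20, 128), torch.randn(2, 20, 128))
```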
Abstract: Multimodal sentiment analysis refers to analyzing emotions in information carriers that contain multiple modalities. To better model the features within and between modalities and to address incomplete multimodal feature fusion, this paper proposes a multimodal sentiment analysis model, MIF (Modal Interactive Feature Encoder for Multimodal Sentiment Analysis). First, the global features of three modalities are obtained through unimodal feature extraction networks. Second, an inter-modal interactive feature encoder and an intra-modal interactive feature encoder extract similarity features between modalities and modality-specific features, respectively. Finally, the modality-specific features and the inter-modal interaction information are decoded to obtain the fused features and predict sentiment polarity. We conduct extensive experiments on three public multimodal datasets, one in Chinese and two in English. The results show that our approach significantly improves performance over the benchmark models.
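The sketch below is one interpretation, under our own assumptions, of the inter-/intra-modal encoders: a shared encoder maps every modality into a common space (similarity features) while per-modality private encoders keep modality-specific features, and both are decoded jointly; names and dimensions are hypothetical, not the MIF code.

```python
# Hedged sketch (an interpretation, not the MIF implementation).
import torch
import torch.nn as nn

class InteractiveEncoders(nn.Module):
    def __init__(self, dims, d=128, n_classes=3):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(x, d) for x in dims])  # unimodal projections
        self.shared = nn.Sequential(nn.Linear(d, d), nn.ReLU())     # inter-modal encoder
        self.private = nn.ModuleList(                               # intra-modal encoders
            [nn.Sequential(nn.Linear(d, d), nn.ReLU()) for _ in dims])
        self.decoder = nn.Linear(2 * d * len(dims), n_classes)

    def forward(self, feats):                     # list of (B, d_i) modality features
        h = [torch.relu(p(x)) for p, x in zip(self.proj, feats)]
        shared = [self.shared(x) for x in h]                     # similarity features
        private = [enc(x) for enc, x in zip(self.private, h)]    # specific features
        return self.decoder(torch.cat(shared + private, dim=-1))

model = InteractiveEncoders([300, 74, 35])
logits = model([torch.randn(4, 300), torch.randn(4, 74), torch.randn(4, 35)])
```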
Abstract: Multimodal sentiment analysis (SA) is gaining popularity due to its broad application potential. Existing studies have focused on the SA of single modalities, such as texts or photos, which poses challenges for effectively handling social media data with multiple modalities. Moreover, most multimodal research has concentrated on merely combining the two modalities rather than exploring their complex correlations, leading to unsatisfactory sentiment classification results. Motivated by this, we propose a new visual-textual sentiment classification model named Multi-Model Fusion (MMF), which uses a mixed-fusion framework to capture the essential information in, and the intrinsic relationship between, the visual and textual content. The proposed model comprises three deep neural networks. Two separate networks extract the most emotionally relevant aspects of the image and text data, so that more discriminative features are gathered for accurate sentiment classification. A multichannel joint fusion model with a self-attention technique then exploits the intrinsic correlation between the visual and textual characteristics and obtains emotionally rich information for joint sentiment classification. Finally, the outputs of the three classifiers are integrated by a decision fusion scheme to improve the robustness and generalizability of the model. An interpretable visual-textual sentiment classification model is further developed with the Local Interpretable Model-agnostic Explanations (LIME) method to ensure explainability and resilience. The proposed MMF model has been tested on four real-world sentiment datasets, achieving 99.78% accuracy on Binary_Getty (BG), 99.12% on Binary_iStock (BIS), 95.70% on Twitter, and 79.06% on the Multi-View Sentiment Analysis (MVSA) dataset. These results demonstrate the superior performance of the MMF model compared with single-model approaches and current state-of-the-art techniques under the model evaluation criteria.
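As a minimal sketch of the final step, assuming a simple weighted vote rather than the authors' exact scheme, the decision fusion of the three classifiers could look like the following; the function name and weights are illustrative.

```python
# Hedged sketch (assumption about the scheme, not the MMF code): late decision
# fusion combining text-only, image-only, and joint classifier probabilities.
import torch
import torch.nn.functional as F

def decision_fusion(text_logits, image_logits, joint_logits, weights=(0.25, 0.25, 0.5)):
    """Each *_logits tensor is (B, n_classes); returns fused class probabilities."""
    probs = [F.softmax(l, dim=-1) for l in (text_logits, image_logits, joint_logits)]
    w = torch.tensor(weights, dtype=probs[0].dtype).view(-1, 1, 1)
    return (w * torch.stack(probs)).sum(dim=0)        # weighted average of the votes

fused = decision_fusion(torch.randn(4, 2), torch.randn(4, 2), torch.randn(4, 2))
pred = fused.argmax(dim=-1)
```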
Funding: Supported by the National Natural Science Foundation of China (No. 61976247).
Abstract: Aspect-based sentiment analysis (ABSA) consists of two subtasks: aspect term extraction and aspect sentiment prediction. Existing methods handle the subtasks one by one in a pipeline manner, which causes problems in both performance and real-world application. This study investigates end-to-end ABSA and proposes a novel multitask multiview network (MTMVN) architecture. Specifically, the architecture takes unified ABSA as the main task, with the two subtasks as auxiliary tasks. Meanwhile, the representation obtained from the branch network of the main task is regarded as the global view, whereas the representations of the two subtasks are treated as two local views with different emphases. Through multitask learning, the main task is supported by additional accurate aspect-boundary information and sentiment-polarity information. By strengthening the correlations between the views, in the spirit of multiview learning, the representation of the global view is optimized to improve the overall performance of the model. Experimental results on three benchmark datasets show that the proposed method outperforms existing pipeline and end-to-end methods, demonstrating the superiority of the MTMVN architecture.
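For illustration, the following is a minimal sketch, under our own assumptions, of multitask training with a shared encoder, a main head for unified ABSA tagging, and two auxiliary heads for aspect boundaries and sentiment polarity; the encoder choice, tag-set sizes, and loss weighting are hypothetical, not the MTMVN implementation.

```python
# Hedged sketch (illustrative only, not the MTMVN code): shared encoder with a
# main unified-tagging head and two auxiliary heads, trained with a weighted
# multitask loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultitaskABSA(nn.Module):
    def __init__(self, vocab, d=128, n_unified=7, n_boundary=3, n_polarity=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.encoder = nn.LSTM(d, d, batch_first=True, bidirectional=True)
        self.main = nn.Linear(2 * d, n_unified)      # global view: unified tags
        self.aux_b = nn.Linear(2 * d, n_boundary)    # local view: aspect boundaries
        self.aux_p = nn.Linear(2 * d, n_polarity)    # local view: sentiment polarity

    def forward(self, tokens):
        h, _ = self.encoder(self.emb(tokens))
        return self.main(h), self.aux_b(h), self.aux_p(h)

def multitask_loss(outputs, labels, alpha=0.5):
    main, aux_b, aux_p = outputs
    y_main, y_b, y_p = labels                         # per-token label tensors (B, T)
    loss = F.cross_entropy(main.transpose(1, 2), y_main)
    loss += alpha * (F.cross_entropy(aux_b.transpose(1, 2), y_b) +
                     F.cross_entropy(aux_p.transpose(1, 2), y_p))
    return loss

model = MultitaskABSA(vocab=10000)
tokens = torch.randint(0, 10000, (2, 16))
labels = (torch.randint(0, 7, (2, 16)), torch.randint(0, 3, (2, 16)),
          torch.randint(0, 4, (2, 16)))
loss = multitask_loss(model(tokens), labels)
```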