Abstract
Because microblog posts are written informally and use many non-standard words and expressions, identifying their sentiment orientation is difficult, especially when the sentiment classes in the data are imbalanced. This paper presents GWCRF (Gaussian Mixture Distribution Word2vec CRF), a method for sentiment analysis of imbalanced microblog data that combines pseudo-sample generation with a Conditional Random Field (CRF) model. First, a Gaussian mixture distribution model generates pseudo-samples for the minority classes in the training set, yielding a training set with a balanced sentiment distribution. Second, Word2vec is used to expand each microblog sentence and enrich its sentiment information, which mitigates the negative effect of an insufficiently large sentiment lexicon on classification. Finally, the CRF model is applied to the balanced and expanded training set. Experimental results on microblog data show that the method identifies sentiment orientation more effectively than existing methods when the sentiment distribution of the data set is imbalanced.
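The balancing step described in the abstract can be illustrated with a small sketch. The abstract does not state the feature representation, the number of mixture components, or the covariance structure, so everything below is an assumption made for illustration: minority-class samples are taken to be numeric feature vectors, the mixture uses diagonal covariances fitted with plain EM, and the function names (`fit_diag_gmm`, `sample_gmm`, `balance_minority`) are hypothetical rather than taken from the paper.

```python
import math
import random


def fit_diag_gmm(points, k, iters=30, seed=0, min_var=1e-3):
    """Fit a k-component diagonal-covariance Gaussian mixture with plain EM.

    Assumption: a diagonal covariance and a fixed iteration count are
    illustrative choices, not details given in the abstract.
    """
    rng = random.Random(seed)
    k = min(k, len(points))
    dim = len(points[0])
    means = [list(p) for p in rng.sample(points, k)]  # init from data points
    variances = [[1.0] * dim for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point (log-space).
        resp = []
        for x in points:
            logs = []
            for j in range(k):
                lp = math.log(weights[j])
                for d in range(dim):
                    v = variances[j][d]
                    lp -= 0.5 * (math.log(2 * math.pi * v)
                                 + (x[d] - means[j][d]) ** 2 / v)
                logs.append(lp)
            top = max(logs)
            exps = [math.exp(lp - top) for lp in logs]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = max(nj / len(points), 1e-9)  # keep log() defined
            if nj < 1e-9:
                continue  # empty component: leave its parameters unchanged
            for d in range(dim):
                mu = sum(r[j] * x[d] for r, x in zip(resp, points)) / nj
                var = sum(r[j] * (x[d] - mu) ** 2
                          for r, x in zip(resp, points)) / nj
                means[j][d] = mu
                variances[j][d] = max(var, min_var)
    return weights, means, variances


def sample_gmm(model, n, seed=0):
    """Draw n pseudo-samples: pick a component by weight, then sample from it."""
    weights, means, variances = model
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        j = rng.choices(range(len(weights)), weights=weights)[0]
        samples.append([rng.gauss(mu, math.sqrt(v))
                        for mu, v in zip(means[j], variances[j])])
    return samples


def balance_minority(majority, minority, k=2, seed=0):
    """Return the minority class padded with pseudo-samples to majority size."""
    deficit = len(majority) - len(minority)
    if deficit <= 0:
        return list(minority)
    model = fit_diag_gmm(minority, k, seed=seed)
    return list(minority) + sample_gmm(model, deficit, seed=seed)
```

Calling `balance_minority(majority, minority)` on two lists of feature vectors returns the original minority samples followed by enough generated vectors to match the majority count. In practice a library implementation such as a full-covariance mixture model would likely be preferable; the pure-Python version here only keeps the sketch dependency-free.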
Source
《广东工业大学学报》
CAS
2016, No. 6, pp. 85-90 (6 pages)
Journal of Guangdong University of Technology
Funding
National Natural Science Foundation of China (61472089, 61572143)
Keywords
sentiment analysis
Gaussian mixture distribution
conditional random field
sentiment orientation
imbalance
Word2vec