Abstract
Analyzing and monitoring the sentiment expressed in micro-blog texts helps mine user behavior and provides a reference for supervising micro-blog public opinion. However, micro-blog texts are short and non-standard and contain many variant spellings and new words, so methods that classify micro-blogs using sentiment words as the only features achieve low accuracy and fall short of practical needs. To address this, a bigram collocation lexicon is constructed from a micro-blog corpus, and PMI-IR-P, a method for computing the sentiment weight of collocations, is proposed based on the PMI-IR algorithm combined with corpus statistics. Using a sentiment dictionary as well, micro-blog sentiment feature vectors are generated by statistical methods, and the C4.5 algorithm is used to build a classification model that classifies the sentiment polarity of micro-blog texts. In the experiments, different data sets are used to construct the collocation lexicon and the classification model, and the approach is compared with a sentiment-dictionary-based classification method and the Naive Bayes method. The results show that, using the proposed sentiment features with the C4.5 algorithm, sentiment classification of micro-blog texts reaches an accuracy of 87%.
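The abstract names PMI-IR as the basis for weighting collocations but does not give the PMI-IR-P formula. As a rough illustration only, the sketch below computes the standard PMI-based semantic orientation that PMI-IR is known for, using co-occurrence counts over a toy corpus in place of search-engine hit counts. The function name `so_pmi`, the seed words, the smoothing constant, and the toy documents are all assumptions made for illustration; this is not the PMI-IR-P method itself.

```python
import math

def so_pmi(phrase, docs, pos_seed, neg_seed, smoothing=0.01):
    """PMI-style semantic orientation of `phrase` (illustrative sketch).

    docs: list of token sets, one per document (hypothetical toy corpus).
    pos_seed / neg_seed: assumed positive and negative reference words.
    smoothing: small constant that avoids division by zero and log(0).
    """
    def hits(*terms):
        # Number of documents that contain every given term.
        return sum(1 for d in docs if all(t in d for t in terms))

    pos_near = hits(phrase, pos_seed) + smoothing
    neg_near = hits(phrase, neg_seed) + smoothing
    pos_total = hits(pos_seed) + smoothing
    neg_total = hits(neg_seed) + smoothing
    # A positive score means the phrase co-occurs more with the positive seed.
    return math.log2((pos_near * neg_total) / (neg_near * pos_total))

# Toy corpus, invented for this example only.
docs = [
    {"service", "fast", "good"},
    {"service", "fast", "good"},
    {"service", "slow", "bad"},
    {"food", "good"},
    {"food", "bad"},
]

fast_score = so_pmi("fast", docs, "good", "bad")
slow_score = so_pmi("slow", docs, "good", "bad")
print(fast_score > 0, slow_score < 0)
```

In a pipeline like the one the abstract outlines, scores of this kind could serve as entries of a sentiment feature vector that is then passed to a decision-tree learner such as C4.5.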
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
2014, Issue 6, pp. 162-165 (4 pages)
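Funding
National Social Science Foundation of China (12BYY045)
Ministry of Education Humanities and Social Sciences Research Youth Fund (10YJCZH247)
Ministry of Education Humanities and Social Sciences Fund, General Project (09YJCZH019)
Ministry of Education Program for New Century Excellent Talents (NCET-12-0939)
Guangdong Provincial Science and Technology Plan Project (2010B031000014)
Guangdong University of Foreign Studies University-Level Fund (12Q22)
Guangdong University of Foreign Studies Graduate Research Innovation Fund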
Keywords
collocation dictionary
micro-blog sentimental feature
micro-blog sentimental classification
machine learning
C4.5 algorithm