Abstract
Subjective sentences in microblogs carry information about people's attitudes and inclinations toward things. The length limit of microblog posts and their free linguistic structure make such sentences hard to detect. Borrowing two kinds of features used in traditional text processing, part-of-speech features and sentiment-lexicon features, this paper uses the AdaBoost algorithm to select features and combine them into a composite classifier. For datasets in which only a small proportion of the data is labeled, a Bootstrapping process rebuilds the classifier iteratively to further improve its performance: the current classifier labels the most confident sentences in the unlabeled dataset, these sentences are added to the labeled dataset, and the classifier is retrained. Experimental results show that introducing Bootstrapping not only raises the F1-score of the classifier but also reduces the number of features it carries, so that both the precision and the speed of the ensemble classifier improve markedly.
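The Bootstrapping loop described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the weak learner, the confidence measure, or the stopping rule, so everything below is an assumption made for illustration: sentences are binary feature vectors (for example, presence of a part-of-speech tag or a sentiment-lexicon word), the weak learners are one-feature stumps, confidence is the normalized AdaBoost margin, and the names `train_adaboost`, `bootstrap`, and `threshold` are hypothetical.

```python
import math

def stump(x, j, pol):
    """Weak learner (assumed): vote `pol` if binary feature j is present, else -pol."""
    return pol if x[j] else -pol

def train_adaboost(X, y, rounds=10):
    """Discrete AdaBoost over one-feature stumps; labels are +1 (subjective) / -1."""
    n = len(X)
    w = [1.0 / n] * n
    model = []  # list of (alpha, feature_index, polarity)
    for _ in range(rounds):
        best = None
        for j in range(len(X[0])):
            for pol in (1, -1):
                err = sum(wi for wi, xi, yi in zip(w, X, y)
                          if stump(xi, j, pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, pol)
        err, j, pol = best
        if err >= 0.5:            # no stump beats chance: stop boosting
            break
        perfect = err <= 1e-10
        err = max(err, 1e-10)     # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, pol))
        if perfect:               # nothing left to reweight
            break
        w = [wi * math.exp(-alpha * yi * stump(xi, j, pol))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def score(model, x):
    """Normalized margin in [-1, 1]; its magnitude is used as confidence (assumed)."""
    total = sum(a for a, _, _ in model)
    if total == 0:
        return 0.0
    return sum(a * stump(x, j, pol) for a, j, pol in model) / total

def predict(model, x):
    return 1 if score(model, x) >= 0 else -1

def bootstrap(X_lab, y_lab, X_unlab, threshold=0.8, max_iters=5, rounds=10):
    """Self-training: label confident unlabeled sentences, add them, retrain."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    model = train_adaboost(X_lab, y_lab, rounds)
    for _ in range(max_iters):
        scored = [(x, score(model, x)) for x in pool]
        confident = [(x, s) for x, s in scored if abs(s) >= threshold]
        if not confident:         # nothing trustworthy left to add
            break
        for x, s in confident:
            X_lab.append(x)
            y_lab.append(1 if s > 0 else -1)
        pool = [x for x, s in scored if abs(s) < threshold]
        model = train_adaboost(X_lab, y_lab, rounds)
    return model, X_lab, y_lab
```

Because each stump uses exactly one feature, the number of distinct feature indices in `model` gives a rough count of the features the ensemble carries, which is the quantity the abstract reports as decreasing. The threshold of 0.8 is an arbitrary illustrative value, not one taken from the paper.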
Source
《计算机应用研究》
CSCD
PKU Core Journal (北大核心)
2014, No. 7, pp. 2035-2039 (5 pages)
Application Research of Computers
Funding
Major and Key Projects of the Fujian Provincial Science and Technology Program (2011H6016, 2011H0028)