Abstract
Spammers are widespread on microblogging platforms, and their abnormal behavior and the spam they produce significantly degrade the user experience. To improve detection accuracy, existing studies either define as many features as possible or keep proposing new classification methods. This raises several questions: should microblog anti-spam research prioritize finding classification features or improving classifiers? Do more features always yield better detection? Can new methods significantly improve detection performance? Taking Sina Weibo as an example, this paper addresses these questions through experiments that combine different feature selection methods with different classifiers on a real Sina Weibo dataset. The experimental results show that the choice of feature set matters more than improvements to the classifier, that features should be derived from multiple aspects, including text content, user behavior, and social relationships, and that more features do not necessarily yield better detection. These conclusions should help guide future breakthroughs in microblog anti-spam work.
Source
Journal on Communications (《通信学报》)
EI
CSCD
PKU Core Journal (北大核心)
2016, Issue 8, pp. 24-33 (10 pages)
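Funding
National Basic Research Program of China ("973" Program) (No. 2009CB320505)
National Key Technology R&D Program of China (No. 2008BAH37B05)
National Natural Science Foundation of China (No. 61170211, No. U1533104, No. 61301245)
Research Fund for the Doctoral Program of Higher Education of China, Ministry of Education (No. 20110002110056)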
Keywords
Sina Weibo, feature definition, feature selection, spammer detection