Abstract
To address the Web spam that floods blog communities and BBS forums, this paper proposes a relevancy coefficient vector space model, cVSM, whose components serve as comment features; a support vector machine classifier then identifies comment spam automatically. cVSM includes a relevancy coefficient suited to short texts, which measures the semantic relevance between a comment and the post it responds to. Experiments on a Chinese blog test set and a Chinese BBS test set show that, compared with a method that uses comment text features alone, applying the model improves F1 by at least 6%.
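The abstract does not define the relevancy coefficient itself, so the following is only a minimal sketch of the general idea of scoring how related a comment is to its post. It assumes cosine similarity over character bigrams as a stand-in measure (chosen here because it needs no Chinese word segmentation); the function names `char_bigrams` and `cosine_relevance` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Return a bag of character bigrams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def cosine_relevance(post, comment):
    """Cosine similarity between the bigram bags of a post and a comment.

    Assumption: this is a generic stand-in, not the cVSM relevancy coefficient.
    Returns 0.0 when either text is too short to yield a bigram.
    """
    p, c = char_bigrams(post), char_bigrams(comment)
    dot = sum(count * c[gram] for gram, count in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_c = math.sqrt(sum(v * v for v in c.values()))
    norm = norm_p * norm_c
    return dot / norm if norm else 0.0


post = "支持向量机可以用于文本分类"
on_topic = "支持向量机做文本分类效果不错"
off_topic = "低价代开发票请联系我"
scores = (cosine_relevance(post, on_topic), cosine_relevance(post, off_topic))
```

In a pipeline like the one the abstract outlines, a score of this kind would presumably be one component of the feature vector passed to the support vector machine, alongside features drawn from the comment text itself.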
Source
《计算机工程》
CAS
CSCD
PKU Core (北大核心)
2009, No. 6, pp. 88-90, 96 (4 pages in total)
Computer Engineering
Funding
Supported by the Changsha University Scientific Research Fund (CDJJ-07010110)
Keywords
blog
comment spam
support vector machine
text mining
relevancy coefficient