Abstract
Web spam not only degrades the quality of information retrieval but also poses a serious challenge to Internet security. This paper proposes a Web spam detection method based on a Bagging-SVM ensemble classifier. In the preprocessing stage, a K-means-based technique is first applied to address the class imbalance of the dataset, the CFS feature selection method is then used to select an optimal feature subset, and finally the selected features are discretized using information entropy. In the classifier training stage, Bagging is used to construct multiple training sets, and an SVM is trained on each of them to produce a weak classifier. In the detection stage, the class of each test sample is decided by a vote among the weak classifiers. Experimental results on the WEBSPAM-UK2006 dataset show that the proposed method achieves very good detection performance while using a relatively small number of features.
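The training and detection stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the SVM kernel, the trainer, or the number of weak classifiers, so the sketch assumes a linear SVM trained by stochastic sub-gradient descent on the hinge loss, 11 bootstrap replicates, and labels of +1 (spam) and -1 (normal). All function names and the toy data are hypothetical, and the K-means, CFS, and discretization steps are omitted.

```python
import random

def train_linear_svm(X, y, lam=0.01, eta0=0.1, epochs=50, seed=0):
    """Train a linear SVM by SGD on the L2-regularized hinge loss (assumed trainer)."""
    rng = random.Random(seed)
    w, b, t = [0.0] * len(X[0]), 0.0, 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = eta0 / (1.0 + eta0 * lam * t)  # decaying step size
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularization shrinkage
            if margin < 1.0:  # sample violates the margin: hinge-loss step
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b

def svm_predict(model, x):
    """Return +1 or -1 according to the side of the separating hyperplane."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

def bagging_fit(X, y, n_models=11, seed=0):
    """Draw n_models bootstrap samples and train one weak SVM on each."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for m in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(train_linear_svm([X[i] for i in idx],
                                       [y[i] for i in idx], seed=seed + m))
    return models

def majority_vote(votes):
    """Combine +1/-1 votes; an odd number of voters avoids ties."""
    return 1 if sum(votes) > 0 else -1

def bagging_predict(models, x):
    """Classify x by majority vote over the weak classifiers."""
    return majority_vote([svm_predict(m, x) for m in models])

# Toy, linearly separable data standing in for preprocessed page features.
pos = [(1.5 + 0.1 * (i % 5), 1.5 + 0.1 * (i // 5)) for i in range(20)]
neg = [(-a, -b) for a, b in pos]
X = pos + neg
y = [1] * 20 + [-1] * 20
models = bagging_fit(X, y)
```

Calling `bagging_predict(models, x)` on a feature vector `x` returns the ensemble's label. An odd ensemble size is chosen here only so that the vote cannot tie.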
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journals
2015, No. 1, pp. 239-243 (5 pages)
Funding
Supported by the Sichuan Province Training Fund for Reserve Candidates of Academic and Technical Leaders (X800912371309)
Keywords
Web spam
Ensemble classifier
Feature selection
Information entropy
Weak classifier